Monday, 21 December 2020

groupby on a sparse DF is taking a long time: what can I do to speed it up?

I have a DataFrame with many fields (229,350 rows × 14 columns):

Name    Assets_%    Fund_Code   Fund_Exchange   UpdatedAt   Code    Exchange    Sector  Industry    Country Region  Owned   Change  Weight
0   MAGNETITE XIX LTD 2017-19 1.39775% 07/17/2030   5.02    AAA NYSE ARCA   2020-12-15  NaN NaN NaN NaN NaN NaN NaN NaN NaN

When I try to do one-hot encoding it's very fast (less than 10 seconds); note that if sparse is False, the process crashes due to lack of memory:

one_hot = pd.get_dummies(holdingsFullDF, columns=['Fund_Code'], sparse=True)
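
Aside: a quick toy check (not my real data, just a stand-in to see what sparse=True gives back) shows the dummy columns come back with a Sparse dtype, so only the non-zero entries are stored:

import pandas as pd

# Toy stand-in for holdingsFullDF, just to inspect the sparse output.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'A', 'C'],
    'Fund_Code': ['X', 'X', 'Y', 'Y'],
})
toy_hot = pd.get_dummies(toy, columns=['Fund_Code'], sparse=True)
print(toy_hot.dtypes)                    # dummy columns are Sparse[uint8, 0] on pandas 1.x
print(toy_hot.memory_usage(deep=True))   # only stored (non-zero) values are counted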

But each name has multiple entries in my dataset, and I want to see, for every name, all the categories it's in, so I added a groupby to the end of my command:

one_hot = pd.get_dummies(holdingsFullDF, columns=['Fund_Code'], sparse=True).groupby(['Name'], as_index=False).sum()

It's been running for over an hour with the groupby added (running the groupby as a separate step after building one_hot is just as slow).

Is there another approach I can take to get one-hot encoding that's aggregated by Name on a large dataset?
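
One idea I haven't fully verified (a sketch that assumes the Fund_Code_* dummy columns are the only sparse ones, since DataFrame.sparse.to_coo() needs every selected column to be sparse): treat the groupby-sum as a sparse matrix product with scipy instead of pandas:

import numpy as np
from scipy import sparse

# Pull the sparse dummy block out as one scipy CSR matrix (rows keep their order).
dummy_cols = one_hot.columns[one_hot.columns.str.startswith('Fund_Code_')]
X = one_hot[dummy_cols].sparse.to_coo().tocsr()

# Group-indicator matrix G: G[g, i] = 1 when row i belongs to the g-th Name,
# so G @ X sums the dummy rows per name in a single sparse multiplication.
codes = one_hot['Name'].astype('category').cat.codes.to_numpy()
G = sparse.coo_matrix(
    (np.ones(len(codes), dtype=np.int64), (codes, np.arange(len(codes)))),
    shape=(codes.max() + 1, len(codes)),
).tocsr()
grouped = G @ X   # one row per unique Name, one column per fund

The row labels come back from one_hot['Name'].astype('category').cat.categories, in the same order as the codes, and pd.DataFrame.sparse.from_spmatrix can wrap grouped back into a DataFrame if pandas output is needed.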

Update: As requested, here is the dataset (after I one-hot encode it using sparse=True):

Code    Fund_Code_AADR  Fund_Code_AAXJ  Fund_Code_ABEQ  Fund_Code_ACSI  Fund_Code_ACVF  Fund_Code_ACWD  Fund_Code_ACWF  Fund_Code_ACWI  Fund_Code_ACWV  ... Fund_Code_ZWC   Fund_Code_ZWEU  Fund_Code_ZWG   Fund_Code_ZWH   Fund_Code_ZWH-U Fund_Code_ZWK   Fund_Code_ZWP   Fund_Code_ZWS   Fund_Code_ZWU   Fund_Code_ZZZD
29  GFI 1   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
30  MELI    1   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
31  NICE    1   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
32  ARGX    1   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
33  TAL 1   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
... ... 

My original holdings DataFrame:

Fund_Code   Name
0   AAA MAGNETITE XIX LTD 2017-19 1.39775% 07/17/2030
1   AAA REGATTA II FDG L.P 2013-R2 1.48688% 01/16/2029
2   AAA VOYA CLO 2015-1R 1.17175% 01/18/2029
3   AAA DRYDEN XXVI SR LN FD 2013-26R A-R 1.13688% 04/...
4   AAA DRYDEN CLO LTD 2018-58 1.21775% 07/17/2031
... ... ...
6301390 ZVNIX   Veeva Systems Inc Class A
6301391 ZVNIX   Chegg Inc
6301392 ZVNIX   Seattle Genetics I
6301393 ZVNIX   CoStar Group Inc
6301394 ZVNIX   Chewy Inc

Update 2: If it helps, I saw something I thought was promising: pd.crosstab(holdingsFullDF['Name'], holdingsFullDF['Fund_Code']). In other questions it seems to do exactly what I want (one-hot encode, then tie it all back to each unique company in the 'Name' column), but it exhausts the available memory on my system (25 GB).
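
A sparse equivalent I could try (a sketch of my own, assuming only the Name and Fund_Code columns are needed): count the (Name, Fund_Code) pairs straight into a scipy COO matrix; duplicate entries are summed for free when it's converted to CSR, so no dense table is ever materialised:

import numpy as np
import pandas as pd
from scipy import sparse

names = holdingsFullDF['Name'].astype('category')
funds = holdingsFullDF['Fund_Code'].astype('category')

# One entry per holdings row; duplicate (name, fund) pairs are summed
# automatically when the COO matrix is converted to CSR.
counts = sparse.coo_matrix(
    (np.ones(len(holdingsFullDF), dtype=np.int32),
     (names.cat.codes, funds.cat.codes)),
    shape=(len(names.cat.categories), len(funds.cat.categories)),
).tocsr()

# Same table as pd.crosstab(Name, Fund_Code), but sparse end to end.
crosstab = pd.DataFrame.sparse.from_spmatrix(
    counts, index=names.cat.categories, columns=funds.cat.categories)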

Update 3: I have a bad way to get the results I want (bad because SO taught me not to use for loops on DataFrames, and it has been running for an hour with only 5% of the unique values of 'Name' processed):

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

one_hot = pd.get_dummies(holdingsFullDF, columns=['Fund_Code'], sparse=True)


def compressOneHotEncoding(name):
    # Scan the whole frame for this name's rows, then sum them
    # via a single-group groupby.
    tempOnehot = one_hot.loc[one_hot['Name'] == name]
    return tempOnehot.groupby('Name').sum()


listOfGroupedDataSet = []
with ProcessPoolExecutor(max_workers=30) as executor:
    for tickerResults in executor.map(compressOneHotEncoding, holdingsFullDF['Name'].unique()):
        listOfGroupedDataSet.append(tickerResults)

FullMergeDF = pd.concat(listOfGroupedDataSet)
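
Even keeping this loop, most of the cost looks like the full scan one_hot['Name'] == name does for every single name, i.e. O(unique names × rows) overall. A possible tweak (an untested sketch, same output assumed) is to compute each name's row positions once up front:

# Factorise once: dict mapping each Name to the integer positions of its rows.
positions = one_hot.groupby('Name').indices

def compressOneHotEncoding(name):
    # Slice by the precomputed positions instead of rescanning every row.
    return one_hot.iloc[positions[name]].groupby('Name').sum()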


