I have a DF with many fields (229350 rows × 14 columns):
Name Assets_% Fund_Code Fund_Exchange UpdatedAt Code Exchange Sector Industry Country Region Owned Change Weight
0 MAGNETITE XIX LTD 2017-19 1.39775% 07/17/2030 5.02 AAA NYSE ARCA 2020-12-15 NaN NaN NaN NaN NaN NaN NaN NaN NaN
When I do the one-hot encoding on its own it's very fast (less than 10 seconds). Note that if sparse is False, the process crashes due to lack of memory:
one_hot = pd.get_dummies(holdingsFullDF, columns=['Fund_Code'], sparse=True)
But each name has multiple entries in my dataset, and I want to see, for every unique name, all the categories it's in, so I added a groupby to the end of my command:
one_hot = pd.get_dummies(holdingsFullDF, columns=['Fund_Code'], sparse=True).groupby(['Name'], as_index=False).sum()
It's been running for over an hour with the groupby (when I group by after doing the one-hot encoding separately, the result is the same).
Is there another approach I can take to get one-hot encoding that's aggregated on a large dataset?
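For concreteness, here is a toy example of the transformation I'm after: one row per unique name, with a 1 for every fund it appears in (the names and codes here are made up):

import pandas as pd

toy = pd.DataFrame({
    'Name': ['AAPL', 'AAPL', 'MSFT'],
    'Fund_Code': ['AAA', 'BBB', 'AAA'],
})
wanted = (pd.get_dummies(toy, columns=['Fund_Code'])
          .groupby('Name', as_index=False)
          .sum())
#    Name  Fund_Code_AAA  Fund_Code_BBB
# 0  AAPL              1              1
# 1  MSFT              1              0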
Update: As requested, here is the dataset (after I one-hot encode it with sparse=True):
Code Fund_Code_AADR Fund_Code_AAXJ Fund_Code_ABEQ Fund_Code_ACSI Fund_Code_ACVF Fund_Code_ACWD Fund_Code_ACWF Fund_Code_ACWI Fund_Code_ACWV ... Fund_Code_ZWC Fund_Code_ZWEU Fund_Code_ZWG Fund_Code_ZWH Fund_Code_ZWH-U Fund_Code_ZWK Fund_Code_ZWP Fund_Code_ZWS Fund_Code_ZWU Fund_Code_ZZZD
29 GFI 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
30 MELI 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
31 NICE 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
32 ARGX 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
33 TAL 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ...
My original holdings dataframe:
Fund_Code Name
0 AAA MAGNETITE XIX LTD 2017-19 1.39775% 07/17/2030
1 AAA REGATTA II FDG L.P 2013-R2 1.48688% 01/16/2029
2 AAA VOYA CLO 2015-1R 1.17175% 01/18/2029
3 AAA DRYDEN XXVI SR LN FD 2013-26R A-R 1.13688% 04/...
4 AAA DRYDEN CLO LTD 2018-58 1.21775% 07/17/2031
... ... ...
6301390 ZVNIX Veeva Systems Inc Class A
6301391 ZVNIX Chegg Inc
6301392 ZVNIX Seattle Genetics I
6301393 ZVNIX CoStar Group Inc
6301394 ZVNIX Chewy Inc
Update 2: If it helps, I saw something I thought was promising: pd.crosstab(holdingsFullDF['Name'], holdingsFullDF['Fund_Code']). In other questions it seems to do exactly what I want (one-hot encode, then tie everything back to each unique company in the Name column), but it exhausts the available memory on my system (25 GB).
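One idea along the same lines that I haven't benchmarked (a sketch, assuming scipy is available; counts and crosstab_like are just illustrative names): build the same Name × Fund_Code counts directly as a scipy sparse matrix via pd.factorize, so the dense table is never materialized:

import numpy as np
import pandas as pd
from scipy import sparse

# Integer codes for each distinct Name (rows) and Fund_Code (columns).
name_codes, name_labels = pd.factorize(holdingsFullDF['Name'])
fund_codes, fund_labels = pd.factorize(holdingsFullDF['Fund_Code'])

# One 1 per original row; duplicate (Name, Fund_Code) pairs get summed
# when converting to CSR, which is exactly what crosstab counts.
counts = sparse.coo_matrix(
    (np.ones(len(holdingsFullDF), dtype=np.int32), (name_codes, fund_codes)),
    shape=(len(name_labels), len(fund_labels)),
).tocsr()

# Optionally wrap it back into a sparse-backed DataFrame.
crosstab_like = pd.DataFrame.sparse.from_spmatrix(
    counts, index=name_labels, columns=fund_labels)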
Update 3: I have a bad way to get the results I want (bad because SO taught me not to use for loops on dataframes, and it has been running for an hour with only 5% of the unique values of Name processed):
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

one_hot = pd.get_dummies(holdingsFullDF, columns=['Fund_Code'], sparse=True)

def compressOneHotEncoding(name):
    # Collapse all of this name's rows into a single summed row.
    tempOnehot = one_hot.loc[one_hot['Name'] == name]
    return tempOnehot.groupby('Name').sum()

listOfGroupedDataSet = []
with ProcessPoolExecutor(max_workers=30) as executor:
    for tickerResults in executor.map(compressOneHotEncoding, holdingsFullDF['Name'].unique()):
        listOfGroupedDataSet.append(tickerResults)

FullMergeDF = pd.concat(listOfGroupedDataSet)
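Along the same lines, here is a loop-free sketch of the aggregation itself (untested at full scale; dummy_cols, G, and grouped are illustrative names, and I'm assuming the sparse dummy columns are numeric so .sparse.to_coo() works): map every row to its unique Name with an indicator matrix, then do one sparse matrix product instead of fanning out to 30 worker processes:

import numpy as np
import pandas as pd
from scipy import sparse

# The dummy columns produced by get_dummies(sparse=True).
dummy_cols = [c for c in one_hot.columns if c.startswith('Fund_Code_')]
X = sparse.csr_matrix(one_hot[dummy_cols].sparse.to_coo())  # rows x dummies

# G[i, j] == 1 iff row j belongs to unique name i, so G @ X sums each
# name's rows in a single sparse product.
name_codes, name_labels = pd.factorize(one_hot['Name'])
G = sparse.csr_matrix(
    (np.ones(len(name_codes), dtype=np.int32),
     (name_codes, np.arange(len(name_codes)))),
    shape=(len(name_labels), len(name_codes)))

grouped = pd.DataFrame.sparse.from_spmatrix(
    G @ X, index=name_labels, columns=dummy_cols)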
The groupby on the sparse DF is what's taking so long. What can I do to speed it up?