My problem
I'm having trouble with the performance of resample function in combination with a groupby. The operation I'm doing is currently taking 8+ seconds on a data sample of 5000 rows which is totally unreasonable for my requirements.
Sample data (500 rows)
Pastebin with data as dict: https://pastebin.com/RPNdhXsy
The logic
I have data with dates in a quarterly interval which I want to group by a column and then resample the dates within the groups on a monthly basis.
Input:
isin report_date val
SE001 2018-12-31 1
SE001 2018-09-30 2
SE001 2018-06-31 3
US001 2018-10-31 4
US001 2018-07-31 5
Output:
isin report_date val
SE001 2018-12-31 1
2018-11-30 NaN
2018-10-31 NaN
2018-09-30 2
2018-08-31 NaN
2018-07-31 NaN
2018-06-30 3
US001 2018-10-30 4
2018-09-31 NaN
2018-08-31 NaN
2018-07-31 5
I used to have this operation:
df.groupby('isin').resample('M', on="report_date").first()[::-1]
Since it seems that asfreq()
has slightly better performance than using on=
in resample
, I instead do the following currently. It's still slow though. I reverse since resample
seems to non-optionally sort dates descending.
df.set_index('report_date').groupby('isin').resample('M').asfreq()[::-1]
As stated, with 5000 rows and around 16 columns this takes 15 seconds to run since I need to do it on two separate dataframes. With the sample data in the pastebin (500 rows) the operation takes me 0.7s which is way too long for me since my final data will have 800k rows.
EDIT: Timing of the different operations
Current way
setindex --- 0.001055002212524414 seconds ---
groupby --- 0.00033092498779296875 seconds ---
resample --- 0.004662036895751953 seconds ---
asfreq --- 0.8990700244903564 seconds ---
[::-1] --- 0.0013098716735839844 seconds ---
= 0.9056s
Old way
groupby --- 0.0005779266357421875 seconds ---
resample --- 0.0044629573822021484 seconds ---
first --- 1.6829369068145752 seconds ---
[::-1] --- 0.001600027084350586 seconds ---
= 1.6894s
Judging by this, it seems that converting from the pandas.core.resample.DatetimeIndexResamplerGroupby
to a df is taking very long. Now what?
EDIT2: Using reindex
df.set_index('report_date').groupby('isin').apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='M'), fill_value=0))[::-1]
This takes 0.28s which is a vast improvement. Still not very good though.
How can I speed this up? Is there another way to do the same thing?
from Pandas groupby resample poor performance
No comments:
Post a Comment