Friday, 16 October 2020

Pandas groupby resample poor performance

My problem

I'm having trouble with the performance of resample function in combination with a groupby. The operation I'm doing is currently taking 8+ seconds on a data sample of 5000 rows which is totally unreasonable for my requirements.

Sample data (500 rows)

Pastebin with data as dict: https://pastebin.com/RPNdhXsy


The logic

I have data with dates in a quarterly interval which I want to group by a column and then resample the dates within the groups on a monthly basis.

Input:
     isin  report_date   val
    SE001   2018-12-31     1
    SE001   2018-09-30     2
    SE001   2018-06-31     3
    US001   2018-10-31     4
    US001   2018-07-31     5

Output:
    isin   report_date      val        
    SE001   2018-12-31        1
            2018-11-30      NaN
            2018-10-31      NaN
            2018-09-30        2
            2018-08-31      NaN
            2018-07-31      NaN
            2018-06-30        3
    US001   2018-10-30        4    
            2018-09-31      NaN
            2018-08-31      NaN
            2018-07-31        5

I used to have this operation:

df.groupby('isin').resample('M', on="report_date").first()[::-1]

Since it seems that asfreq() has slightly better performance than using on= in resample, I instead do the following currently. It's still slow though. I reverse since resample seems to non-optionally sort dates descending.

df.set_index('report_date').groupby('isin').resample('M').asfreq()[::-1]

As stated, with 5000 rows and around 16 columns this takes 15 seconds to run since I need to do it on two separate dataframes. With the sample data in the pastebin (500 rows) the operation takes me 0.7s which is way too long for me since my final data will have 800k rows.

EDIT: Timing of the different operations

Current way

setindex --- 0.001055002212524414 seconds ---
groupby --- 0.00033092498779296875 seconds ---
resample --- 0.004662036895751953 seconds ---
asfreq --- 0.8990700244903564 seconds ---
[::-1] --- 0.0013098716735839844 seconds ---
= 0.9056s

Old way

groupby --- 0.0005779266357421875 seconds ---
resample --- 0.0044629573822021484 seconds ---
first --- 1.6829369068145752 seconds ---
[::-1] --- 0.001600027084350586 seconds ---
= 1.6894s

Judging by this, it seems that converting from the pandas.core.resample.DatetimeIndexResamplerGroupby to a df is taking very long. Now what?

EDIT2: Using reindex

df.set_index('report_date').groupby('isin').apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='M'), fill_value=0))[::-1]

This takes 0.28s which is a vast improvement. Still not very good though.


How can I speed this up? Is there another way to do the same thing?



from Pandas groupby resample poor performance

No comments:

Post a Comment