Saturday, 21 August 2021

Efficient way of Groupby Roll up/down for any kind of aggregates

TL;DR: How can we achieve something similar to Group By Roll Up with any kind of aggregates in pandas? (Credit to @Scott Boston for this term)

I have following dataframe:

       P   Q  R     S  T
0   PLAC  NR  F   HOL  F
1   PLAC  NR  F  NHOL  F
2   TRTB  NR  M  NHOL  M
3   PLAC  NR  M  NHOL  M
4   PLAC  NR  F  NHOL  F
5   PLAC   R  M  NHOL  M
6   TRTA   R  F   HOL  F
7   TRTA  NR  F   HOL  F
8   TRTB  NR  F  NHOL  F
9   PLAC  NR  F  NHOL  F
10  TRTB  NR  F  NHOL  F
11  TRTB  NR  M  NHOL  M
12  TRTA  NR  F   HOL  F
13  PLAC  NR  F   HOL  F
14  PLAC   R  F  NHOL  F

For a list of columns ['Q', 'R', 'S', 'T'], I want to calculate some aggregates on P column on following 4 list of grouping columns:

  1. ['Q']
  2. ['Q', 'R']
  3. ['Q', 'R', 'S']
  4. ['Q', 'R', 'S', 'T']

I've already written the code to group above dataframes in an increasing number of columns, and calculate the aggregate (using count for the shake of simplicity) on each of the groupby object, and finally concatenate them:

cols = list('QRST')
aggCol = 'P'
groupCols = []
result = []
for col in cols:
    groupCols.append(col)
    result.append(df.groupby(groupCols)[aggCol].agg(count='count').reset_index())
result = pd.concat(result)[groupCols+['count']]

However, I've strong feeling that above method is not so efficient in terms of CPU time. Is there a more efficient way to apply aggregate on such continuously increasing number of columns for grouping?

Why I think it is not so efficient is because: For above values, in first iteration, it groups the dataframe on Q column then calculates aggregate. Then in next iteration it groups the dataframe on Q and R, that means it again needs to group it by Q then R, but it was already grouped by Q in the first iteration, so the same operation is repeating. If there is some way to utilize the previously created groups, I think it'll be efficient.

OUTPUT:

    Q    R     S    T  count
0  NR  NaN   NaN  NaN     12
1   R  NaN   NaN  NaN      3
0  NR    F   NaN  NaN      9
1  NR    M   NaN  NaN      3
2   R    F   NaN  NaN      2
3   R    M   NaN  NaN      1
0  NR    F   HOL  NaN      4
1  NR    F  NHOL  NaN      5
2  NR    M  NHOL  NaN      3
3   R    F   HOL  NaN      1
4   R    F  NHOL  NaN      1
5   R    M  NHOL  NaN      1
0  NR    F   HOL    F      4
1  NR    F  NHOL    F      5
2  NR    M  NHOL    M      3
3   R    F   HOL    F      1
4   R    F  NHOL    F      1
5   R    M  NHOL    M      1

I already looked into Is there an equivalent of SQL GROUP BY ROLLUP in Python pandas? and Pandas Pivot tables row subtotals, they don't work in my case, I already tried them:

pd.pivot_table(df, aggCol, columns=cols, aggfunc='count', margins=True).T.reset_index()
    Q    R     S  T  P
0  NR    F   HOL  F  4
1  NR    F  NHOL  F  5
2  NR    M  NHOL  M  3
3  NR  All           3
4   R    F   HOL  F  1
5   R    F  NHOL  F  1
6   R    M  NHOL  M  1
7   R  All           3


from Efficient way of Groupby Roll up/down for any kind of aggregates

No comments:

Post a Comment