TL;DR: How can we achieve something similar to Group By Roll Up with any kind of aggregates in pandas? (Credit to @Scott Boston for this term)
I have following dataframe:
P Q R S T
0 PLAC NR F HOL F
1 PLAC NR F NHOL F
2 TRTB NR M NHOL M
3 PLAC NR M NHOL M
4 PLAC NR F NHOL F
5 PLAC R M NHOL M
6 TRTA R F HOL F
7 TRTA NR F HOL F
8 TRTB NR F NHOL F
9 PLAC NR F NHOL F
10 TRTB NR F NHOL F
11 TRTB NR M NHOL M
12 TRTA NR F HOL F
13 PLAC NR F HOL F
14 PLAC R F NHOL F
For a list of columns ['Q', 'R', 'S', 'T']
, I want to calculate some aggregates on P
column on following 4 list of grouping columns:
['Q']
['Q', 'R']
['Q', 'R', 'S']
['Q', 'R', 'S', 'T']
I've already written the code to group above dataframes in an increasing number of columns, and calculate the aggregate (using count
for the shake of simplicity) on each of the groupby object, and finally concatenate them:
cols = list('QRST')
aggCol = 'P'
groupCols = []
result = []
for col in cols:
groupCols.append(col)
result.append(df.groupby(groupCols)[aggCol].agg(count='count').reset_index())
result = pd.concat(result)[groupCols+['count']]
However, I've strong feeling that above method is not so efficient in terms of CPU time. Is there a more efficient way to apply aggregate on such continuously increasing number of columns for grouping?
Why I think it is not so efficient is because: For above values, in first iteration, it groups the dataframe on Q
column then calculates aggregate. Then in next iteration it groups the dataframe on Q
and R
, that means it again needs to group it by Q
then R
, but it was already grouped by Q
in the first iteration, so the same operation is repeating. If there is some way to utilize the previously created groups, I think it'll be efficient.
OUTPUT:
Q R S T count
0 NR NaN NaN NaN 12
1 R NaN NaN NaN 3
0 NR F NaN NaN 9
1 NR M NaN NaN 3
2 R F NaN NaN 2
3 R M NaN NaN 1
0 NR F HOL NaN 4
1 NR F NHOL NaN 5
2 NR M NHOL NaN 3
3 R F HOL NaN 1
4 R F NHOL NaN 1
5 R M NHOL NaN 1
0 NR F HOL F 4
1 NR F NHOL F 5
2 NR M NHOL M 3
3 R F HOL F 1
4 R F NHOL F 1
5 R M NHOL M 1
I already looked into Is there an equivalent of SQL GROUP BY ROLLUP in Python pandas? and Pandas Pivot tables row subtotals, they don't work in my case, I already tried them:
pd.pivot_table(df, aggCol, columns=cols, aggfunc='count', margins=True).T.reset_index()
Q R S T P
0 NR F HOL F 4
1 NR F NHOL F 5
2 NR M NHOL M 3
3 NR All 3
4 R F HOL F 1
5 R F NHOL F 1
6 R M NHOL M 1
7 R All 3
from Efficient way of Groupby Roll up/down for any kind of aggregates
No comments:
Post a Comment