Tuesday, 1 August 2023

Vectorized sum and product with list of pandas data frames

I have a list of list of Data Frames, each corresponding to a different time period t from 0 to N. Each data frame has multiple types, i need to preform the finally for each type in the data frame.

An example data set would be as follows, i made each df in the list the same values for simplicity but the calculation would remain the same.

d = {'type': ['a', 'b', 'c'], 'x': [1, 2, 3], 'y':[3,4,5]}
df = pd.DataFrame(data=d)

l = []
window = 20
[l.append(df.copy(deep=True)) for i in range(window)]

I need to compute a vectored sum $$\sum$$ and product $$\prod$$ using the above list of dataframes for each type (a, b, c) in an efficient manner. e.g. for each calculation below I need to filter by a single type df[df['type'] == 'a'] for every df in the list.

enter image description here

If i use a df.groupby('type') on every df in the list it could be very slow using a larger data set. The same goes if i use nested for loops for the sum and the product and filtering by type in each iteration of the for loop. How can I compute this sum product in an efficient manner ?

Update

One possible way as suggested in the comments is below, however this will start to be very inefficient if I use a larger dataset or window:

import pandas as pd
import math

d = {'type': ['a', 'b', 'c'], 'x': [1, 2, 3], 'y':[3, 4, 5]}
df = pd.DataFrame(data=d)

l = []
window = 20
for i in range(window):
    df['window'] = i
    l.append(df.copy(deep=True))
df = pd.concat(l)

sums = {}
types = df['type'].unique()
for s in types:
    sums[s] = 0
    tdf = df[df['type'] == s]
    for i in range(window):
        sums[s] += tdf[tdf['window'] == i]['x'].values[0] * math.prod(2 - tdf[tdf['window'] == i+1]['x'].values[0] for i in range(0, window-1)) * tdf[tdf['window'] == i]['y'].values[0]


from Vectorized sum and product with list of pandas data frames

No comments:

Post a Comment