Tuesday, 24 August 2021

Shift, concat and divide over 90 times on a 250MB dataframe efficiently in pandas

While trying to transform a simple dataset, I'm running into performance and memory-consumption issues with pandas. Basically, I'm trying to shift a dataset, grouped by index, 90 times, and then divide the values of the shifted columns.

My base data frame is about 255MB in memory (around 11M rows and 5 columns); it is loaded from parquet and has a MultiIndex. Shifting and concatenating this much data with the methods I tried causes either ridiculous memory consumption (I assume because of the way pandas manages copies of the data frame along the way) or impractically long run times.
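For reference, this is roughly how the frame is loaded and how I check its footprint (the parquet file name here is just a placeholder):

import pandas as pd

# Placeholder file name; the real frame is ~11M rows x 5 columns,
# indexed by a MultiIndex.
data = pd.read_parquet('mydata.parquet')

print(data.memory_usage(deep=True).sum() / 1024 ** 2)  # in-memory size, MB
print(data.index.nlevels)                              # number of index levels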

Even when my machine isn't memory-bound, the operation itself takes so long that I consider it impractical. I expect and assume this can be done in a matter of minutes. Am I wrong?

To be more specific, here is what I'm trying to perform:

  1. For each column in the data frame, create a new column with the data shifted X days back, for X in range(1, 91) (i.e. 90 shifts);

  2. Divide the value of each shifted column by the previous shift of the same column (so columnA.shift(90) is divided by columnA.shift(89));

  3. Concatenate everything into one data frame.

I find it a bit difficult to explain it purely in words, so here is an example of the data manipulation:

This is the original dataset (only 2 columns for the example):

import pandas as pd

data = pd.DataFrame(
    [[20, 43],
     [40, 52],
     [30, 34],
     [60, 52],
     [44, 66],
     [55, 72],
     [34, 34],
     [51, 17]],
    columns=['a', 'b'])

After the shifts, the frame gains one shifted copy of each column per window: a_1, a_2, ... hold column a shifted back by 1, 2, ... rows, and likewise b_1, b_2, ... for b.

After the division of the values, each shifted column instead holds the ratio of two consecutive shifts, i.e. a_w becomes a.shift(w) / a.shift(w - 1).
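Expressed as code on the example frame, using three shifts instead of 90 to keep the output small, the manipulation I'm after looks like this:

# Shift-and-divide on the small example frame, 3 windows for brevity.
parts = [data]
for w in range(1, 4):
    ratios = data.shift(w) / data.shift(w - 1)
    parts.append(ratios.add_suffix(f'_{w}'))

result = pd.concat(parts, axis=1)
print(result)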

Below are some code snippets I tried.

Concatenating all shifted columns in a loop (this is without the divide operation). This is actually wrong, as I forgot to perform the groupby() here:

for w in range(1, 91):
    # Re-assigning the result of pd.concat() on every pass copies the
    # whole (growing) frame each iteration, which is where the memory
    # blow-up comes from.
    data = pd.concat([
        data,
        data[['c1', 'c2', 'c3', 'c4', 'c5']].shift(w).rename(
            columns={c: f'{c}_{w}' for c in ['c1', 'c2', 'c3', 'c4', 'c5']}
        )
    ], axis=1)
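My understanding is that appending each shifted frame to a list and calling pd.concat() once at the end should avoid the repeated copying; here is a sketch of what I mean (still without the groupby or the division):

parts = [data]
for w in range(1, 91):
    parts.append(
        data[['c1', 'c2', 'c3', 'c4', 'c5']].shift(w).add_suffix(f'_{w}')
    )

# A single concat instead of 90 incremental ones.
data = pd.concat(parts, axis=1)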

Using groupby().apply():

def _myfunc(x):
    parts = []
    for w in range(1, 4):
        # Shift/divide for this window and suffix the column names.
        parts.append(
            (x.shift(w) / x.shift(w - 1)).rename(
                columns={c: f'{c}_{w}' for c in x.columns}
            )
        )
    return pd.concat(parts, axis=1)

data.groupby("mylevel")[['c1', 'c2', 'c3', 'c4', 'c5']].apply(_myfunc)
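An alternative I'm considering, which drops apply() entirely, is to call shift() on the GroupBy object itself once per window and concatenate a single time at the end. I haven't benchmarked it, but it would look something like this:

cols = ['c1', 'c2', 'c3', 'c4', 'c5']
g = data.groupby("mylevel")[cols]

parts = [data]
for w in range(1, 91):
    # GroupBy.shift() shifts within each group, so values never leak
    # across groups of the index.
    parts.append((g.shift(w) / g.shift(w - 1)).add_suffix(f'_{w}'))

result = pd.concat(parts, axis=1)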

I'd be glad for any assistance here.



