Sunday, 16 July 2023

How to vectorize complex cumulative aggregation problem

I have a dataset like the following:

|   date       | time_index |    identifier   | value | cum_value | bar_index | desired_output |
|--------------|------------|-----------------|-------|-----------|-----------|----------------|
| 2023-06-01   |     1      | stackoverflow   |   5   |     5     |    NaN    |       0        |
| 2023-06-01   |     2      | stackoverflow   |  10   |    15     |    NaN    |       0        |
| 2023-06-01   |     3      | stackoverflow   |  10   |    25     |    NaN    |       1        |
| 2023-06-01   |     1      | cross_validated |   4   |     4     |    NaN    |       0        |
| 2023-06-01   |     2      | cross_validated |   6   |    10     |    NaN    |       0        |
| 2023-06-01   |     3      | cross_validated |  20   |    30     |    NaN    |       1        |
| 2023-06-01   |     4      | cross_validated |   5   |    35     |    NaN    |       2        |
| 2023-06-02   |     1      | stackoverflow   |   2   |     2     |    NaN    |       0        |
| 2023-06-02   |     2      | stackoverflow   |  10   |    12     |    NaN    |       0        |
| 2023-06-02   |     1      | cross_validated |  20   |    20     |    NaN    |       0        |
| 2023-06-02   |     2      | cross_validated |   3   |    23     |    NaN    |       1        |
| 2023-06-02   |     3      | cross_validated |   3   |    26     |    NaN    |       1        |

This is the code that generates the above dataset:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": ["2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-02", "2023-06-02", "2023-06-02", "2023-06-02", "2023-06-02"],
    "time_index": [1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3],
    "identifier": ["stackoverflow", "stackoverflow", "stackoverflow", "cross_validated", "cross_validated", "cross_validated", "cross_validated", 
               "stackoverflow", "stackoverflow", "cross_validated", "cross_validated", "cross_validated"],
    "value": [5, 10, 10, 4, 6, 20, 5, 2, 10, 20, 3, 3]
})
df["cum_value"] = df.groupby(["identifier", "date"])["value"].cumsum()
df["bar_index"] = np.nan
df["desired_output"] = [0, 0, 1, 0, 0, 1, 2, 0, 0, 0, 1, 1]

I want to sample bar_index for each identifier and date according to a fixed (for now) threshold τ=10, using the columns value and/or cum_value.

  • τ = 10
  • date: 2023-06-01 = d1 & 2023-06-02 = d2
  • identifier: stackoverflow = id1 & cross_validated = id2
  • time_index ∈ {t1, t2,...,tn} ∀ d, id
  1. In the example above, observation {id1, d1, t1} has a value below the threshold of 10, so we continue to the next entries -> adding the values of {id1, d1, t1} and {id1, d1, t2} gives a cumulative value of 15, which exceeds the threshold. Therefore we sample both {id1, d1, t1} and {id1, d1, t2} as bar_index 0

  2. In addition, if we encounter an observation with a very large value, for example {id2, d1, t3}, and the previous bar has already ended (its cumulative value exceeded the threshold at the last trade), we sample this observation alone as its own bar_index. The next observation then starts a new accumulation (in theory)

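Put differently, the two rules form a small state machine over each (identifier, date) group: each row joins the currently open bar, and reaching the threshold closes that bar. A minimal pure-Python sketch (`bars` is a hypothetical helper name, not part of the question's code):

```python
def bars(values, threshold):
    """Assign a bar index to each observation: accumulate values and,
    once the running total reaches the threshold, close the bar and
    start a new accumulation on the next observation."""
    out, cum, bar = [], 0, 0
    for v in values:
        cum += v
        out.append(bar)       # current observation belongs to the open bar
        if cum >= threshold:  # bar closed: reset the accumulator
            bar += 1
            cum = 0
    return out
```

For instance, `bars([4, 6, 20, 5], 10)` reproduces the desired_output column `[0, 0, 1, 2]` for the cross_validated / 2023-06-01 group.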
Below is my current non-vectorized approach, for a more comprehensive understanding:

def aggregate_bars(group, threshold):
    cum_value = 0
    bar_index = 0
    col = group.columns.get_loc("bar_index")  # positional write avoids chained assignment

    for i in range(len(group)):
        cum_value += group["value"].iloc[i]
        group.iloc[i, col] = bar_index
        if cum_value >= threshold:  # bar closed: start a new one on the next row
            bar_index += 1
            cum_value = 0

    return group

df = df.groupby(["identifier", "date"], sort=False, group_keys=False).apply(lambda x: aggregate_bars(x, 10))
df
out:

|   date       | time_index |    identifier   | value | cum_value | bar_index | desired_output |
|--------------|------------|-----------------|-------|-----------|-----------|----------------|
| 2023-06-01   |     1      | stackoverflow   |   5   |     5     |    0.0    |       0        |
| 2023-06-01   |     2      | stackoverflow   |  10   |    15     |    0.0    |       0        |
| 2023-06-01   |     3      | stackoverflow   |  10   |    25     |    1.0    |       1        |
| 2023-06-01   |     1      | cross_validated |   4   |     4     |    0.0    |       0        |
| 2023-06-01   |     2      | cross_validated |   6   |    10     |    0.0    |       0        |
| 2023-06-01   |     3      | cross_validated |  20   |    30     |    1.0    |       1        |
| 2023-06-01   |     4      | cross_validated |   5   |    35     |    2.0    |       2        |
| 2023-06-02   |     1      | stackoverflow   |   2   |     2     |    0.0    |       0        |
| 2023-06-02   |     2      | stackoverflow   |  10   |    12     |    0.0    |       0        |
| 2023-06-02   |     1      | cross_validated |  20   |    20     |    0.0    |       0        |
| 2023-06-02   |     2      | cross_validated |   3   |    23     |    1.0    |       1        |
| 2023-06-02   |     3      | cross_validated |   3   |    26     |    1.0    |       1        |

I am looking to vectorize the code so that I can run it in a RAPIDS workflow to process trillions of rows.
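One caveat worth noting: because the accumulator resets whenever a bar closes, each row's bar index depends on the full history of resets before it, so there is no closed-form cumsum/searchsorted expression for it. A common workaround is a single compiled pass over contiguous arrays for all groups at once, which removes the per-group Python overhead even if it is not vectorized in the strict sense. A sketch under that assumption (`bar_index` is a hypothetical helper; decorating it with numba's `@njit`, or porting it to a cuDF/Numba kernel for RAPIDS, is left as an option):

```python
import numpy as np

def bar_index(values, group_ids, threshold):
    """One pass over rows sorted so that each group's rows are contiguous.
    State (running total and bar counter) resets at every group boundary.
    This loop is a candidate for numba's @njit, since the reset logic
    prevents a closed-form vectorization."""
    out = np.empty(len(values), dtype=np.int64)
    cum, bar = 0.0, 0
    for i in range(len(values)):
        if i > 0 and group_ids[i] != group_ids[i - 1]:
            cum, bar = 0.0, 0          # new (identifier, date) group
        cum += values[i]
        out[i] = bar                   # row belongs to the open bar
        if cum >= threshold:           # bar closed: reset accumulator
            bar += 1
            cum = 0.0
    return out
```

The group labels can be built with `df.groupby(["identifier", "date"], sort=False).ngroup().to_numpy()`, and the result assigned back as a plain column.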



from How to vectorize complex cumulative aggregation problem
