Sunday, 16 July 2023

How to vectorize complex cumulative aggregation problem

I have a dataset like the following:

|   date       | time_index |    identifier   | value | cum_value | bar_index | desired_output |
|--------------|------------|-----------------|-------|-----------|-----------|----------------|
| 2023-06-01   |     1      | stackoverflow   |   5   |     5     |    NaN    |       0        |
| 2023-06-01   |     2      | stackoverflow   |  10   |    15     |    NaN    |       0        |
| 2023-06-01   |     3      | stackoverflow   |  10   |    25     |    NaN    |       1        |
| 2023-06-01   |     1      | cross_validated |   4   |     4     |    NaN    |       0        |
| 2023-06-01   |     2      | cross_validated |   6   |    10     |    NaN    |       0        |
| 2023-06-01   |     3      | cross_validated |  20   |    30     |    NaN    |       1        |
| 2023-06-01   |     4      | cross_validated |   5   |    35     |    NaN    |       2        |
| 2023-06-02   |     1      | stackoverflow   |   2   |     2     |    NaN    |       0        |
| 2023-06-02   |     2      | stackoverflow   |  10   |    12     |    NaN    |       0        |
| 2023-06-02   |     1      | cross_validated |  20   |    20     |    NaN    |       0        |
| 2023-06-02   |     2      | cross_validated |   3   |    23     |    NaN    |       1        |
| 2023-06-02   |     3      | cross_validated |   3   |    26     |    NaN    |       1        |

This is the code that generates the above dataset:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": ["2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-02", "2023-06-02", "2023-06-02", "2023-06-02", "2023-06-02"],
    "time_index": [1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3],
    "identifier": ["stackoverflow", "stackoverflow", "stackoverflow", "cross_validated", "cross_validated", "cross_validated", "cross_validated", 
               "stackoverflow", "stackoverflow", "cross_validated", "cross_validated", "cross_validated"],
    "value": [5, 10, 10, 4, 6, 20, 5, 2, 10, 20, 3, 3]
})
df["cum_value"] = df.groupby(["identifier", "date"])["value"].cumsum()
df["bar_index"] = np.nan
df["desired_output"] = [0, 0, 1, 0, 0, 1, 2, 0, 0, 0, 1, 1]

I want to sample bar_index for each identifier and date according to a fixed (for now) threshold τ=10, using the columns value and/or cum_value.

  • τ = 10
  • date: 2023-06-01 = d1 & 2023-06-02 = d2
  • identifier: stackoverflow = id1 & cross_validated = id2
  • time_index ∈ {t1, t2,...,tn} ∀ d, id
  1. In the example above, observation {id1, d1, t1} has a value below the threshold of 10, so we continue to the next entries -> adding the values of {id1, d1, t1} and {id1, d1, t2} gives a cumulative value of 15, which exceeds the threshold. Therefore we sample both {id1, d1, t1} and {id1, d1, t2} as bar_index 0

  2. In addition, if we encounter an observation with a very large value, for example {id2, d1, t3}, and the previous bar has already ended (its cumulative value exceeded the threshold at the last trade), we sample this observation alone as its own bar_index. The next observation then starts a new accumulation (in theory)

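Put differently, the two rules form a small state machine over each (identifier, date) group: each row joins the currently open bar, and reaching the threshold closes that bar. A minimal pure-Python sketch (`bars` is a hypothetical helper name, not part of the question's code):

```python
def bars(values, threshold):
    """Assign a bar index to each observation: accumulate values and,
    once the running total reaches the threshold, close the bar and
    start a new accumulation on the next observation."""
    out, cum, bar = [], 0, 0
    for v in values:
        cum += v
        out.append(bar)       # current observation belongs to the open bar
        if cum >= threshold:  # bar closed: reset the accumulator
            bar += 1
            cum = 0
    return out
```

For instance, `bars([4, 6, 20, 5], 10)` reproduces the desired_output column `[0, 0, 1, 2]` for the cross_validated / 2023-06-01 group.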
Below is my current non-vectorized approach, for a more comprehensive understanding:

def aggregate_bars(group, threshold):
    cum_value = 0
    bar_index = 0
    col = group.columns.get_loc("bar_index")  # positional write avoids chained assignment

    for i in range(len(group)):
        cum_value += group["value"].iloc[i]
        group.iloc[i, col] = bar_index
        if cum_value >= threshold:  # bar closed: start a new one on the next row
            bar_index += 1
            cum_value = 0

    return group

df = df.groupby(["identifier", "date"], sort=False, group_keys=False).apply(lambda x: aggregate_bars(x, 10))
df
out:

|   date       | time_index |    identifier   | value | cum_value | bar_index | desired_output |
|--------------|------------|-----------------|-------|-----------|-----------|----------------|
| 2023-06-01   |     1      | stackoverflow   |   5   |     5     |    0.0    |       0        |
| 2023-06-01   |     2      | stackoverflow   |  10   |    15     |    0.0    |       0        |
| 2023-06-01   |     3      | stackoverflow   |  10   |    25     |    1.0    |       1        |
| 2023-06-01   |     1      | cross_validated |   4   |     4     |    0.0    |       0        |
| 2023-06-01   |     2      | cross_validated |   6   |    10     |    0.0    |       0        |
| 2023-06-01   |     3      | cross_validated |  20   |    30     |    1.0    |       1        |
| 2023-06-01   |     4      | cross_validated |   5   |    35     |    2.0    |       2        |
| 2023-06-02   |     1      | stackoverflow   |   2   |     2     |    0.0    |       0        |
| 2023-06-02   |     2      | stackoverflow   |  10   |    12     |    0.0    |       0        |
| 2023-06-02   |     1      | cross_validated |  20   |    20     |    0.0    |       0        |
| 2023-06-02   |     2      | cross_validated |   3   |    23     |    1.0    |       1        |
| 2023-06-02   |     3      | cross_validated |   3   |    26     |    1.0    |       1        |

I am looking to vectorize the code so that I can run it in a RAPIDS workflow to process trillions of rows.
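One caveat worth noting: because the accumulator resets whenever a bar closes, each row's bar index depends on the full history of resets before it, so there is no closed-form cumsum/searchsorted expression for it. A common workaround is a single compiled pass over contiguous arrays for all groups at once, which removes the per-group Python overhead even if it is not vectorized in the strict sense. A sketch under that assumption (`bar_index` is a hypothetical helper; decorating it with numba's `@njit`, or porting it to a cuDF/Numba kernel for RAPIDS, is left as an option):

```python
import numpy as np

def bar_index(values, group_ids, threshold):
    """One pass over rows sorted so that each group's rows are contiguous.
    State (running total and bar counter) resets at every group boundary.
    This loop is a candidate for numba's @njit, since the reset logic
    prevents a closed-form vectorization."""
    out = np.empty(len(values), dtype=np.int64)
    cum, bar = 0.0, 0
    for i in range(len(values)):
        if i > 0 and group_ids[i] != group_ids[i - 1]:
            cum, bar = 0.0, 0          # new (identifier, date) group
        cum += values[i]
        out[i] = bar                   # row belongs to the open bar
        if cum >= threshold:           # bar closed: reset accumulator
            bar += 1
            cum = 0.0
    return out
```

The group labels can be built with `df.groupby(["identifier", "date"], sort=False).ngroup().to_numpy()`, and the result assigned back as a plain column.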



from How to vectorize complex cumulative aggregation problem
