I have a dataset like the following:
| date | time_index | identifier | value | cum_value | bar_index | desired_output |
|--------------|------------|-----------------|-------|-----------|-----------|----------------|
| 2023-06-01 | 1 | stackoverflow | 5 | 5 | NaN | 0 |
| 2023-06-01 | 2 | stackoverflow | 10 | 15 | NaN | 0 |
| 2023-06-01 | 3 | stackoverflow | 10 | 25 | NaN | 1 |
| 2023-06-01 | 1 | cross_validated | 4 | 4 | NaN | 0 |
| 2023-06-01 | 2 | cross_validated | 6 | 10 | NaN | 0 |
| 2023-06-01 | 3 | cross_validated | 20 | 30 | NaN | 1 |
| 2023-06-01 | 4 | cross_validated | 5 | 35 | NaN | 2 |
| 2023-06-02 | 1 | stackoverflow | 2 | 2 | NaN | 0 |
| 2023-06-02 | 2 | stackoverflow | 10 | 12 | NaN | 0 |
| 2023-06-02 | 1 | cross_validated | 20 | 20 | NaN | 0 |
| 2023-06-02 | 2 | cross_validated | 3 | 23 | NaN | 1 |
| 2023-06-02 | 3 | cross_validated | 3 | 26 | NaN | 1 |
This is the code that generates the above dataset:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "date": ["2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01", "2023-06-01",
             "2023-06-02", "2023-06-02", "2023-06-02", "2023-06-02", "2023-06-02"],
    "time_index": [1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3],
    "identifier": ["stackoverflow", "stackoverflow", "stackoverflow", "cross_validated", "cross_validated", "cross_validated", "cross_validated",
                   "stackoverflow", "stackoverflow", "cross_validated", "cross_validated", "cross_validated"],
    "value": [5, 10, 10, 4, 6, 20, 5, 2, 10, 20, 3, 3]
})

# running total of value within each (identifier, date) group
df["cum_value"] = df.groupby(["identifier", "date"])["value"].cumsum()
df["bar_index"] = np.nan
df["desired_output"] = [0, 0, 1, 0, 0, 1, 2, 0, 0, 0, 1, 1]
I want to sample bar_index for each identifier and date according to a fixed (for now) threshold τ = 10, using the columns value and/or cum_value. Notation:

- threshold: τ = 10
- date: 2023-06-01 = d1 & 2023-06-02 = d2
- identifier: stackoverflow = id1 & cross_validated = id2
- time_index ∈ {t1, t2, ..., tn} ∀ d, id
- In the above example, observation {id1, d1, t1} has a value less than the threshold of 10, so we continue to the next entries. If we add the value of {id1, d1, t1} and {id1, d1, t2} together, we reach a cumulative value of 15, which exceeds the threshold. Therefore, we sample both {id1, d1, t1} and {id1, d1, t2} as bar_index 0.
- In addition, if we encounter an observation with a very large value, for example {id2, d1, t3}, and the previous bar has already ended (the cumulative value exceeded the threshold at the last observation), we sample this observation alone as its own bar_index. The next observation then starts a new accumulation (in theory).
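Stated formally (my own restatement of the rule above): keep a running sum S that resets after every bar boundary, and let bar_index count the boundaries seen so far, per identifier and date:

S(id, d, t) = value(t)                  if t = t1 or S(id, d, t−1) ≥ τ
S(id, d, t) = S(id, d, t−1) + value(t)  otherwise

bar_index(id, d, t) = |{s < t : S(id, d, s) ≥ τ}|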
Below is my current non-vectorized approach, for a more comprehensive understanding:
def aggregate_bars(group, threshold):
    cum_value = 0
    bar_index = 0
    for i in range(len(group)):
        cum_value += group.iloc[i]["value"]
        # every observation belongs to the bar that is currently open
        # (positional .iloc avoids the chained-assignment warning)
        group.iloc[i, group.columns.get_loc("bar_index")] = bar_index
        # once the accumulated value reaches the threshold, close the
        # bar and start accumulating for the next one
        if cum_value >= threshold:
            bar_index += 1
            cum_value = 0
    return group
df = df.groupby(["identifier", "date"], group_keys=False).apply(lambda x: aggregate_bars(x, 10))
df
out:
| date | time_index | identifier | value | cum_value | bar_index | desired_output |
|--------------|------------|-----------------|-------|-----------|-----------|----------------|
| 2023-06-01 | 1 | stackoverflow | 5 | 5 | 0.0 | 0 |
| 2023-06-01 | 2 | stackoverflow | 10 | 15 | 0.0 | 0 |
| 2023-06-01 | 3 | stackoverflow | 10 | 25 | 1.0 | 1 |
| 2023-06-01 | 1 | cross_validated | 4 | 4 | 0.0 | 0 |
| 2023-06-01 | 2 | cross_validated | 6 | 10 | 0.0 | 0 |
| 2023-06-01 | 3 | cross_validated | 20 | 30 | 1.0 | 1 |
| 2023-06-01 | 4 | cross_validated | 5 | 35 | 2.0 | 2 |
| 2023-06-02 | 1 | stackoverflow | 2 | 2 | 0.0 | 0 |
| 2023-06-02 | 2 | stackoverflow | 10 | 12 | 0.0 | 0 |
| 2023-06-02 | 1 | cross_validated | 20 | 20 | 0.0 | 0 |
| 2023-06-02 | 2 | cross_validated | 3 | 23 | 1.0 | 1 |
| 2023-06-02 | 3 | cross_validated | 3 | 26 | 1.0 | 1 |
I am looking to vectorize this code so that I can run it in a RAPIDS workflow to process trillions of rows.
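The scan seems inherently sequential within a group (each bar boundary depends on where the previous one ended), so I don't expect a pure cumsum-based formulation. The closest I have found is compiling the per-group loop over plain arrays with Numba. A minimal sketch of that idea (bar_indices is my own helper name; cuDF has Numba-backed UDF support, though I haven't verified this exact pattern on GPU):

import numba
import numpy as np

@numba.njit
def bar_indices(values, threshold):
    # Same logic as aggregate_bars, but over a plain NumPy array:
    # accumulate until the threshold is reached, then open a new bar.
    out = np.empty(len(values), dtype=np.int64)
    cum = 0.0
    bar = 0
    for i in range(len(values)):
        cum += values[i]
        out[i] = bar
        if cum >= threshold:
            bar += 1
            cum = 0.0
    return out

df["bar_index"] = (
    df.groupby(["identifier", "date"])["value"]
      .transform(lambda v: bar_indices(v.to_numpy(np.float64), 10))
)

This removes the Python-level row loop but still processes each group serially, so I'm not sure it qualifies as fully vectorized for the scale I'm targeting.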