I have a dataframe, filled with values that look like this. I only have rows for months where entities generated revenue, but entities may have had lifetimes longer than what is depicted by the revenue they're showing.
| entity | revenue_generated | first_purchase | months_since_first_purchase |
|---|---|---|---|
| A | 20 | 2022-01 | 0 |
| A | 60 | 2022-01 | 2 |
| A | 80 | 2022-01 | 3 |
| A | 15 | 2022-01 | 5 |
| B | 30 | 2022-03 | 0 |
| B | 10 | 2022-03 | 1 |
| B | 12 | 2022-03 | 2 |
| G | 25 | 2022-01 | 0 |
| G | 19 | 2022-01 | 1 |
| G | 90 | 2022-01 | 2 |
For a quick explanation of a lifetime in this context, Entity A first purchased during January 2022. 5 months since their first purchase, where they generated $15 is the maximum of the lifetime. In other words, since I am writing this in June 2022, we have no visibility of their purchase history in July 2022. So their maximum potential lifetime as a customer is 5 months, (representing the 6 observable months of Jan, Feb, March, April, May and June and starting the count from 0)
For simplicity's sake, let's say B first purchased in March 2022, so their maximum is 3. So A's maximum potential lifetime is represented in the dataset, but B's and G's are not.
G also made their first purchased in January 2022. So their maximum months_since_first_purchase value is also 5, but they did not generate any revenue during that month so they are not represented.
I want to change the dataset so it includes all of the months_since_first_purchase for each entity and includes their revenue in that month as 0. So, my goal dataset (with additions emphasised) is:
| entity | revenue_generated | first_purchase | months_since_first_purchase |
|---|---|---|---|
| A | 20 | 2022-01 | 0 |
| A | 0 | 2022-01 | 1 |
| A | 60 | 2022-01 | 2 |
| A | 80 | 2022-01 | 3 |
| A | 0 | 2022-01 | 4 |
| A | 15 | 2022-01 | 5 |
| B | 30 | 2022-03 | 0 |
| B | 10 | 2022-03 | 1 |
| B | 12 | 2022-03 | 2 |
| B | 0 | 2022-03 | 3 |
| G | 25 | 2022-01 | 0 |
| G | 19 | 2022-01 | 1 |
| G | 90 | 2022-01 | 2 |
| G | 0 | 2022-01 | 3 |
| G | 0 | 2022-01 | 4 |
| G | 0 | 2022-01 | 5 |
I currently have this implemented in a for loop where I iterate over a set of the entities, and build a new dataframe for each of them and concatenate it to a new main dataframe but this is quite slow. Is there a more pythonic way of approaching this problem with Pandas that doesn't involve iterating and rebuilding a new dataframe?
from Adding rows to a pandas dataframe with respect to existing entities within the dataframe
No comments:
Post a Comment