Friday, 3 June 2022

Adding rows to a pandas dataframe with respect to existing entities within the dataframe

I have a dataframe, filled with values that look like this. I only have rows for months where entities generated revenue, but entities may have had lifetimes longer than what is depicted by the revenue they're showing.

entity revenue_generated first_purchase months_since_first_purchase
A 20 2022-01 0
A 60 2022-01 2
A 80 2022-01 3
A 15 2022-01 5
B 30 2022-03 0
B 10 2022-03 1
B 12 2022-03 2
G 25 2022-01 0
G 19 2022-01 1
G 90 2022-01 2

For a quick explanation of a lifetime in this context, Entity A first purchased during January 2022. 5 months since their first purchase, where they generated $15 is the maximum of the lifetime. In other words, since I am writing this in June 2022, we have no visibility of their purchase history in July 2022. So their maximum potential lifetime as a customer is 5 months, (representing the 6 observable months of Jan, Feb, March, April, May and June and starting the count from 0)

For simplicity's sake, let's say B first purchased in March 2022, so their maximum is 3. So A's maximum potential lifetime is represented in the dataset, but B's and G's are not.

G also made their first purchased in January 2022. So their maximum months_since_first_purchase value is also 5, but they did not generate any revenue during that month so they are not represented.

I want to change the dataset so it includes all of the months_since_first_purchase for each entity and includes their revenue in that month as 0. So, my goal dataset (with additions emphasised) is:

entity revenue_generated first_purchase months_since_first_purchase
A 20 2022-01 0
A 0 2022-01 1
A 60 2022-01 2
A 80 2022-01 3
A 0 2022-01 4
A 15 2022-01 5
B 30 2022-03 0
B 10 2022-03 1
B 12 2022-03 2
B 0 2022-03 3
G 25 2022-01 0
G 19 2022-01 1
G 90 2022-01 2
G 0 2022-01 3
G 0 2022-01 4
G 0 2022-01 5

I currently have this implemented in a for loop where I iterate over a set of the entities, and build a new dataframe for each of them and concatenate it to a new main dataframe but this is quite slow. Is there a more pythonic way of approaching this problem with Pandas that doesn't involve iterating and rebuilding a new dataframe?



from Adding rows to a pandas dataframe with respect to existing entities within the dataframe

No comments:

Post a Comment