Sunday, 25 July 2021

Conditional running count in pandas based on conditions in 2 columns (counting number of people in a queue based on timestamps)

I have the following dataframe in python pandas which is over a million rows,

ID      BOOKING_TIME    ENTRY_TIME              
23239   1/1/2020 0:00   1/1/2020 0:40                 
51042   1/1/2020 0:11   1/1/2020 0:42                 
73373   1/1/2020 0:15   1/1/2020 0:56                   
14222   1/1/2020 0:22   1/1/2020 1:00                   
27116   1/1/2020 0:55   1/1/2020 1:15                    
....

The two columns have been merged from 2 different sql tables.

The same user could make a booking further down the table again.

Each row shows a user creating a booking (entering the queue to enter the venue) and the time the user enters the venue.

I am trying to create a column that shows the number of people that are currently queuing based on the time a user makes a new booking.

When a booking is made, an entry under BOOKING_TIME is recorded and they join a queue.

When they enter the venue, the ENTRY_TIME is recorded and they leave the queue.

I want to do a running count of the people that are in the queue. Meaning that in each row, if a booking time is hit, it does +1 in the IN_QUEUE column and once the entry time (for any of the cases in the previous timestamps) is hit, I want to -1 from the IN_QUEUE.

It will create a new column like this. (Ignore the working column, that is just to show the process)


ID      BOOKING_TIME    ENTRY_TIME      IN_QUEUE           (working)
23239   1/1/2020 0:00   1/1/2020 0:40   1                   +1
51042   1/1/2020 0:11   1/1/2020 0:42   2                   +1
73373   1/1/2020 0:15   1/1/2020 0:56   3                   +1
14222   1/1/2020 0:22   1/1/2020 1:00   4                   +1
27116   1/1/2020 0:55   1/1/2020 1:15   3                   -1   +1
....

Row 5 will -1 first because before 0:55am on 1/1/2020, the person from the first row entered the venue at 0:40am.

I think I have to use an apply function with a conditional function on every row but I am not sure how it would work.

I was thinking maybe for each row, count all rows where BOOKING_TIME < CURRENT BOOKING_TIME and ENTRY_TIME > CURRENT BOOKING_TIME, but I'm not sure how to code it or if my logic is even correct.

I saw this post Pandas: conditional rolling count but I cant figure out how to make it work with my case.



from Conditional running count in pandas based on conditions in 2 columns (counting number of people in a queue based on timestamps)

No comments:

Post a Comment