Hemant Vishwakarma: Group values if a date condition is respected in pandas

Saturday 31 October 2020

I have this pandas dataframe:

import pandas as pd    
d1 = pd.DataFrame({
    'date' : ['2020-08-01', '2020-08-03', '2020-08-07', 
              '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id' : ['u123', 'u123', 'u123', 
                 'u321', 'u321', 'u321'],
    'item_id' : ['i1', 'i2', 'i3',
                 'i3', 'i1', 'i2' ],
    'like' : [1, 1, 0, 
             0, 1, 0]
})

My goal is to get to a dataframe that has 3 new columns for each row:

A list of all previously liked items (rows for the same user with like == 1 and a date earlier than the current row's date)
A list of all previously disliked items (rows for the same user with like == 0 and a date earlier than the current row's date)
The mean of the 'like' column over all of that user's previous interactions (date earlier than the current row's date)
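
As a rough sketch of the third column only, the running mean can be derived from cumulative sums instead of re-filtering the dataframe for every row. This assumes the rows are sorted by date within each user and that a user never has two rows on the same date (otherwise "previous row" and "earlier date" are not the same thing). The names g, prev_count and prev_sum are illustrative and not part of the original code.

```python
import pandas as pd

d1 = pd.DataFrame({
    'date': ['2020-08-01', '2020-08-03', '2020-08-07',
             '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id': ['u123', 'u123', 'u123', 'u321', 'u321', 'u321'],
    'item_id': ['i1', 'i2', 'i3', 'i3', 'i1', 'i2'],
    'like': [1, 1, 0, 0, 1, 0],
})

# Assumption: rows are date-ordered within each user, one row per user and date.
g = d1.groupby('user_id')['like']
prev_count = g.cumcount()           # number of earlier rows for the same user
prev_sum = g.cumsum() - d1['like']  # sum of 'like' over those earlier rows

# Rows with no earlier interaction get NaN instead of a division by zero.
d1['previous_like_avg'] = prev_sum / prev_count.where(prev_count > 0)
print(d1)
```

On the sample data this gives NaN, 1.0, 1.0 for u123 and NaN, 0.0, 0.5 for u321.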

More info:

My complete dataframe is very big, so I'm looking for an efficient way to do this in terms of memory and computational cost
The data is already ordered by date and by user_id
I have no problem using libraries other than pandas
I have access to a multithreaded CPU and to a GPU

Here's some inefficient code that achieves what I'm trying to do. Do you have suggestions on how I can improve it?

import pandas as pd
d1 = pd.DataFrame({
    'date' : ['2020-08-01', '2020-08-03', '2020-08-07', 
              '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id' : ['u123', 'u123', 'u123', 
                 'u321', 'u321', 'u321'],
    'item_id' : ['i1', 'i2', 'i3',
                 'i3', 'i1', 'i2' ],
    'like' : [1, 1, 0, 
             0, 1, 0]
})

# Collect the results row by row, then attach them as columns.
liked, disliked, like_avg = [], [], []

for index, row in d1.iterrows():
    # Earlier interactions of the same user (ISO date strings compare correctly).
    d1_u = d1[(d1.user_id == row['user_id']) & (d1.date < row['date'])]
    liked.append(d1_u[d1_u.like == 1].item_id.tolist())
    disliked.append(d1_u[d1_u.like == 0].item_id.tolist())
    like_avg.append(d1_u.like.mean())  # NaN when there is no earlier interaction

d1['previously_liked_item_ids'] = liked
d1['previously_disliked_item_ids'] = disliked
d1['previous_like_avg'] = like_avg
print(d1)
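
One possible direction for the two list columns is a single pass that keeps a running history per user, so the full dataframe is not filtered again for every row. This is only a sketch under the same assumptions as the question (rows ordered by date within each user) plus the assumption that a user has at most one row per date. The helper name add_history_columns is hypothetical. Each row still stores its own copy of the history, so the size of the output can grow quadratically with the number of interactions per user.

```python
import pandas as pd

def add_history_columns(df):
    """Hypothetical helper: add per-user lists of earlier liked/disliked items.

    Assumes rows are date-ordered within each user, one row per user and date.
    """
    liked_seen, disliked_seen = {}, {}   # user_id -> items seen so far
    liked_col, disliked_col = [], []
    for user, item, like in zip(df['user_id'], df['item_id'], df['like']):
        liked = liked_seen.setdefault(user, [])
        disliked = disliked_seen.setdefault(user, [])
        # Snapshot the history before recording the current row.
        liked_col.append(list(liked))
        disliked_col.append(list(disliked))
        (liked if like == 1 else disliked).append(item)
    out = df.copy()
    out['previously_liked_item_ids'] = liked_col
    out['previously_disliked_item_ids'] = disliked_col
    return out

sample = pd.DataFrame({
    'date': ['2020-08-01', '2020-08-03', '2020-08-07',
             '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id': ['u123', 'u123', 'u123', 'u321', 'u321', 'u321'],
    'item_id': ['i1', 'i2', 'i3', 'i3', 'i1', 'i2'],
    'like': [1, 1, 0, 0, 1, 0],
})
result = add_history_columns(sample)
print(result)
```

Because the histories are kept per user, this would also parallelize naturally by splitting the data on user_id, though whether that pays off depends on the data.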

from Group values if a date condition is respected in pandas
