Saturday, 31 October 2020

Group values if a date condition is respected in pandas

I have this pandas DataFrame:

import pandas as pd    
d1 = pd.DataFrame({
    'date' : ['2020-08-01', '2020-08-03', '2020-08-07', 
              '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id' : ['u123', 'u123', 'u123', 
                 'u321', 'u321', 'u321'],
    'item_id' : ['i1', 'i2', 'i3',
                 'i3', 'i1', 'i2' ],
    'like' : [1, 1, 0, 
             0, 1, 0]
})

[Image: Original_df]

My goal is to get a DataFrame like this:

[Image: Desired_df]

For each row, I want 3 new columns:

  1. List of all previously liked items (rows for the same user with like == 1 and a date earlier than the current row's date)
  2. List of all previously disliked items (rows for the same user with like == 0 and a date earlier than the current row's date)
  3. Mean of the column 'like' over all of that user's previous interactions (date earlier than the current row's date)

More info:

  • My complete DataFrame is very big, so I'm looking for an approach that is efficient in terms of both memory and computation
  • The data is already ordered by date and by user_id
  • I have no problem using libraries other than pandas
  • I have access to a multithreaded CPU and to a GPU

Here's inefficient code that does what I'm trying to do. Do you have suggestions on how I can improve it?

import pandas as pd
d1 = pd.DataFrame({
    'date' : ['2020-08-01', '2020-08-03', '2020-08-07', 
              '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id' : ['u123', 'u123', 'u123', 
                 'u321', 'u321', 'u321'],
    'item_id' : ['i1', 'i2', 'i3',
                 'i3', 'i1', 'i2' ],
    'like' : [1, 1, 0, 
             0, 1, 0]
})

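# Object-dtype columns so that each cell can hold a list
d1['previously_liked_item_ids'] = None
d1['previously_disliked_item_ids'] = None
d1['previous_like_avg'] = float('nan')

for index, row in d1.iterrows():
    # Earlier interactions of the same user (ISO date strings compare chronologically)
    d1_u = d1[(d1.user_id == row['user_id']) & (d1.date < row['date'])]
    # No trailing commas here: they would wrap each value in a one-element tuple
    d1.at[index, 'previously_liked_item_ids'] = d1_u[d1_u.like == 1].item_id.tolist()
    d1.at[index, 'previously_disliked_item_ids'] = d1_u[d1_u.like == 0].item_id.tolist()
    d1.at[index, 'previous_like_avg'] = d1_u.like.mean()
print(d1)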
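
One possible direction, as a rough sketch rather than a definitive answer: the loop above re-filters the whole frame for every row, while the same three columns can be built in a single pass that carries a running history per user. The helper name add_history is hypothetical. The sketch assumes each user has at most one interaction per date, so that "every earlier row of the same user" is the same as "date < current date"; if a user can have several rows on the same date, the strict date condition would need extra handling.

```python
import pandas as pd

d1 = pd.DataFrame({
    'date': ['2020-08-01', '2020-08-03', '2020-08-07',
             '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id': ['u123', 'u123', 'u123',
                'u321', 'u321', 'u321'],
    'item_id': ['i1', 'i2', 'i3',
                'i3', 'i1', 'i2'],
    'like': [1, 1, 0,
             0, 1, 0],
})


def add_history(df):
    """Hypothetical helper: add the three 'previous interactions' columns.

    Assumes at most one row per (user_id, date).
    """
    # Make the per-user chronological order explicit
    df = df.sort_values(['user_id', 'date'])

    # Mean of earlier likes = (sum of earlier likes) / (number of earlier rows).
    # For a user's first row this is 0 / 0, which pandas evaluates to NaN.
    likes = df.groupby('user_id')['like']
    prev_count = likes.cumcount()
    prev_sum = likes.cumsum() - df['like']
    df['previous_like_avg'] = prev_sum / prev_count

    # Single pass over the rows, keeping a running history per user
    history = {}  # user_id -> (liked items so far, disliked items so far)
    liked_col, disliked_col = [], []
    for user, item, like in zip(df['user_id'], df['item_id'], df['like']):
        liked, disliked = history.setdefault(user, ([], []))
        # Snapshot the history *before* recording the current row
        liked_col.append(liked.copy())
        disliked_col.append(disliked.copy())
        (liked if like == 1 else disliked).append(item)

    df['previously_liked_item_ids'] = pd.Series(liked_col, index=df.index)
    df['previously_disliked_item_ids'] = pd.Series(disliked_col, index=df.index)
    return df


result = add_history(d1)
print(result)
```

The average is fully vectorized. The list columns still need a Python-level loop, but it visits each row only once. Note that storing a full copy of the history in every row makes memory grow with the square of the number of interactions per user, so on a very large frame it may be worth checking whether the lists are really needed in materialized form.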


from Group values if a date condition is respected in pandas
