I have this pandas dataframe:
import pandas as pd

d1 = pd.DataFrame({
    'date': ['2020-08-01', '2020-08-03', '2020-08-07',
             '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id': ['u123', 'u123', 'u123',
                'u321', 'u321', 'u321'],
    'item_id': ['i1', 'i2', 'i3',
                'i3', 'i1', 'i2'],
    'like': [1, 1, 0,
             0, 1, 0]
})
My goal is to add 3 new columns to each row:
- A list of all previously liked items (like == 1 and, for the same user, the date of the liked item < the date of the current row)
- A list of all previously disliked items (like == 0 and, for the same user, the date of the disliked item < the date of the current row)
- The mean of the column 'like' over all previous interactions of that user (date of the previous interaction < date of the current row)
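As a rough sketch of the third column only: if the rows are sorted by date within each user and a user never has two interactions on the same date (both are assumptions, since otherwise "earlier row" and "earlier date" are not the same thing), the mean of the previous likes can be computed from grouped cumulative sums without a Python loop. The helper names prev_sum and prev_cnt are made up for illustration.

```python
import pandas as pd

d1 = pd.DataFrame({
    'date': ['2020-08-01', '2020-08-03', '2020-08-07',
             '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id': ['u123', 'u123', 'u123',
                'u321', 'u321', 'u321'],
    'item_id': ['i1', 'i2', 'i3',
                'i3', 'i1', 'i2'],
    'like': [1, 1, 0,
             0, 1, 0]
})

# Assumption: rows are sorted by date within each user and dates are
# unique per user, so "strictly earlier date" means "earlier row".
likes_by_user = d1.groupby('user_id')['like']

# Sum of 'like' over the strictly earlier rows of the same user
prev_sum = likes_by_user.cumsum() - d1['like']
# Number of strictly earlier rows of the same user (0 for the first row)
prev_cnt = likes_by_user.cumcount()

# A user's first row has no history, so its mean is left as NaN
d1['previous_like_avg'] = prev_sum / prev_cnt.where(prev_cnt > 0)

print(d1[['user_id', 'date', 'like', 'previous_like_avg']])
```

This matches what `d1_u.like.mean()` returns on an empty selection (NaN) for the first interaction of each user.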
More info:
- My complete dataframe is very big, so I'm looking for a way to do this that is efficient in terms of memory and computational cost
- The data is already ordered by date and by user_id
- I have no problem using libraries other than pandas
- I have access to a multithreaded CPU and to a GPU
Here is some inefficient code that achieves what I'm trying to do. Do you have suggestions on how I can improve it?
import pandas as pd

d1 = pd.DataFrame({
    'date': ['2020-08-01', '2020-08-03', '2020-08-07',
             '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id': ['u123', 'u123', 'u123',
                'u321', 'u321', 'u321'],
    'item_id': ['i1', 'i2', 'i3',
                'i3', 'i1', 'i2'],
    'like': [1, 1, 0,
             0, 1, 0]
})
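# Object-dtype placeholders so that each cell can hold a list
d1['previously_liked_item_ids'] = None
d1['previously_disliked_item_ids'] = None
d1['previous_like_avg'] = float('nan')

for index, row in d1.iterrows():
    # Earlier interactions of the same user (ISO date strings compare in date order)
    d1_u = d1[(d1.user_id == row['user_id']) & (d1.date < row['date'])]
    # .tolist() stores a plain list; a trailing comma here would store a 1-tuple holding a Series
    d1.at[index, 'previously_liked_item_ids'] = d1_u[d1_u.like == 1].item_id.tolist()
    d1.at[index, 'previously_disliked_item_ids'] = d1_u[d1_u.like == 0].item_id.tolist()
    d1.at[index, 'previous_like_avg'] = d1_u.like.mean()

print(d1)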
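One possible direction for the two list columns, offered as a sketch rather than a benchmarked solution: the loop above filters the whole dataframe once per row, which is quadratic in the number of rows. A single pass that keeps a running list per user avoids that. It relies on the same assumptions as stated in the question (rows ordered by date within each user) plus one more that is not stated there: a user has at most one interaction per date. The variable names are made up for illustration.

```python
from collections import defaultdict

import pandas as pd

d1 = pd.DataFrame({
    'date': ['2020-08-01', '2020-08-03', '2020-08-07',
             '2020-08-02', '2020-08-04', '2020-08-10'],
    'user_id': ['u123', 'u123', 'u123',
                'u321', 'u321', 'u321'],
    'item_id': ['i1', 'i2', 'i3',
                'i3', 'i1', 'i2'],
    'like': [1, 1, 0,
             0, 1, 0]
})

# Running history per user, filled in as the rows are visited in date order
liked_so_far = defaultdict(list)
disliked_so_far = defaultdict(list)

prev_liked, prev_disliked = [], []
for user, item, like in zip(d1['user_id'], d1['item_id'], d1['like']):
    # Snapshot the history BEFORE adding the current row, so only
    # strictly earlier interactions are included
    prev_liked.append(liked_so_far[user].copy())
    prev_disliked.append(disliked_so_far[user].copy())
    if like == 1:
        liked_so_far[user].append(item)
    else:
        disliked_so_far[user].append(item)

# Wrapping in pd.Series keeps each list as a single object-dtype cell
d1['previously_liked_item_ids'] = pd.Series(prev_liked, index=d1.index)
d1['previously_disliked_item_ids'] = pd.Series(prev_disliked, index=d1.index)

print(d1)
```

The copies still make the output itself grow with the length of each user's history, so memory use is driven by the requested list columns rather than by the loop.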
from Group values if a date condition is respected in pandas