Tuesday 20 October 2020

Efficiently compute temporal features with pandas

I have the following .csv file:

Match_idx,Date,Player_1,Player_2,Player_1_wins
0,2020-01-01,p1,p2,1
1,2020-01-02,p2,p3,0
2,2020-01-03,p3,p1,1
3,2020-01-04,p4,p1,1
4,2020-01-05,p1,p3,0
5,2020-01-06,p3,p1,1

I want to compute additional columns to obtain the following output .csv file:

Match_idx,Date,Player_1,Player_2,Player_1_wins,Player_1_winrate,Player_2_winrate,Player_1_matches,Player_2_matches,Head_to_head
0,2020-01-01,p1,p2,1,0,0,0,0,''
1,2020-01-02,p2,p3,0,0,0,1,0,''
2,2020-01-03,p3,p1,1,1,1,1,1,''
3,2020-01-04,p4,p1,1,0,1/2,0,2,''
4,2020-01-05,p1,p3,0,1/3,2/2,3,2,'0'
5,2020-01-06,p3,p1,1,3/3,1/4,3,4,'11'

The semantics of each column:

  • Match_idx, Date, Player_1, Player_2: straightforward
  • Player_1_wins: 1 if Player_1 won the match, else 0

These columns will be kept, and I want to add the following ones:

  • Player_1_winrate: number_of_wins_for_player_1_before_this_one / number_of_matches_played_by_player_1_before_this_one (0 when the player has no previous match, as in the sample output)

  • Player_2_winrate: same as above for Player_2

  • Player_1_matches: number_of_matches_played_by_player_1_before_this_one

  • Player_2_matches: same as above for Player_2

  • Head_to_head: outcomes of the previous matches between Player_1 and Player_2, encoded as a string of '0' and '1' characters, with '1' if the current row's Player_1 won that earlier match, else '0'. The string is empty if the two players have not met before.
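To make the Head_to_head definition concrete, here is a minimal sketch of how I read it. The function name head_to_head and the dictionary keyed by the unordered pair of players are only illustrative assumptions, and the input is assumed to be already sorted by date. Past winners are stored per pair, then re-encoded from the point of view of the current row's Player_1.

```python
from collections import defaultdict


def head_to_head(matches):
    """Build one Head_to_head string per match (hypothetical helper).

    `matches` is an iterable of (player_1, player_2, player_1_wins)
    tuples, assumed to be in chronological order.
    """
    # Unordered pair of players -> list of the winners of their past matches.
    winners = defaultdict(list)
    encoded = []
    for p1, p2, p1_wins in matches:
        key = frozenset((p1, p2))
        # Encode the history as seen by the current row's Player_1.
        encoded.append("".join("1" if w == p1 else "0" for w in winners[key]))
        # Only now record the outcome of the current match.
        winners[key].append(p1 if p1_wins else p2)
    return encoded


# The six matches that appear in the sample output file.
sample = [
    ("p1", "p2", 1),
    ("p2", "p3", 0),
    ("p3", "p1", 1),
    ("p4", "p1", 1),
    ("p1", "p3", 0),
    ("p3", "p1", 1),
]
history = head_to_head(sample)
print(history)  # ['', '', '', '', '0', '11']
```

With a DataFrame, the same function could be fed with zip(df['Player_1'], df['Player_2'], df['Player_1_wins']) after sorting by Date.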

What I have done

I am using the pandas library to manipulate this file. The naive approach I have been thinking of is as follows: select every match, won or lost, played by a player, and order them by date. Then, for the win-ratio feature, apply the following two functions to a match.

import pandas as pd

def get_matches_won_before_by_player(df: pd.DataFrame, player: str, before: str) -> pd.DataFrame:
    # The player won either as Player_1 (Player_1_wins == 1)
    # or as Player_2 (Player_1_wins == 0).
    mask_player_won = (
        ((df['Player_1_wins'] == 1) & (df['Player_1'] == player)) |
        ((df['Player_1_wins'] == 0) & (df['Player_2'] == player))
    )

    # sort_values returns a new DataFrame; sorting the filtered slice
    # in place can trigger SettingWithCopyWarning.
    return df[(df['Date'] < before) & mask_player_won].sort_values(by='Date')

def get_matches_played_before_by_player(df: pd.DataFrame, player: str, before: str) -> pd.DataFrame:
    # The player took part in the match on either side.
    mask_player_played = (
        (df['Player_1'] == player) |
        (df['Player_2'] == player)
    )

    return df[(df['Date'] < before) & mask_player_played].sort_values(by='Date')

I could apply that logic to every match, but this would mean running those functions once per match, which is very inefficient: each call scans the whole DataFrame, so the total cost grows quadratically with the number of matches.
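One possible way to avoid the per-match rescans is to reshape the data to one row per (match, player) and use grouped cumulative operations. This is only a sketch: the intermediate names per_player, slot and won are my own, and it assumes that matches are strictly ordered by Date and that the first four rows of the input file are representative.

```python
import pandas as pd

df = pd.DataFrame({
    "Match_idx": [0, 1, 2, 3],
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
    "Player_1": ["p1", "p2", "p3", "p4"],
    "Player_2": ["p2", "p3", "p1", "p1"],
    "Player_1_wins": [1, 0, 1, 1],
})

# Long format: one row per (match, player), with that player's own result.
base = df[["Match_idx", "Date"]]
per_player = pd.concat(
    [
        base.assign(slot=1, player=df["Player_1"], won=df["Player_1_wins"]),
        base.assign(slot=2, player=df["Player_2"], won=1 - df["Player_1_wins"]),
    ],
    ignore_index=True,
).sort_values(["Date", "Match_idx"])

# Matches played before the current one: position of the row in the player's history.
per_player["matches"] = per_player.groupby("player").cumcount()
# Wins before the current one: running total minus the current result.
per_player["wins"] = per_player.groupby("player")["won"].cumsum() - per_player["won"]
# 0 / 0 gives NaN for a player's first match; the sample output uses 0 there.
per_player["winrate"] = (per_player["wins"] / per_player["matches"]).fillna(0.0)

# Back to one row per match, aligning on Match_idx.
out = df.set_index("Match_idx")
for slot in (1, 2):
    side = per_player[per_player["slot"] == slot].set_index("Match_idx")
    out[f"Player_{slot}_winrate"] = side["winrate"]
    out[f"Player_{slot}_matches"] = side["matches"]
out = out.reset_index()
print(out)
```

This covers the win-rate and match-count columns only; Head_to_head would need a grouping on the pair of players instead of on a single player.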

What I would like to do

How can I compute my features efficiently, using only the previous match of each of the two players in a given match? For example, the win rate of each player could be updated with the following logic:

  1. Initialize each column to 0.
  2. Update the win ratio as follows: (M*N + W) / (N + 1), with M the current win ratio, N the number of matches played so far and W = 1 if the player won, else 0.
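A single chronological pass along these lines might look like the sketch below. It keeps the current win ratio and match count of every player in two dictionaries and applies the standard running-mean update new_ratio = (M*N + W) / (N + 1) after each match. The function name add_running_features and the dictionary-based state are illustrative assumptions, and matches are assumed to be strictly ordered by Date.

```python
import io
from collections import defaultdict

import pandas as pd

# First four rows of the input file from the question.
CSV = """Match_idx,Date,Player_1,Player_2,Player_1_wins
0,2020-01-01,p1,p2,1
1,2020-01-02,p2,p3,0
2,2020-01-03,p3,p1,1
3,2020-01-04,p4,p1,1
"""


def add_running_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add win-rate and match-count columns in one pass (hypothetical helper)."""
    df = df.sort_values("Date").reset_index(drop=True)
    winrate = defaultdict(float)  # M: current win ratio per player, starts at 0
    played = defaultdict(int)     # N: matches played so far per player
    rows = []
    for p1, p2, p1_wins in zip(df["Player_1"], df["Player_2"], df["Player_1_wins"]):
        # Features describe the state *before* the current match.
        rows.append((winrate[p1], winrate[p2], played[p1], played[p2]))
        # Then fold the current result into each player's running mean.
        for player, won in ((p1, p1_wins), (p2, 1 - p1_wins)):
            n = played[player]
            winrate[player] = (winrate[player] * n + won) / (n + 1)
            played[player] = n + 1
    columns = ["Player_1_winrate", "Player_2_winrate",
               "Player_1_matches", "Player_2_matches"]
    features = pd.DataFrame(rows, columns=columns, index=df.index)
    return pd.concat([df, features], axis=1)


matches = pd.read_csv(io.StringIO(CSV))
result = add_running_features(matches)
print(result)
```

Each match is visited once and each update is a dictionary lookup, so the cost grows linearly with the number of matches. Storing the win and match counts separately and dividing at read time would be an equivalent choice that avoids accumulating floating-point error.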

Any help or idea to organize such a process is much appreciated.


