Monday 16 November 2020

How to optimize iteration over pandas dataframe when vectorization is not possible

I have a rather complex computation that I want to optimize. I would like to leverage pandas/NumPy vectorization, but I am not sure whether this task can be vectorized at all. Currently I am using a pandas DataFrame and a for loop to iterate over its rows.

Task Explanation:

The lookup table gives the mapping. For example, the old "1" is split between the new "11" and "12", with the split proportion given as a weight: 60% of "1" goes to "11" and 40% to "12". Another example: the old "2" is simply renamed to the new "20", so the split ratio is 100%. Similarly, the old "3" is split into the new "31", "32" and "33".

The actual values for "1", "2", "3" are given in the table df. In the resulting table new_df, the old codes need to be replaced with the new ones, and the values multiplied by the respective weights.

I hope this explanation suffices; the tables below should make it clearer.

  1. weights (the lookup table)

         old   new    Weight
    0    1     11     0.6
    1    1     12     0.4
    2    2     20     1
    3    3     31     0.2
    4    3     32     0.5
    5    3     33     0.3
    
  2. df (the actual data table)

         A1  A2  value
    0    1   1   50   
    1    1   2   40    
    2    1   3   30     
    3    2   1   10    
    4    2   2   20    
    5    2   3   70    
    6    3   1   80
    7    3   2   90
    

  3. new_df (the resulting DataFrame)

        A1_new  A2_new  value_new  | calculation (only for explanation)
   0    11      11      30         | 50 * 0.6
   1    12      12      20         | 50 * 0.4
   2    11      20      24         | 40 * 1 * 0.6
   3    12      20      16         | 40 * 1 * 0.4
   4    11      31      3.6        | 30 * 0.6 * 0.2
   5    11      32      9          | 30 * 0.6 * 0.5
   6    11      33      5.4        | 30 * 0.6 * 0.3
   7    12      31      2.4        | 30 * 0.4 * 0.2
   8    12      32      6          | 30 * 0.4 * 0.5
   9    12      33      3.6        | 30 * 0.4 * 0.3
   10   31      11      9.6        | 80 * 0.2 * 0.6
   11   32      11      24         | 80 * 0.5 * 0.6
   12   33      11      14.4       | 80 * 0.3 * 0.6 
   13   31      12      6.4        | 80 * 0.2 * 0.4 
   14   32      12      16         | 80 * 0.5 * 0.4 
   15   33      12      9.6        | 80 * 0.3 * 0.4 
   16   31      20      16         | 80 * 0.2 * 1
   17   32      20      40         | 80 * 0.5 * 1
   18   33      20      24         | 80 * 0.3 * 1 
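One thing the calculation above relies on is that each old code's split proportions sum to 1, so that the total value is preserved across the renaming. A quick sanity check on the lookup table (reconstructed here from the listing above):

```python
import pandas as pd

weights = pd.DataFrame({
    'old':    [1, 1, 2, 3, 3, 3],
    'new':    [11, 12, 20, 31, 32, 33],
    'Weight': [0.6, 0.4, 1.0, 0.2, 0.5, 0.3],
})

# Sum the split proportions per old code; each total should be 1.
split_totals = weights.groupby('old')['Weight'].sum()
print(split_totals)
```

For the sample data this gives 1.0 for each of the old codes 1, 2 and 3.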

Below is the code I am using right now, which does the job. However, the example above is only a sample of the data: the real tables have several thousand rows, so I need to optimize it somehow.

import pandas as pd

rows = []

for row in df.itertuples(index=False):
    # Look up each key's replacement rows in the lookup table once.
    w1 = weights[weights['old'] == row.A1]
    w2 = weights[weights['old'] == row.A2]

    if not w1.empty and not w2.empty:
        # Both A1 and A2 change: emit the cross product of the two splits.
        for r1 in w1.itertuples(index=False):
            for r2 in w2.itertuples(index=False):
                rows.append([r1.new, r2.new,
                             row.value * r1.Weight * r2.Weight])
    elif not w1.empty:
        # Only A1 changes; A2 is carried over unchanged.
        for r1 in w1.itertuples(index=False):
            rows.append([r1.new, row.A2, row.value * r1.Weight])
    elif not w2.empty:
        # Only A2 changes; A1 is carried over unchanged.
        for r2 in w2.itertuples(index=False):
            rows.append([row.A1, r2.new, row.value * r2.Weight])

new_df = pd.DataFrame(rows, columns=['A1_new', 'A2_new', 'value_new'])
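Since the loop only matches each key against the lookup table and multiplies, one direction I have been considering is to express the whole expansion as two left merges, which pandas handles without any Python-level iteration. This is only a sketch against the sample tables above (column names as shown); keys absent from the lookup keep their original code with an implicit weight of 1:

```python
import pandas as pd

weights = pd.DataFrame({
    'old':    [1, 1, 2, 3, 3, 3],
    'new':    [11, 12, 20, 31, 32, 33],
    'Weight': [0.6, 0.4, 1.0, 0.2, 0.5, 0.3],
})

df = pd.DataFrame({
    'A1':    [1, 1, 1, 2, 2, 2, 3, 3],
    'A2':    [1, 2, 3, 1, 2, 3, 1, 2],
    'value': [50, 40, 30, 10, 20, 70, 80, 90],
})

# One left merge per key column: each df row is duplicated once per
# matching split, which reproduces the nested-loop cross product.
new_df = (
    df
    .merge(weights.rename(columns={'old': 'A1', 'new': 'A1_new', 'Weight': 'w1'}),
           on='A1', how='left')
    .merge(weights.rename(columns={'old': 'A2', 'new': 'A2_new', 'Weight': 'w2'}),
           on='A2', how='left')
)

# Keys that never appear in the lookup stay unchanged with weight 1.
new_df['A1_new'] = new_df['A1_new'].fillna(new_df['A1'])
new_df['A2_new'] = new_df['A2_new'].fillna(new_df['A2'])
new_df['value_new'] = (new_df['value']
                       * new_df['w1'].fillna(1)
                       * new_df['w2'].fillna(1))

new_df = new_df[['A1_new', 'A2_new', 'value_new']]
```

Because every split's weights sum to 1, the total of value_new should equal the total of value, which gives a cheap correctness check on the real data.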


from How to optimize iteration over pandas dataframe when vectorization is not possible
