Monday, 16 November 2020

How to optimize iteration over pandas dataframe when vectorization is not possible

I have a rather complex computation to make which I want to optimize. I would leverage pandas/numpy vectorization but not sure if it possible to solve this task with vectorization. Currently I am using pandas dataframe but using for loop to iterate over dataframe.

Task Explanation:

The lookup table gives the mapping. For example the old "1" is split between new "11" and "12" with the split proportion given as weight. 60% of "1" was passed to "11" and 40% was passed to "12". Another example is where old "2" was renamed to new "20". So the split ratio here is 100%. Similarly for old "3" which is split into new "31", "32", "33".

The actual values for "1", "2", "3" are given in table df. In the resulting table new_df the old values need to be replaced with new by multiplying the respective weights/ratio.

I hope this explanation will suffice. The tables below will help more.

  1. weights (it is a lookup table)

         old   new    Weight
    0    1     11     0.6
    1    1     12     0.4
    2    2     20     1
    3    3     31     0.2
    4    3     32     0.5
    5    3     33     0.3
    
  2. df ( actual data table/ matrix)

         A1  A2  value
    0    1   1   50   
    1    1   2   40    
    2    1   3   30     
    3    2   1   10    
    4    2   2   20    
    5    2   3   70    
    6    3   1   80
    7    3   2   90
    

new_df ( resulting dataframe)

        A1_new  A2_new  value_new  | calculation (only for explanation)
   0    11      11      30         | 50 * 0.6
   1    12      12      20         | 50 * 0.4
   2    11      20      24         | 40 * 1 * 0.6
   3    12      20      16         | 40 * 1 * 0.4
   4    11      31      3.6        | 30 * 0.6 * 0.2
   5    11      32      9          | 30 * 0.6 * 0.5
   6    11      33      5.4        | 30 * 0.6 * 0.3
   7    12      31      2.4        | 30 * 0.4 * 0.2
   8    12      32      6          | 30 * 0.4 * 0.5
   9    12      33      3.6        | 30 * 0.4 * 0.3
   10   31      11      9.6        | 80 * 0.2 * 0.6
   11   32      11      24         | 80 * 0.5 * 0.6
   12   33      11      14.4       | 80 * 0.3 * 0.6 
   13   31      12      6.4        | 80 * 0.2 * 0.4 
   14   32      12      16         | 80 * 0.5 * 0.4 
   15   33      12      9.6        | 80 * 0.3 * 0.4 
   16   31      20      16         | 80 * 0.2 * 1
   17   32      20      40         | 80 * 0.5 * 1
   18   33      20      24         | 80 * 0.3 * 1 

Below is the code I am using right now which does the job. However, the example I have given is only a sample of data. There are several thousand rows so I need to optimize it somehow.

l=[]

for i in range(len(df)) :
    print(i)

    ## check if both A2 and A1 has changes in df. 
    
    if weights['old'].isin([df.loc[i,'A1']]).any().any() == True and weights['old'].isin([df.loc[i,'A2']]).any().any() == True:

            #print('BOTH CHANGED' ,df.loc[i,'A1'], df.loc[i,'A2'])   
            for j in range(len(weights[weights['old']==df.loc[i,'A1']].values)):

                    for k in range(len(weights[weights['old']==df.loc[i,'A2']].values)):
                        n=[None]*3

                        n[0]=weights[weights['old']==df.loc[i,'A1']].values[j][1]
                        n[1]=weights[weights['old']==df.loc[i,'A2']].values[k][1]    
                        n[2]=df.loc[i,'value']*weights.loc[weights['old']==df.loc[i,'A1']].values[j][2]*weights[weights['old']==df.loc[i,'A2']].values[k][2]
                    
                        l.append(n)
    
    ## check if only A1 has any changes. 
              
    
    elif weights['old'].isin([df.loc[i,'A1']]).any().any() == True:

                for j in range(len(weights[weights['old']==df.loc[i,'A1']].values)):

                    n=[None]*3

                    n[0]=weights[weights['old']==df.loc[i,'A1']].values[j][1]
                    n[1]=df.loc[i,'A2']
                    n[2]=df.loc[i,'value']*weights[weights['old']==df.loc[i,'A1']].values[j][2]
                

                    l.append(n)  
                    

        
    
    ## check if only A2 has any changes. 

    
    elif weights['old'].isin([df.loc[i,'A2']]).any().any() == True and weights['old'].isin([df.loc[i,'A1']]).any().any() == False:
   


                for j in range(len(weights[weights['old']==df.loc[i,'A2']].values)):
                    #print(j)
                    n=[None]*3
                
                    n[0]=df.loc[i,'A1']
                    n[1]=weights[weights['old']==df.loc[i,'A2']].values[j][1]
                    n[2]=df.loc[i,'value']*weights[weights['old']==df.loc[i,'A2']].values[j][2]
                

                    l.append(n)   


                    
cols =  ['A1_new','A2_new', 'value_new' ]

new_df = pd.DataFrame(l, columns=cols)


from How to optimize iteration over pandas dataframe when vectorization is not possible

No comments:

Post a Comment