I have a rather complex computation that I want to optimize. I would like to leverage pandas/NumPy vectorization, but I am not sure whether this task can be vectorized at all. Currently I am using a pandas DataFrame and a for loop to iterate over it.
Task Explanation:
The lookup table gives the mapping. For example, old "1" is split between new "11" and "12", with the split proportion given as a weight: 60% of "1" goes to "11" and 40% goes to "12". Another example: old "2" was simply renamed to new "20", so the split ratio there is 100%. Similarly, old "3" is split into new "31", "32", "33".
The actual values for "1", "2", "3" are given in the table df. In the resulting table new_df, the old values need to be replaced with the new ones by multiplying by the respective weights/ratios.
I hope this explanation suffices; the tables below will help more.
- weights (lookup table)

     old  new  Weight
  0    1   11     0.6
  1    1   12     0.4
  2    2   20     1.0
  3    3   31     0.2
  4    3   32     0.5
  5    3   33     0.3
- df (actual data table / matrix)

     A1  A2  value
  0   1   1     50
  1   1   2     40
  2   1   3     30
  3   2   1     10
  4   2   2     20
  5   2   3     70
  6   3   1     80
  7   3   2     90
- new_df (resulting DataFrame)

   A1_new  A2_new  value_new | calculation (only for explanation)
0 11 11 30 | 50 * 0.6
1 12 12 20 | 50 * 0.4
2 11 20 24 | 40 * 1 * 0.6
3 12 20 16 | 40 * 1 * 0.4
4 11 31 3.6 | 30 * 0.6 * 0.2
5 11 32 9 | 30 * 0.6 * 0.5
6 11 33 5.4 | 30 * 0.6 * 0.3
7 12 31 2.4 | 30 * 0.4 * 0.2
8 12 32 6 | 30 * 0.4 * 0.5
9 12 33 3.6 | 30 * 0.4 * 0.3
10 31 11 9.6 | 80 * 0.2 * 0.6
11 32 11 24 | 80 * 0.5 * 0.6
12 33 11 14.4 | 80 * 0.3 * 0.6
13 31 12 6.4 | 80 * 0.2 * 0.4
14 32 12 16 | 80 * 0.5 * 0.4
15 33 12 9.6 | 80 * 0.3 * 0.4
16 31 20 16 | 80 * 0.2 * 1
17 32 20 40 | 80 * 0.5 * 1
18 33 20 24 | 80 * 0.3 * 1
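For reference, the two input tables above can be reconstructed in pandas as follows (a sketch for reproducibility; column names and values are taken directly from the tables):

```python
import pandas as pd

# Lookup table: each old code maps to one or more new codes with a split weight.
weights = pd.DataFrame({
    'old':    [1, 1, 2, 3, 3, 3],
    'new':    [11, 12, 20, 31, 32, 33],
    'Weight': [0.6, 0.4, 1.0, 0.2, 0.5, 0.3],
})

# Actual data table: each row carries two codes (A1, A2) and a value to split.
df = pd.DataFrame({
    'A1':    [1, 1, 1, 2, 2, 2, 3, 3],
    'A2':    [1, 2, 3, 1, 2, 3, 1, 2],
    'value': [50, 40, 30, 10, 20, 70, 80, 90],
})
```

Note that the weights for each old code sum to 1, so the total of the value column is preserved by the split.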
Below is the code I am currently using, which does the job. However, the example above is only a sample of the data: the real table has several thousand rows, so I need to optimize this somehow.
import pandas as pd

l = []
for i in range(len(df)):
    ## check if both A1 and A2 have changes in df.
    if weights['old'].isin([df.loc[i, 'A1']]).any() and weights['old'].isin([df.loc[i, 'A2']]).any():
        for j in range(len(weights[weights['old'] == df.loc[i, 'A1']].values)):
            for k in range(len(weights[weights['old'] == df.loc[i, 'A2']].values)):
                n = [None] * 3
                n[0] = weights[weights['old'] == df.loc[i, 'A1']].values[j][1]
                n[1] = weights[weights['old'] == df.loc[i, 'A2']].values[k][1]
                n[2] = (df.loc[i, 'value']
                        * weights[weights['old'] == df.loc[i, 'A1']].values[j][2]
                        * weights[weights['old'] == df.loc[i, 'A2']].values[k][2])
                l.append(n)
    ## check if only A1 has any changes.
    elif weights['old'].isin([df.loc[i, 'A1']]).any():
        for j in range(len(weights[weights['old'] == df.loc[i, 'A1']].values)):
            n = [None] * 3
            n[0] = weights[weights['old'] == df.loc[i, 'A1']].values[j][1]
            n[1] = df.loc[i, 'A2']
            n[2] = df.loc[i, 'value'] * weights[weights['old'] == df.loc[i, 'A1']].values[j][2]
            l.append(n)
    ## check if only A2 has any changes.
    elif weights['old'].isin([df.loc[i, 'A2']]).any():
        for j in range(len(weights[weights['old'] == df.loc[i, 'A2']].values)):
            n = [None] * 3
            n[0] = df.loc[i, 'A1']
            n[1] = weights[weights['old'] == df.loc[i, 'A2']].values[j][1]
            n[2] = df.loc[i, 'value'] * weights[weights['old'] == df.loc[i, 'A2']].values[j][2]
            l.append(n)

cols = ['A1_new', 'A2_new', 'value_new']
new_df = pd.DataFrame(l, columns=cols)
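If the desired output is the full cross-product of the A1 and A2 splits (which is what the nested loops above produce when both codes appear in the lookup), the whole computation can be vectorized with two left merges against the lookup table. This is a sketch under that assumption, also assuming a code missing from weights should keep its old value with an implicit weight of 1:

```python
import pandas as pd

weights = pd.DataFrame({
    'old':    [1, 1, 2, 3, 3, 3],
    'new':    [11, 12, 20, 31, 32, 33],
    'Weight': [0.6, 0.4, 1.0, 0.2, 0.5, 0.3],
})
df = pd.DataFrame({
    'A1':    [1, 1, 1, 2, 2, 2, 3, 3],
    'A2':    [1, 2, 3, 1, 2, 3, 1, 2],
    'value': [50, 40, 30, 10, 20, 70, 80, 90],
})

# One renamed copy of the lookup per key column, so each merge
# attaches its own new-code and weight columns.
w1 = weights.rename(columns={'old': 'A1', 'new': 'A1_new', 'Weight': 'w1'})
w2 = weights.rename(columns={'old': 'A2', 'new': 'A2_new', 'Weight': 'w2'})

# A left merge replicates each df row once per matching split,
# producing the cross-product of the A1 and A2 splits in one pass.
out = df.merge(w1, on='A1', how='left').merge(w2, on='A2', how='left')

# Codes absent from the lookup keep their old code with weight 1.
out['A1_new'] = out['A1_new'].fillna(out['A1'])
out['A2_new'] = out['A2_new'].fillna(out['A2'])
out['value_new'] = out['value'] * out['w1'].fillna(1) * out['w2'].fillna(1)

new_df = out[['A1_new', 'A2_new', 'value_new']]
```

Because each old code's weights sum to 1, the total of value_new equals the total of value, which is a quick sanity check. If the same (A1_new, A2_new) pair can arise more than once and should be aggregated, a final groupby(['A1_new', 'A2_new'], as_index=False)['value_new'].sum() collapses the duplicates.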
from How to optimize iteration over pandas dataframe when vectorization is not possible