I have a data frame with IDs, and choices that have made by those IDs. Every choice is related to some city. The choices set is a list of integers: [10, 20, 30, 40, 50, 60], and the cities set is a list of strings ['XX', 'YY', 'ZZ']. Note: One or more choices might related to the same city. For example: the choices 20 and 30 are related to the city 'YY'.
This is the data-frame:
ID choice city
1 10 XX
1 10 XX
1 20 YY
1 10 XX
1 30 YY
1 40 ZZ
2 20 YY
2 50 ZZ
2 50 ZZ
2 50 ZZ
2 10 XX
3 30 YY
3 30 YY
3 60 ZZ
3 60 ZZ
3 60 ZZ
3 10 XX
This is the choice-city data-frame:
choice city
10 XX
20 YY
30 YY
40 ZZ
50 ZZ
60 ZZ
Another data-frame tells us how many choices are related to every city:
city count
XX 1
YY 2
ZZ 3
I want to create a variable for each choice: '10_Var', '20_Var', '30_Var', '40_Var', '50_Var', '60_Var'. At the first row of each ID, if the first choice is related to the city 'XX' for example, so the variable '10_Var' will get the value 0.6 / # of choices that related to this city (0.6 is some parameter), and each of the other variables that are not related to the same city will get the value (1 - 0.6) / (# of choices - # of choices that related to the city 'XX').
How should the data look like after the step above:
ID choice city 10_Var 20_Var 30_Var 40_Var 50_Var 60_Var
1 10 XX 0.80 0.04 0.04 0.04 0.04 0.04
1 10 XX NaN NaN NaN NaN NaN NaN
1 20 YY NaN NaN NaN NaN NaN NaN
1 10 XX NaN NaN NaN NaN NaN NaN
1 30 YY NaN NaN NaN NaN NaN NaN
1 40 ZZ NaN NaN NaN NaN NaN NaN
2 20 YY 0.05 0.40 0.40 0.05 0.05 0.05
2 50 ZZ NaN NaN NaN NaN NaN NaN
2 50 ZZ NaN NaN NaN NaN NaN NaN
2 50 ZZ NaN NaN NaN NaN NaN NaN
2 10 XX NaN NaN NaN NaN NaN NaN
3 30 YY 0.05 0.40 0.40 0.05 0.05 0.05
3 30 YY NaN NaN NaN NaN NaN NaN
3 60 ZZ NaN NaN NaN NaN NaN NaN
3 60 ZZ NaN NaN NaN NaN NaN NaN
3 60 ZZ NaN NaN NaN NaN NaN NaN
3 10 XX NaN NaN NaN NaN NaN NaN
From the second row and so on (for each ID), the variable '10_Var' for example, will get the value: (0.6 * Previous-value) + (1 - 0.6) * {1 if the **last** choice is related to the city 'XX', 0 otherwise} / # of choices that related to the city 'XX', and so for each variable.
Note: It should be done for each ID.
Expected result:
ID choice city 10_Var 20_Var 30_Var 40_Var 50_Var 60_Var
1 10 XX 0.800000 0.040000 0.040000 0.040000 0.040000 0.040000
1 10 XX 0.840000 0.032000 0.032000 0.032000 0.032000 0.032000
1 20 YY 0.872000 0.025600 0.025600 0.025600 0.025600 0.025600
1 10 XX 0.697600 0.120480 0.120480 0.020480 0.020480 0.020480
1 30 YY 0.758080 0.096384 0.096384 0.016384 0.016384 0.016384
1 40 ZZ 0.606464 0.177107 0.177107 0.013107 0.013107 0.013107
2 20 YY 0.050000 0.400000 0.400000 0.050000 0.050000 0.050000
2 50 ZZ 0.040000 0.420000 0.420000 0.040000 0.040000 0.040000
2 50 ZZ 0.032000 0.336000 0.336000 0.098667 0.098667 0.098667
2 50 ZZ 0.025600 0.268800 0.268800 0.145600 0.145600 0.145600
2 10 XX 0.020480 0.215040 0.215040 0.183147 0.183147 0.183147
3 30 YY 0.050000 0.400000 0.400000 0.050000 0.050000 0.050000
3 30 YY 0.040000 0.420000 0.420000 0.040000 0.040000 0.040000
3 60 ZZ 0.032000 0.436000 0.436000 0.032000 0.032000 0.032000
3 60 ZZ 0.025600 0.348800 0.348800 0.092267 0.092267 0.092267
3 60 ZZ 0.020480 0.279040 0.279040 0.140480 0.140480 0.140480
3 10 XX 0.016384 0.223232 0.223232 0.179051 0.179051 0.179051
This question may help: Creating "exponential smoothing" variables - Pandas
from How to create “exponential smoothing” variables (hard one)
No comments:
Post a Comment