Monday, 18 November 2019

How to create “exponential smoothing” variables (hard one)

I have a data frame with IDs, and choices that have made by those IDs. Every choice is related to some city. The choices set is a list of integers: [10, 20, 30, 40, 50, 60], and the cities set is a list of strings ['XX', 'YY', 'ZZ']. Note: One or more choices might related to the same city. For example: the choices 20 and 30 are related to the city 'YY'.

This is the data-frame:

 ID  choice city
  1      10   XX
  1      10   XX
  1      20   YY
  1      10   XX
  1      30   YY
  1      40   ZZ
  2      20   YY
  2      50   ZZ
  2      50   ZZ
  2      50   ZZ
  2      10   XX
  3      30   YY
  3      30   YY
  3      60   ZZ
  3      60   ZZ
  3      60   ZZ
  3      10   XX

This is the choice-city data-frame:

 choice city
     10   XX
     20   YY
     30   YY
     40   ZZ
     50   ZZ
     60   ZZ

Another data-frame tells us how many choices are related to every city:

city  count
  XX      1
  YY      2
  ZZ      3

I want to create a variable for each choice: '10_Var', '20_Var', '30_Var', '40_Var', '50_Var', '60_Var'. At the first row of each ID, if the first choice is related to the city 'XX' for example, so the variable '10_Var' will get the value 0.6 / # of choices that related to this city (0.6 is some parameter), and each of the other variables that are not related to the same city will get the value (1 - 0.6) / (# of choices - # of choices that related to the city 'XX').

How should the data look like after the step above:

 ID  choice city  10_Var  20_Var  30_Var  40_Var  50_Var  60_Var
  1      10   XX    0.80    0.04    0.04    0.04    0.04    0.04
  1      10   XX     NaN     NaN     NaN     NaN     NaN     NaN
  1      20   YY     NaN     NaN     NaN     NaN     NaN     NaN
  1      10   XX     NaN     NaN     NaN     NaN     NaN     NaN
  1      30   YY     NaN     NaN     NaN     NaN     NaN     NaN
  1      40   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  2      20   YY    0.05    0.40    0.40    0.05    0.05    0.05
  2      50   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  2      50   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  2      50   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  2      10   XX     NaN     NaN     NaN     NaN     NaN     NaN
  3      30   YY    0.05    0.40    0.40    0.05    0.05    0.05
  3      30   YY     NaN     NaN     NaN     NaN     NaN     NaN
  3      60   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  3      60   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  3      60   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
  3      10   XX     NaN     NaN     NaN     NaN     NaN     NaN

From the second row and so on (for each ID), the variable '10_Var' for example, will get the value: (0.6 * Previous-value) + (1 - 0.6) * {1 if the **last** choice is related to the city 'XX', 0 otherwise} / # of choices that related to the city 'XX', and so for each variable.

Note: It should be done for each ID.

Expected result:

 ID  choice city    10_Var    20_Var    30_Var    40_Var    50_Var    60_Var
  1      10   XX  0.800000  0.040000  0.040000  0.040000  0.040000  0.040000
  1      10   XX  0.840000  0.032000  0.032000  0.032000  0.032000  0.032000
  1      20   YY  0.872000  0.025600  0.025600  0.025600  0.025600  0.025600
  1      10   XX  0.697600  0.120480  0.120480  0.020480  0.020480  0.020480
  1      30   YY  0.758080  0.096384  0.096384  0.016384  0.016384  0.016384
  1      40   ZZ  0.606464  0.177107  0.177107  0.013107  0.013107  0.013107
  2      20   YY  0.050000  0.400000  0.400000  0.050000  0.050000  0.050000
  2      50   ZZ  0.040000  0.420000  0.420000  0.040000  0.040000  0.040000
  2      50   ZZ  0.032000  0.336000  0.336000  0.098667  0.098667  0.098667
  2      50   ZZ  0.025600  0.268800  0.268800  0.145600  0.145600  0.145600
  2      10   XX  0.020480  0.215040  0.215040  0.183147  0.183147  0.183147
  3      30   YY  0.050000  0.400000  0.400000  0.050000  0.050000  0.050000
  3      30   YY  0.040000  0.420000  0.420000  0.040000  0.040000  0.040000
  3      60   ZZ  0.032000  0.436000  0.436000  0.032000  0.032000  0.032000
  3      60   ZZ  0.025600  0.348800  0.348800  0.092267  0.092267  0.092267
  3      60   ZZ  0.020480  0.279040  0.279040  0.140480  0.140480  0.140480
  3      10   XX  0.016384  0.223232  0.223232  0.179051  0.179051  0.179051

This question may help: Creating "exponential smoothing" variables - Pandas



from How to create “exponential smoothing” variables (hard one)

No comments:

Post a Comment