Wednesday, 2 September 2020

Optimise a step function in Pandas using data

A colleague recommends that I use scipy.optimize.linprog to optimise my algorithm. I'm not sure it applies, though, because what I have is a dataframe linkage problem rather than an explicit function. I'll explain.

I have two dataframes (df_1 & df_2) and some variables (A,B,C).

My algorithm uses A, B, C as weights to calculate a score (S) for each row in df_2: S = A*X + B*Y + C*Z.
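As a minimal sketch of the scoring step, assuming the score is the weighted sum used in the code further down (S = A*X + B*Y + C*Z), the snippet below scores every row in one vectorised expression. The values chosen for A, B and C are illustrative only.

```python
import pandas as pd

# Feature columns of df_2 from the question
df_2 = pd.DataFrame({'X': [2, 4, 6, 4, 8, 7, 3, 1, 9],
                     'Y': [3, 6, 1, 4, 6, 5, 1, 0, 7],
                     'Z': [2, 4, 8, 3, 5, 4, 7, 5, 1]})

# Illustrative weights (assumed values, not the result of any optimisation)
A, B, C = 1.0, 0.5, 2.0

# Weighted sum computed for all rows at once
df_2['S'] = df_2['X'] * A + df_2['Y'] * B + df_2['Z'] * C
print(df_2)
```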

For each value of column 'O', my algorithm then finds the row in df_2 with the highest score (S). Which row comes out on top changes as A, B, C are varied.
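The selection of the top row can be sketched as follows, assuming the same tie-breaking as the sort in the code further down (highest S, then lowest X, then lowest M). The helper name top_m_per_o is hypothetical and only used for illustration.

```python
import pandas as pd

df_2 = pd.DataFrame({'O': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                     'M': [9, 2, 4, 6, 7, 8, 5, 3, 4],
                     'X': [2, 4, 6, 4, 8, 7, 3, 1, 9],
                     'Y': [3, 6, 1, 4, 6, 5, 1, 0, 7],
                     'Z': [2, 4, 8, 3, 5, 4, 7, 5, 1]})


def top_m_per_o(df, A, B, C):
    """Return the 'M' value of the top-scoring row for each 'O' (hypothetical helper)."""
    # Score a copy so the caller's dataframe is left untouched
    scored = df.assign(S=df['X'] * A + df['Y'] * B + df['Z'] * C)
    # Highest S first; ties broken by lowest X, then lowest M
    ranked = scored.sort_values(['S', 'X', 'M'], ascending=[False, True, True])
    # The first row kept for each 'O' is its top-scoring row
    return ranked.drop_duplicates('O').set_index('O')['M'].sort_index()


# Illustrative weights only
print(top_m_per_o(df_2, A=1.0, B=0.5, C=2.0))
```

Using drop_duplicates on the sorted frame does the same job as the index.duplicated(keep='first') mask, without needing to set the index first.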

For each value of column 'O' in df_1, I want the top-scoring row in df_2 to have the same value of column 'M' as the df_1 row. To achieve that, I want to maximise p_hat, the fraction of 'O' values for which the two 'M' values match.
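To make the objective concrete, here is a sketch that wraps the whole calculation in one function of the weights. It assumes p_hat is the share of 'O' values whose top-scoring 'M' equals the 'M' in df_1, which is what df_Re.sum()/len(df_Re.index) computes in the code further down. The function name and signature are hypothetical.

```python
import pandas as pd

df_1 = pd.DataFrame({'O': [1, 2, 3],
                     'M': [2, 8, 3]})
df_2 = pd.DataFrame({'O': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                     'M': [9, 2, 4, 6, 7, 8, 5, 3, 4],
                     'X': [2, 4, 6, 4, 8, 7, 3, 1, 9],
                     'Y': [3, 6, 1, 4, 6, 5, 1, 0, 7],
                     'Z': [2, 4, 8, 3, 5, 4, 7, 5, 1]})


def p_hat(weights, target, candidates):
    """Fraction of 'O' values whose top-scoring 'M' matches the target 'M' (hypothetical helper)."""
    A, B, C = weights
    scored = candidates.assign(
        S=candidates['X'] * A + candidates['Y'] * B + candidates['Z'] * C)
    # Highest S first; ties broken by lowest X, then lowest M
    ranked = scored.sort_values(['S', 'X', 'M'], ascending=[False, True, True])
    top = ranked.drop_duplicates('O').set_index('O')['M']
    wanted = target.set_index('O')['M']
    # Align both series on 'O' and take the share of matches
    return float((top.reindex(wanted.index) == wanted).mean())


# Two illustrative weight vectors
print(p_hat((1.0, 0.5, 2.0), df_1, df_2))
print(p_hat((0.0, 1.0, 0.0), df_1, df_2))
```

Because only the ranking of S within each 'O' matters, multiplying A, B and C by the same positive constant leaves p_hat unchanged, so any maximiser is one point on a flat plateau rather than a unique optimum.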

I want to find the values of A,B,C that give the maximum p_hat.

I can vary A, B or C by hand to see which values give the maximum p_hat, but I would like to use an optimisation algorithm to make sure that I find the true maximum.

By varying A, B, C, how can I get the maximum p_hat, and what optimisation code can I use to do this? Does scipy.optimize.linprog work for this type of problem?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df_1 = pd.DataFrame({'O' : [1,2,3],
                     'M' : [2,8,3]})

df_2 = pd.DataFrame({'O' : [1,1,1,
                            2,2,2,
                            3,3,3],
                     'M' : [9,2,4,
                            6,7,8,
                            5,3,4],
                     'X' : [2,4,6,
                            4,8,7,
                            3,1,9],
                     'Y' : [3,6,1,
                            4,6,5,
                            1,0,7],
                     'Z' : [2,4,8,
                            3,5,4,
                            7,5,1]})

df_1 = df_1.set_index('O')
M_G = df_1.M

# Variables
# Feature 1
F_min = -10
F_max = 10
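F_num = 301  # number of grid points, i.e. a step of (F_max - F_min) / 300
# linspace always includes F_max exactly once; arange with a float step may overshoot it
L_1 = np.linspace(F_min, F_max, F_num).tolist()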

# Feature 2
L_2 = [0.0, 0.5, 1.0]

# Results
di_Re = {}

for F_2 in L_2:
    # Results
    df_Re = pd.DataFrame(data={'O':df_1.index})
    df_Re = df_Re.set_index('O')

    for F_1 in L_1:
        A = F_1
        B = F_2
        C = 2
        
        # Score
        df_2['S'] = df_2['X']*A + df_2['Y']*B + df_2['Z']*C
        
        # Top score
        Ma_To = df_2.sort_values(['S', 'X', 'M'], ascending=[False, True, True])
        Ma_Or = Ma_To.set_index('O')
        M_Top = Ma_Or[~Ma_Or.index.duplicated(keep='first')].M
        
        # Compare the top scoring row for each O to df_1
        M_Top = M_Top.sort_index()
        M_G = M_G.sort_index()
        R_G = M_G.reindex(M_Top.index)
        T_N_T = M_Top == R_G
    
        # Record the results
        df_R_ = pd.DataFrame({'T_N_T':T_N_T})
        df_R_.columns = [F_1]
        df_Re = pd.concat([df_Re, df_R_], axis=1)
    
    # p hat
    df_Re.loc['p_hat'] = df_Re.sum()/len(df_Re.index)
    df_RT = df_Re.T
    
    di_Re[F_2] = df_RT

# Plot
ax = plt.gca()

# Line style
style = {
  L_2[0]: ':',
  L_2[1]: '--',
  L_2[2]: '-'}

for k in di_Re:
    di_Re[k].plot(ax=ax, y='p_hat', label=round(k, 2), linewidth=4, style=style[k], c='k')

plt.xlabel('A')
plt.ylabel('p\u0302')

ax.legend(title='B',
          loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))

font = {'weight' : 'normal',
        'size'   : 30}
plt.rc('font', **font)

fig = plt.gcf()
fig.set_size_inches(6,6)
plt.xlim([F_min,F_max])
Tick_step = (F_max - F_min) / 2
plt.xticks(np.arange(F_min,F_max+Tick_step,Tick_step))

plt.minorticks_on()
# The 'b' keyword was removed in newer Matplotlib releases; pass the flag positionally
plt.grid(True, which='major', color='b')
plt.grid(True, which='minor', color='b')

plt.show()
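On the linprog question: scipy.optimize.linprog solves linear programs, where the objective and the constraints are linear in the unknowns. p_hat is instead a piecewise-constant (step) function of A, B, C, so it does not look like a direct fit, and gradient-based solvers have no slope to follow either. One hedged alternative is an exhaustive grid search with scipy.optimize.brute, sketched below. The search bounds of -10 to 10 for all three weights and the 11 grid points per axis are assumptions, and neg_p_hat is a hypothetical helper that returns the negative of p_hat because brute minimises.

```python
import pandas as pd
from scipy.optimize import brute

df_1 = pd.DataFrame({'O': [1, 2, 3],
                     'M': [2, 8, 3]})
df_2 = pd.DataFrame({'O': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                     'M': [9, 2, 4, 6, 7, 8, 5, 3, 4],
                     'X': [2, 4, 6, 4, 8, 7, 3, 1, 9],
                     'Y': [3, 6, 1, 4, 6, 5, 1, 0, 7],
                     'Z': [2, 4, 8, 3, 5, 4, 7, 5, 1]})

# 'M' value wanted for each 'O'
wanted = df_1.set_index('O')['M']


def neg_p_hat(weights):
    """Negative p_hat for weights (A, B, C); hypothetical helper for a minimiser."""
    A, B, C = weights
    scored = df_2.assign(S=df_2['X'] * A + df_2['Y'] * B + df_2['Z'] * C)
    # Highest S first; ties broken by lowest X, then lowest M
    ranked = scored.sort_values(['S', 'X', 'M'], ascending=[False, True, True])
    top = ranked.drop_duplicates('O').set_index('O')['M']
    return -float((top.reindex(wanted.index) == wanted).mean())


# Assumed search box: each weight between -10 and 10, 11 grid points per axis
bounds = ((-10, 10),) * 3

# finish=None skips the local polishing step, which cannot help on a step function
best_weights, best_value, _, _ = brute(
    neg_p_hat, bounds, Ns=11, full_output=True, finish=None)

print('A, B, C =', best_weights)
print('p_hat   =', -best_value)
```

A grid search only guarantees the best value on the grid, so a finer grid may be needed for other data. For more weights or wider bounds, a stochastic global optimiser such as scipy.optimize.differential_evolution may scale better, at the cost of run-to-run variation unless a seed is fixed.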

