Tuesday, 21 March 2023

Calculate AUC by different segments in Python

I have a dataset that contains an id, a datetime, model features, ground-truth labels, and the predicted probability.

id     datetime   feature1   feature2   feature3   ...      label      probability
001   2023-01-01   a1          b3          c1      ...     Rejected       0.98
002   2023-01-04   a2          b1          c1      ...     Approved       0.28
003   2023-01-04   a1          b2          c1      ...     Rejected       0.81
004   2023-01-08   a2          b3          c2      ...     Rejected       0.97
005   2023-01-09   a2          b1          c1      ...     Approved       0.06
006   2023-01-09   a2          b2          c2      ...     Approved       0.06
007   2023-01-10   a1          b1          c2      ...     Approved       0.13
008   2023-01-11   a2          b2          c1      ...     Approved       0.18
009   2023-01-12   a2          b1          c1      ...     Approved       0.16
010   2023-01-12   a1          b1          c2      ...     Rejected       0.96
011   2023-01-09   a2          b3          c2      ...     Approved       0.16
...

I want to compute the AUC for each segment, that is, for each distinct value of each feature. How can I manipulate the dataset to get these results?

So far I have used groupby on the date to get the monthly AUC across all features together, without splitting by feature value.

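import pandas as pd
from sklearn import metrics

def group_auc(x, col_tar, col_scr):
    # AUC of the scores in col_scr against the binary target in col_tar
    return metrics.roc_auc_score(x[col_tar], x[col_scr])

def map_y(x):
    # Encode the label as binary: Rejected -> 1, Approved -> 0
    if x == 'Rejected':
        return 1
    elif x == 'Approved':
        return 0
    return x

## example
y_name = 'label'
df[y_name] = df[y_name].apply(map_y)
# Remove rows with a missing label
df = df.dropna(subset=[y_name])
# The .dt accessor needs a datetime dtype; converting is a no-op if it already is one
df['datetime'] = pd.to_datetime(df['datetime'])
df['Month_Year'] = df['datetime'].dt.to_period('M')
group_data_monthly = (
    df.groupby('Month_Year')
      .apply(group_auc, y_name, 'probability')
      .reset_index()
      .rename(columns={0: 'AUC'})
)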

My expected output would look like this:

datetime     features   value     AUC
2023-01-01   feature1    a1      0.98
2023-01-01   feature1    a2      ...
2023-01-01   feature1    a3      ...
2023-01-01   feature2    b1      ...
2023-01-01   feature2    b2      ...
2023-01-01   feature2    b3      ...
2023-01-01   feature3    c1      ...
2023-01-01   feature3    c2      ...
2023-01-04   feature1    a1      ...
2023-01-04   feature1    a2      ...
2023-01-04   feature1    a3      ...
2023-01-04   feature2    b1      ...
...

I have also tried using the stack method to reshape the dataframe into long format, but the script failed because the dataframe is too large.
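
One possible way around the reshape, sketched here under assumptions rather than offered as a definitive solution, is to loop over the feature columns and group by (period, feature value) one feature at a time, so the wide dataframe is never stacked. The helper names safe_auc and auc_by_segment are hypothetical, the label is assumed to be already mapped to 0/1, and the small frame below just reuses the eleven sample rows from the question. Segments that contain only one class get NaN, since roc_auc_score raises a ValueError in that case.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score


def safe_auc(group, col_tar, col_scr):
    # roc_auc_score raises ValueError when only one class is present,
    # so return NaN for such segments instead of failing.
    if group[col_tar].nunique() < 2:
        return float("nan")
    return roc_auc_score(group[col_tar], group[col_scr])


def auc_by_segment(df, period_col, feature_cols, col_tar, col_scr):
    # Handle one feature at a time; the wide frame is never reshaped.
    parts = []
    for feat in feature_cols:
        sub = df[[period_col, feat, col_tar, col_scr]]
        rows = [
            {
                period_col: period,
                "features": feat,
                "value": value,
                "AUC": safe_auc(g, col_tar, col_scr),
            }
            for (period, value), g in sub.groupby([period_col, feat])
        ]
        parts.append(pd.DataFrame(rows))
    return pd.concat(parts, ignore_index=True)


# Sample rows from the question, with Rejected -> 1 and Approved -> 0.
demo = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-01-01", "2023-01-04", "2023-01-04", "2023-01-08",
        "2023-01-09", "2023-01-09", "2023-01-10", "2023-01-11",
        "2023-01-12", "2023-01-12", "2023-01-09",
    ]),
    "feature1": ["a1", "a2", "a1", "a2", "a2", "a2", "a1", "a2", "a2", "a1", "a2"],
    "feature2": ["b3", "b1", "b2", "b3", "b1", "b2", "b1", "b2", "b1", "b1", "b3"],
    "feature3": ["c1", "c1", "c1", "c2", "c1", "c2", "c2", "c1", "c1", "c2", "c2"],
    "label": [1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0],
    "probability": [0.98, 0.28, 0.81, 0.97, 0.06, 0.06, 0.13, 0.18, 0.16, 0.96, 0.16],
})
demo["Month_Year"] = demo["datetime"].dt.to_period("M")

result = auc_by_segment(
    demo, "Month_Year", ["feature1", "feature2", "feature3"], "label", "probability"
)
print(result)
```

The result has one row per (period, feature, value) combination, in the same long layout as the expected output above. Passing 'datetime' instead of 'Month_Year' as the period column would give daily segments, though many of those are likely to contain a single class and therefore come back as NaN.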


