I have a dataset that contains an id, a datetime, model features, ground-truth labels, and the predicted probability.
id datetime feature1 feature2 feature3 ... label probability
001 2023-01-01 a1 b3 c1 ... Rejected 0.98
002 2023-01-04 a2 b1 c1 ... Approved 0.28
003 2023-01-04 a1 b2 c1 ... Rejected 0.81
004 2023-01-08 a2 b3 c2 ... Rejected 0.97
005 2023-01-09 a2 b1 c1 ... Approved 0.06
006 2023-01-09 a2 b2 c2 ... Approved 0.06
007 2023-01-10 a1 b1 c2 ... Approved 0.13
008 2023-01-11 a2 b2 c1 ... Approved 0.18
009 2023-01-12 a2 b1 c1 ... Approved 0.16
010 2023-01-12 a1 b1 c2 ... Rejected 0.96
011 2023-01-09 a2 b3 c2 ... Approved 0.16
...
I want to know the AUC of each segment, i.e. of each value of each feature. How can I manipulate the dataset to get these results?
What I have done so far is use the groupby method on the date to get the monthly AUC for all features together.
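from sklearn import metrics

def group_auc(x, col_tar, col_scr):
    # AUC of the scores in col_scr against the binary target in col_tar
    return metrics.roc_auc_score(x[col_tar], x[col_scr])

def map_y(x):
    # Encode the label as 1 (Rejected) or 0 (Approved); leave anything else unchanged
    if x == 'Rejected':
        return 1
    elif x == 'Approved':
        return 0
    return x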
## example
y_name = 'label'
df[y_name] = df[y_name].apply(map_y)
# Remove NA rows
df = df.dropna(subset = [y_name])
df['Month_Year'] = df['datetime'].dt.to_period('M')
group_data_monthly = df.groupby('Month_Year').apply(group_auc, y_name, 'probability').reset_index().rename(columns={0:'AUC'})
My expected output would look like this:
datetime features value AUC
2023-01-01 feature1 a1 0.98
2023-01-01 feature1 a2 ...
2023-01-01 feature1 a3 ...
2023-01-01 feature2 b1 ...
2023-01-01 feature2 b2 ...
2023-01-01 feature2 b3 ...
2023-01-01 feature3 c1 ...
2023-01-01 feature3 c2 ...
2023-01-04 feature1 a1 ...
2023-01-04 feature1 a2 ...
2023-01-04 feature1 a3 ...
2023-01-04 feature2 b1 ...
...
I have also tried using the stack method to reshape the dataframe into long format, but the script failed because the dataframe is too large.
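To make the goal concrete, here is a minimal sketch of the kind of result I am after, looping over the feature columns one at a time instead of stacking the whole dataframe. The names safe_auc and auc_by_segment are hypothetical helpers made up for illustration, and it assumes the label is already mapped to 0/1 and that a period column such as Month_Year exists. Segments that contain only one class get NaN, since AUC is undefined there. I have not checked whether this is fast enough on the full data.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score


def safe_auc(group, col_tar, col_scr):
    # Hypothetical helper: AUC is undefined when a segment holds a single class,
    # so return NaN instead of letting roc_auc_score raise.
    if group[col_tar].nunique() < 2:
        return float("nan")
    return roc_auc_score(group[col_tar], group[col_scr])


def auc_by_segment(df, feature_cols, period_col, col_tar, col_scr):
    # Hypothetical helper: one small result frame per feature, concatenated at
    # the end, so the full dataframe is never reshaped into long format.
    parts = []
    for feat in feature_cols:
        rows = []
        for (period, value), group in df.groupby([period_col, feat]):
            rows.append({
                period_col: period,
                "features": feat,
                "value": value,
                "AUC": safe_auc(group, col_tar, col_scr),
            })
        parts.append(pd.DataFrame(rows))
    return pd.concat(parts, ignore_index=True)
```

It would be called as auc_by_segment(df, ['feature1', 'feature2', 'feature3'], 'Month_Year', 'label', 'probability'), giving one row per period, feature, and feature value.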
from Calculate AUC by different segments in python