Sunday, 3 October 2021

CV and under sampling on a test fold

I am a bit lost on building a ML classifier with imbalanced data (80:20). The dataset has 30 columns; the target is Label. I want to predict the major class. I am trying to reproduce the following steps:

  • Split the data on train/test
  • Perform CV on trains set
  • Apply undersampling only on a test fold
  • After the model has been chosen with the help of CV, undersample the train set and train the classifier
  • Estimate the performance on the untouched test set (recall)

What I have done is shown below:

    y = df['Label']
    X = df.drop('Label',axis=1)
    X.shape, y.shape

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12)
    X_train.shape, X_test.shape

    tree = DecisionTreeClassifier(max_depth = 5)

    tree.fit(X_train, y_train)

    y_test_tree = tree.predict(X_test)
    y_train_tree = tree.predict(X_train)

    acc_train_tree = accuracy_score(y_train,y_train_tree)
    acc_test_tree = accuracy_score(y_test,y_test_tree)

I have some doubts on how to perform CV on trains set, apply under sampling on a test fold and undersample the train set and train the classifier. Are you familiar with these steps? If you are, I would appreciate your help.

If I do as follows:

y = df['Label']
X = df.drop('Label',axis=1)
X.shape, y.shape

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12)
X_train.shape, X_test.shape

tree = DecisionTreeClassifier(max_depth = 5)

tree.fit(X_train, y_train)

y_test_tree = tree.predict(X_test)
y_train_tree = tree.predict(X_train)

acc_train_tree = accuracy_score(y_train,y_train_tree)
acc_test_tree = accuracy_score(y_test,y_test_tree)
# CV
scores = cross_val_score(tree,X_train, y_train,cv = 3, scoring = "accuracy")
ypred = cross_val_predict(tree,X_train,y_train,cv = 3)

print(classification_report(y_train,ypred))
accuracy_score(y_train,ypred)
confusion_matrix(y_train,ypred)

I get this output

             precision    recall  f1-score   support

      -1       0.73      0.99      0.84       291
       1       0.00      0.00      0.00       105

accuracy                           0.73       396
macro avg       0.37      0.50      0.42       396
weighted avg       0.54      0.73      0.62       396

I guess I have missed something in the code above or doing something wrong.

Sample of data:

Have_0 Have_1 Have_2 Have_letters Label
1        0      1         1         1
0        0      0         1        -1 
1        1      1         1        -1
0        1      0         0         1
1        1      0         0         1
1        0      0         1        -1
1        0      0         0         1


from CV and under sampling on a test fold

No comments:

Post a Comment