Hemant Vishwakarma: CV and under sampling on a test fold

Sunday, 3 October 2021

I am a bit lost on building an ML classifier with imbalanced data (80:20). The dataset has 30 columns; the target is Label. I want to predict the majority class. I am trying to reproduce the following steps:

1. Split the data into train and test sets.
2. Perform CV on the training set.
3. Apply undersampling only on a test fold.
4. After the model has been chosen with the help of CV, undersample the training set and train the classifier.
5. Estimate the performance (recall) on the untouched test set.
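
The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the data is synthetic (generated with `make_classification` at roughly an 80:20 ratio), `undersample` is a hypothetical helper written only for this illustration, and the sketch assumes that inside each CV split the resampling is applied to the part the model is fitted on while the held-out fold and the final test set stay untouched. If step 3 is meant literally, the same helper could be applied to the held-out indices instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier


def undersample(X, y, rng):
    """Hypothetical helper: randomly keep as many rows of each class
    as there are rows in the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]


# Synthetic stand-in for the real data (assumption): 30 features, ~80:20 classes.
X, y = make_classification(n_samples=500, n_features=30,
                           weights=[0.8, 0.2], random_state=12)
y = np.where(y == 0, -1, 1)  # -1 is the majority class, 1 the minority class

rng = np.random.default_rng(12)

# Step 1: hold out a stratified test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12)

# Steps 2-3: manual CV loop, so that resampling happens inside each split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=12)
fold_recalls = []
for fit_idx, val_idx in cv.split(X_train, y_train):
    # Assumption: only the fitting part of the split is undersampled.
    X_fit, y_fit = undersample(X_train[fit_idx], y_train[fit_idx], rng)
    model = DecisionTreeClassifier(max_depth=5, random_state=12)
    model.fit(X_fit, y_fit)
    y_val_pred = model.predict(X_train[val_idx])
    # Recall of the majority class (-1) on the untouched held-out fold.
    fold_recalls.append(recall_score(y_train[val_idx], y_val_pred, pos_label=-1))

# Step 4: undersample the whole training set and refit the chosen model.
X_bal, y_bal = undersample(X_train, y_train, rng)
final_model = DecisionTreeClassifier(max_depth=5, random_state=12).fit(X_bal, y_bal)

# Step 5: recall on the untouched test set.
test_recall = recall_score(y_test, final_model.predict(X_test), pos_label=-1)
print("CV recall per fold:", np.round(fold_recalls, 3))
print("Test recall:", round(test_recall, 3))
```

The loop is written by hand rather than with `cross_val_score` because the latter fits the estimator on each split as-is, which leaves no place to resample only one part of the split.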

What I have done is shown below:

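    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # df is the full DataFrame; 'Label' is the target column
    y = df['Label']
    X = df.drop('Label', axis=1)
    X.shape, y.shape

    # Hold out 20% of the rows as the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
    X_train.shape, X_test.shape

    tree = DecisionTreeClassifier(max_depth=5)
    tree.fit(X_train, y_train)

    y_test_tree = tree.predict(X_test)
    y_train_tree = tree.predict(X_train)

    # Accuracy on the training and test sets
    acc_train_tree = accuracy_score(y_train, y_train_tree)
    acc_test_tree = accuracy_score(y_test, y_test_tree)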

I am unsure how to perform CV on the training set, apply undersampling on a test fold, and then undersample the training set and train the classifier. Are you familiar with these steps? If so, I would appreciate your help.

If I do the following:

    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    y = df['Label']
    X = df.drop('Label', axis=1)
    X.shape, y.shape

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
    X_train.shape, X_test.shape

    tree = DecisionTreeClassifier(max_depth=5)
    tree.fit(X_train, y_train)

    y_test_tree = tree.predict(X_test)
    y_train_tree = tree.predict(X_train)

    acc_train_tree = accuracy_score(y_train, y_train_tree)
    acc_test_tree = accuracy_score(y_test, y_test_tree)

    # CV on the training set (3 folds, no resampling)
    scores = cross_val_score(tree, X_train, y_train, cv=3, scoring="accuracy")
    ypred = cross_val_predict(tree, X_train, y_train, cv=3)

    print(classification_report(y_train, ypred))
    accuracy_score(y_train, ypred)
    confusion_matrix(y_train, ypred)

I get this output:

                  precision    recall  f1-score   support

              -1       0.73      0.99      0.84       291
               1       0.00      0.00      0.00       105

        accuracy                           0.73       396
       macro avg       0.37      0.50      0.42       396
    weighted avg       0.54      0.73      0.62       396

I guess I have missed something in the code above or am doing something wrong.
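
One way to read the report: the support counts (291 and 105) imply that a model which always predicts -1 would already reach about 0.73 accuracy with zero recall on class 1, which is close to the figures above. The arithmetic can be checked with a short sketch. Only the two support counts come from the report; the constant predictor is a hypothetical baseline used for comparison, not something the code above is known to be doing.

```python
# Support counts taken from the classification report above.
n_major, n_minor = 291, 105
total = n_major + n_minor

# Hypothetical baseline: always predict the majority class (-1).
accuracy = n_major / total          # every -1 row is right, every 1 row is wrong
recall_major = n_major / n_major    # all majority rows are recovered
recall_minor = 0 / n_minor          # no minority row is ever predicted
macro_recall = (recall_major + recall_minor) / 2

print(f"accuracy     = {accuracy:.2f}")
print(f"recall (-1)  = {recall_major:.2f}")
print(f"recall (1)   = {recall_minor:.2f}")
print(f"macro recall = {macro_recall:.2f}")
```

The baseline comes out at about 0.73 accuracy and 0.50 macro recall, so the CV result above is roughly what a majority-only predictor would give on these class counts.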

Sample of data:

    Have_0  Have_1  Have_2  Have_letters  Label
    1       0       1       1              1
    0       0       0       1             -1
    1       1       1       1             -1
    0       1       0       0              1
    1       1       0       0              1
    1       0       0       1             -1
    1       0       0       0              1
