Wednesday 22 November 2023

StackingClassifier with base-models trained on feature subsets

I can best describe my goal using a synthetic dataset. Suppose I have the following:

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=42)

X_train.head()
         A       B           C        D         E       F          G         H       I        J
541 -0.277848 1.022357 -0.950125 -2.100213  0.883638 0.821387  1.154613  0.075376  1.176242 -0.470087
440  1.089665 0.841446 -1.701004 -1.036256 -1.229357 0.345068  1.876470 -0.750067  0.080685 -1.318271
482  0.016010 0.025488 -1.189296 -1.052935 -0.623029 0.669521  1.518927  0.690019 -0.045486 -0.494186
422 -0.133358 -2.16219  1.170989 -0.942150  1.933444 -0.55118 -0.059908 -0.938672 -0.924097 -0.796185
778  0.901954 1.479360 -2.639176 -2.588845 -0.753915 -1.650621 2.727146  0.075260  1.330432 -0.941594

After conducting a feature importance analysis, the discovered that each of the 3-classes in the dataset can best be predicted using feature subset, as oppose to the whole. For example:

class  | optimal predictors
-------+-------------------
   0   |  A, B, C
   1   |  D, E, F, G
   2   |  G, H, I, J
-------+-------------------

At this point, I would like to use 3 one-ve-rest classifiers to train sub-models, one for each class and using the class's best predictors (as the base models). And then a StackingClassifier for final prediction.

I have high-level understanding of the StackingClassifier, where different base models can be trained (e.g. DT, SVC, KNN etc) and a meta classifier using another model e.g. Logistice Regression.

In this case however, the base model is one DT classifier, only that each is to be trained using feature subset best for the class, as above.

Then finally make predictions on the X_test.

But I am not sure how this can be done. So I give the description of my work using pseudo data as above.

How to design this to train the base models, and a final prediction?



from StackingClassifier with base-models trained on feature subsets

No comments:

Post a Comment