Sunday 26 November 2023

Creating an ensemble of classifiers based on predefined feature subsets

The following MWE builds an ensemble from feature subsets selected with the SelectKBest algorithm, using a RandomForestClassifier as the base model of each pipeline.

# required imports
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import Pipeline

# ensemble created from features selected
def get_ensemble(n_features):
  # define base models
  models = []
  # one base model per k = 1..n_features
  for i in range(1, n_features + 1):
    # feature selection transform keeping the k best features
    fs = SelectKBest(score_func=f_classif, k=i)
    # create the model
    model = RandomForestClassifier(n_estimators=50)
    # chain feature selection and model into a pipeline
    pipe = Pipeline([('fs', fs), ('m', model)])
    # collect (name, pipeline) tuples for the voting ensemble
    models.append((str(i), pipe))

  # define the voting ensemble
  ensemble_clf = VotingClassifier(estimators=models, voting='hard')

  return ensemble_clf

So, to use the ensemble model:

# generate data for a 3-class classification
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=3)

X = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

X_train.head()
       A       B       C       D       E       F       G       H       I      J
541  0.1756 -0.3772 -1.6396 -0.7524  0.2138  0.3113 -1.4906 -0.2885  0.1226  0.2057
440 -0.4381 -0.3302  0.7514 -0.4684 -1.2477 -0.5081 -0.7934 -0.3138  0.8423 -0.4038
482 -0.6648  1.2337 -0.2878 -1.6737 -1.2377 -0.4479 -1.1843 -0.2424 -0.9935 -1.4537
422  0.6099  0.2475  0.9612 -0.7339  0.6926 -1.5761 -1.6061 -0.3879 -0.1895  1.3738
778 -1.4893  0.5234  1.6126  0.8704 -2.7363 -1.3818 -0.2196 -0.7894 -1.1755 -2.8779

# get the ensemble model
ensemble_classifier = get_ensemble(X_train.shape[1])

ensemble_classifier.fit(X_train, y_train)

This creates 10 base models (n_features=10) and combines them in a VotingClassifier that predicts by majority vote (voting='hard').
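As a quick sanity check (not part of the original snippet), the fitted ensemble can be scored on the held-out split:

from sklearn.metrics import accuracy_score

# hard-voting prediction on the 30% held-out split
yhat = ensemble_classifier.predict(X_test)
print(f"ensemble accuracy: {accuracy_score(y_test, yhat):.3f}")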

Question:

The MWE described above works fine. However, I would like to replace the SelectKBest feature-selection step in the get_ensemble function.

I have conducted a different feature selection process, and discovered the "optimal" feature subset for each class in this dataset as follows:

             | best predictors
-------------+-------------------
   class 0   |  A, B, C
   class 1   |  D, E, F, G
   class 2   |  G, H, I, J
-------------+-------------------

So the modification I would like to make to get_ensemble is this: instead of iterating over the number of available features and creating n base models, it should create 3 base models (one per class), where:

  • base-model 1 will be fitted using the feature subset ['A', 'B', 'C'].

  • base-model 2 will be fitted using the feature subset ['D', 'E', 'F', 'G'].

  • base-model 3 will be fitted using the feature subset ['G', 'H', 'I', 'J'].

  • finally the ensemble_classifier based on majority voting of the sub-models output.

That is, when I make the call to:

ensemble_classifier.fit(X_train, y_train)

it should proceed like so:

# 1st base model is fitted on its feature subset
model.fit(X_train[['A', 'B', 'C']], y_train)
# 2nd base model
model.fit(X_train[['D', 'E', 'F', 'G']], y_train)
# 3rd base model
model.fit(X_train[['G', 'H', 'I', 'J']], y_train)

The same should apply during prediction: each base model selects its appropriate feature subset from X_test to make its prediction in ensemble_classifier.predict(X_test), before the final voting.

I am not sure how to proceed. Any ideas?
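One idea (a sketch, not tested here): keep the VotingClassifier machinery, but replace SelectKBest in each pipeline with a ColumnTransformer that passes through only that model's predefined columns. The subsets dict below encodes the table above:

from sklearn.compose import ColumnTransformer

def get_ensemble(feature_subsets):
    # feature_subsets: dict mapping class label -> list of column names
    models = []
    for class_label, features in feature_subsets.items():
        # 'passthrough' keeps the listed columns; all others are dropped
        selector = ColumnTransformer([('sel', 'passthrough', features)])
        model = RandomForestClassifier(n_estimators=50)
        pipe = Pipeline([('fs', selector), ('m', model)])
        models.append((str(class_label), pipe))
    # majority vote over the three subset-specific pipelines
    return VotingClassifier(estimators=models, voting='hard')

subsets = {0: ['A', 'B', 'C'], 1: ['D', 'E', 'F', 'G'], 2: ['G', 'H', 'I', 'J']}
ensemble_classifier = get_ensemble(subsets)
ensemble_classifier.fit(X_train, y_train)

Note that each pipeline here is still trained on the full multi-class target; the feature subsets merely restrict each model's view of X.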

EDIT

Following up on this question, I made some changes (e.g. dropping the VotingClassifier) so that the final ensemble is trained on the outputs of the base models (the base models' confidences), and then makes the final predictions.

I created the following ensemble class:

from sklearn.base import clone

class CustomEnsemble:
    def __init__(self, base_model, best_feature_subsets):
        # one clone of the base model per class, each trained one-vs-rest
        self.base_models = {class_label: clone(base_model) for class_label in best_feature_subsets}
        self.best_feature_subsets = best_feature_subsets
        # a separate clone for the final model, fitted on base-model outputs
        self.final_model = clone(base_model)

    def train_base_models(self, X_train, y_train):
        for class_label, features in self.best_feature_subsets.items():
            model = self.base_models[class_label]
            # binary one-vs-rest target: is this sample of class_label?
            model.fit(X_train[features], (y_train == class_label))

        return self
    
    def train_final_model(self, X_train, y_train):
        """
        It would probably be better to implement the training of the base
        models and the final model together in one method, such as
        train_base_models.
        """
        predictions = pd.DataFrame()

        for class_label, model in self.base_models.items():
            predictions[class_label] = model.predict_proba(X_train[self.best_feature_subsets[class_label]])[:, 1]

        self.final_model.fit(predictions, y_train)


    def predict_base_models(self, X_test):
        predictions = pd.DataFrame()

        for class_label, model in self.base_models.items():
            predictions[class_label] = model.predict_proba(X_test[self.best_feature_subsets[class_label]])[:, 1]

        return predictions

    def predict(self, X_test):
        base_model_predictions = self.predict_base_models(X_test)
        return self.final_model.predict(base_model_predictions)

    def predict_proba_base_models(self, X_test):
        # identical to predict_base_models: base-model confidences as features
        return self.predict_base_models(X_test)

    def predict_proba(self, X_test):
        base_model_predictions = self.predict_proba_base_models(X_test)
        return self.final_model.predict_proba(base_model_predictions)

Usage:

  1. Define the dictionary of best feature subsets per class:

optimal_features = {
    0: ['A', 'B', 'C'],
    1: ['D', 'E', 'F', 'G'],
    2: ['G', 'H', 'I', 'J']
}

  2. Instantiate the class:

classifier = RandomForestClassifier()
ensemble   = CustomEnsemble(classifier, optimal_features)

  3. Train models:
# first, train base models
ensemble.train_base_models(X_train, y_train)
# then, train the ensemble
ensemble.train_final_model(X_train, y_train)
  4. Make predictions:
yhat = ensemble.predict(X_test)
yhat_proba = ensemble.predict_proba(X_test) # so as to calculate roc_auc_score() 
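As an aside, the multiclass AUC can then be computed from yhat_proba with one-vs-rest averaging (a sketch, assuming the probability columns are in sorted label order, as scikit-learn returns them):

from sklearn.metrics import roc_auc_score

# yhat_proba holds one column of final-model confidences per class
print(roc_auc_score(y_test, yhat_proba, multi_class='ovr'))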
However, it appears I am not doing things right: I am not training the ensemble on the output of the base models, but on the original input features.

Also, I am not sure whether separating train_base_models() and train_final_model() is the best approach (it implies fitting twice: base models, then the final model, as in the usage above), or whether it would be better to combine these into one method (say, train_ensemble()). A sketch of such a combined variant follows below.
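A minimal sketch of the combined variant (train_ensemble is just the hypothetical name suggested above; it chains the two existing methods):

    # inside CustomEnsemble
    def train_ensemble(self, X_train, y_train):
        # fit the base models first, then the final model on their outputs
        self.train_base_models(X_train, y_train)
        self.train_final_model(X_train, y_train)
        return self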



