The following MWE creates an ensemble from the features selected by the SelectKBest algorithm, using a RandomForest classifier as the base model.
# required imports
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import Pipeline
# ensemble created from the selected features
def get_ensemble(n_features):
    # define base models
    models = []
    # enumerate the features in the training dataset
    for i in range(1, n_features + 1):
        # feature selection transform
        fs = SelectKBest(score_func=f_classif, k=i)
        # create the model
        model = RandomForestClassifier(n_estimators=50)
        # create the pipeline
        pipe = Pipeline([('fs', fs), ('m', model)])
        # list of (name, model) tuples for voting
        models.append((str(i), pipe))
    # define the voting ensemble
    ensemble_clf = VotingClassifier(estimators=models, voting='hard')
    return ensemble_clf
So, to use the ensemble model:
# generate data for a 3-class classification problem
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=3)
X = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
X_train.head()
A B C D E F G H I J
541 0.1756 -0.3772 -1.6396 -0.7524 0.2138 0.3113 -1.4906 -0.2885 0.1226 0.2057
440 -0.4381 -0.3302 0.7514 -0.4684 -1.2477 -0.5081 -0.7934 -0.3138 0.8423 -0.4038
482 -0.6648 1.2337 -0.2878 -1.6737 -1.2377 -0.4479 -1.1843 -0.2424 -0.9935 -1.4537
422 0.6099 0.2475 0.9612 -0.7339 0.6926 -1.5761 -1.6061 -0.3879 -0.1895 1.3738
778 -1.4893 0.5234 1.6126 0.8704 -2.7363 -1.3818 -0.2196 -0.7894 -1.1755 -2.8779
# get the ensemble model
ensemble_classifier = get_ensemble(X_train.shape[1])
ensemble_classifier.fit(X_train, y_train)
This creates 10 base models (n_features=10) and then an ensemble VotingClassifier based on majority voting (voting='hard').
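To sanity-check the fitted ensemble, a minimal sketch (accuracy is just one convenient metric here):
from sklearn.metrics import accuracy_score

# hard-voting predictions from the 10 base models on the held-out set
yhat = ensemble_classifier.predict(X_test)
print(accuracy_score(y_test, yhat))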
Question:
The MWE described above works fine. However, I would like to replace the SelectKBest feature selection process in the get_ensemble function.
I have conducted a different feature selection process, and discovered the "optimal" feature subset for each class in this dataset as follows:
             | best predictors
-------------+-------------------
     class 0 | A, B, C
     class 1 | D, E, F, G
     class 2 | G, H, I, J
-------------+-------------------
So the modification I would like to make to get_ensemble is that, instead of iterating over the number of available features and creating n base models, it should create 3 (the number of classes) base models, where:
- base-model 1 will be fitted using the feature subset ['A', 'B', 'C'].
- base-model 2 will be fitted using the feature subset ['D', 'E', 'F', 'G'].
- base-model 3 will be fitted using the feature subset ['G', 'H', 'I', 'J'].
- finally, the ensemble_classifier votes on the base models' outputs by majority.
That is, when I make the call to:
ensemble_classifier.fit(X_train, y_train)
it should proceed like so:
# 1st base model fitted on its feature subset
model.fit(X_train[['A', 'B', 'C']], y_train)
# 2nd base model
model.fit(X_train[['D', 'E', 'F', 'G']], y_train)
# 3rd base model
model.fit(X_train[['G', 'H', 'I', 'J']], y_train)
The same should apply during prediction: each base model must select the appropriate feature subset from X_test to make its prediction in ensemble_classifier.predict(X_test) before the final voting.
I am not sure how to proceed. Any ideas?
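One direction I have been looking at, but have not verified, is to let each pipeline select its own columns with sklearn's ColumnTransformer, so fit/predict on the full frame behave as described above (a rough, untested sketch; get_ensemble_from_subsets is just my placeholder name):
from sklearn.compose import ColumnTransformer

def get_ensemble_from_subsets(feature_subsets):
    # feature_subsets: dict mapping class label -> list of column names
    models = []
    for label, features in feature_subsets.items():
        # keep only this base model's columns, drop everything else
        selector = ColumnTransformer([('sel', 'passthrough', features)],
                                     remainder='drop')
        model = RandomForestClassifier(n_estimators=50)
        pipe = Pipeline([('fs', selector), ('m', model)])
        models.append((str(label), pipe))
    return VotingClassifier(estimators=models, voting='hard')

ensemble_classifier = get_ensemble_from_subsets({0: ['A', 'B', 'C'],
                                                 1: ['D', 'E', 'F', 'G'],
                                                 2: ['G', 'H', 'I', 'J']})
ensemble_classifier.fit(X_train, y_train)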
EDIT
Regarding this question, I made some changes (e.g. not using the VotingClassifier) in order to further train the final ensemble on the output of the base models (the base models' confidences), and then finally make predictions.
I created the following ensemble class:
from sklearn.base import clone

class CustomEnsemble:
    def __init__(self, base_model, best_feature_subsets):
        # one clone of the base model per class, plus a fresh clone
        # for the final model (so the caller's estimator is untouched)
        self.base_models = {class_label: clone(base_model)
                            for class_label in best_feature_subsets}
        self.best_feature_subsets = best_feature_subsets
        self.final_model = clone(base_model)

    def train_base_models(self, X_train, y_train):
        # each base model is a one-vs-rest detector for its class,
        # trained on its own feature subset
        for class_label, features in self.best_feature_subsets.items():
            model = self.base_models[class_label]
            model.fit(X_train[features], (y_train == class_label))
        return self

    def train_final_model(self, X_train, y_train):
        """
        Probably better to implement the train methods (base models &
        ensemble) in one method such as train_base_models, altogether.
        """
        predictions = pd.DataFrame()
        for class_label, model in self.base_models.items():
            predictions[class_label] = model.predict_proba(
                X_train[self.best_feature_subsets[class_label]])[:, 1]
        self.final_model.fit(predictions, y_train)

    def predict_base_models(self, X_test):
        predictions = pd.DataFrame()
        for class_label, model in self.base_models.items():
            predictions[class_label] = model.predict_proba(
                X_test[self.best_feature_subsets[class_label]])[:, 1]
        return predictions

    def predict(self, X_test):
        base_model_predictions = self.predict_base_models(X_test)
        return self.final_model.predict(base_model_predictions)

    def predict_proba_base_models(self, X_test):
        predictions = pd.DataFrame()
        for class_label, model in self.base_models.items():
            predictions[class_label] = model.predict_proba(
                X_test[self.best_feature_subsets[class_label]])[:, 1]
        return predictions

    def predict_proba(self, X_test):
        base_model_predictions = self.predict_proba_base_models(X_test)
        return self.final_model.predict_proba(base_model_predictions)
Usage:
- Define a dictionary of best feature subsets per class:
optimal_features = {
    0: ['A', 'B', 'C'],
    1: ['D', 'E', 'F', 'G'],
    2: ['G', 'H', 'I', 'J']
}
- Instantiate the class:
classifier = RandomForestClassifier()
ensemble = CustomEnsemble(classifier, optimal_features)
- Train models:
# first, train base models
ensemble.train_base_models(X_train, y_train)
# then, train the ensemble
ensemble.train_final_model(X_train, y_train)
- Make predictions:
yhat = ensemble.predict(X_test)
yhat_proba = ensemble.predict_proba(X_test)  # so as to calculate roc_auc_score(), as sketched below
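For example, a minimal sketch of the AUC computation (assuming yhat_proba columns are ordered by class label, as predict_proba returns them):
from sklearn.metrics import roc_auc_score

# one-vs-rest multiclass AUC from the final model's class probabilities
print(roc_auc_score(y_test, yhat_proba, multi_class='ovr'))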
- However, it appears I am not doing things right. I am not training the ensemble on the output of the base models, but on the original input features.
- Also, I am not sure if separating train_base_models() and train_final_model() is the best approach (this implies fitting twice: base models, then the final model, as in the usage above). Or is it better to combine these into one method (say train_ensemble())?
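For the second point, a trivial wrapper method added inside the class above would fit everything in one call (train_ensemble is a hypothetical name):
    # added inside CustomEnsemble
    def train_ensemble(self, X_train, y_train):
        # fit the base models first, then the final model on their outputs
        self.train_base_models(X_train, y_train)
        self.train_final_model(X_train, y_train)
        return self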
from Creating an ensemble of classifiers based on predefined feature subsets