Tuesday, 21 December 2021

Passing different features to a multioutput regressor in scikit learn pipeline

Dear colleagues, I have created a scikit-learn pipeline to train and tune several HistGradientBoostingRegressor models.

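from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data_train, data_test, target_train, target_test = train_test_split(
    df.drop(columns=TARGETS),
    df[target_dict],
    random_state=42)

pipeline_hist_boost_mimo_inside = Pipeline([('scaler', StandardScaler()),
                                            ('variance_selector', VarianceThreshold(threshold=0.03)),
                                            ('estimator', MultiOutputRegressor(HistGradientBoostingRegressor(loss='poisson')))])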

from scipy.stats import loguniform

class loguniform_int:
    """Integer valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)

parameters = {
    'estimator__estimator__l2_regularization': loguniform(1e-6, 1e3),
    'estimator__estimator__learning_rate': loguniform(0.001, 10),
    'estimator__estimator__max_leaf_nodes': loguniform_int(2, 256),
    'estimator__estimator__min_samples_leaf': loguniform_int(1, 100),
    'estimator__estimator__max_bins': loguniform_int(2, 255),
}

random_grid_inside = RandomizedSearchCV(estimator=pipeline_hist_boost_mimo_inside, param_distributions=parameters, random_state=0, n_iter=50,
                                       n_jobs=-1, refit=True, cv=3, verbose=True,
                                       pre_dispatch='2*n_jobs', 
                                       return_train_score=True)

results_inside_train = random_grid_inside.fit(data_train, target_train)

However, I would now like to know whether it is possible to pass a different set of features to each regressor inside the step pipeline_hist_boost_mimo_inside["estimator"].

I have noticed that the documentation of MultiOutputRegressor lists an attribute called feature_names_in_:

feature_names_in_ : ndarray of shape (n_features_in_,)
    Names of features seen during fit. Only defined if the underlying estimators expose such an attribute when fit.

New in version 1.0.

I have also found the documentation for scikit-learn's make_column_selector, which has the following argument:

https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector

pattern : str, default=None
    Name of columns containing this regex pattern will be included. If None, column selection will not be selected based on pattern.
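
For reference, a small sketch of how the pattern argument behaves when the selector is called on a DataFrame. The column names and the "^a_" prefix convention are assumptions made for this example only:

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical feature names where the prefix encodes the related target.
df_example = pd.DataFrame({"a_x": [1.0, 2.0], "a_y": [3.0, 4.0], "b_x": [5.0, 6.0]})

# The selector is a callable that returns the names of the matching columns.
selector = make_column_selector(pattern=r"^a_")
selected = selector(df_example)
```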

The problem is that this pattern will depend on the target that I am fitting.

Is there a way to do this elegantly?


