Tuesday, 18 June 2019

Is it possible to specify handle_unknown = 'ignore' for certain columns and 'error' for others inside OneHotEncoder?

I have a DataFrame in which every column is categorical, and I am encoding it with OneHotEncoder from sklearn.preprocessing. My code is below:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

steps = [('OneHotEncoder', OneHotEncoder(handle_unknown='ignore')), ('LReg', LinearRegression())]

pipeline = Pipeline(steps)

As shown above, the handle_unknown parameter of OneHotEncoder takes either "error" or "ignore", and the setting applies to every column the encoder sees. Is there a way to ignore unknown categories for some columns while raising an error for others?
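To make the two settings concrete, here is a minimal sketch of how each one reacts to a category that was not present during fit. The toy values (`train`, `unseen`) and the variable names `lenient`, `strict`, `row`, and `raised` are made up for illustration and are not part of my real data.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: three known categories, one unseen category.
train = [['USA'], ['IND'], ['UK']]
unseen = [['FRA']]

# handle_unknown='ignore': the unseen category is encoded as an all-zero row.
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
row = lenient.transform(unseen).toarray()

# handle_unknown='error': transform raises ValueError on the unseen category.
strict = OneHotEncoder(handle_unknown='error').fit(train)
try:
    strict.transform(unseen)
    raised = False
except ValueError:
    raised = True
```

With 'ignore', `row` has one column per known category and contains only zeros; with 'error', `raised` ends up True.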

import pandas as pd

df = pd.DataFrame({'Country':['USA','USA','IND','UK','UK','UK'],
                   'Fruits':['Apple','Strawberry','Mango','Berries','Banana','Grape'],
                   'Flower':   ['Rose','Lily','Orchid','Petunia','Lotus','Dandelion'],
                   'Result':[1,2,3,4,5,6,]})

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

steps = [('OneHotEncoder', OneHotEncoder(handle_unknown ='ignore')) ,('LReg', LinearRegression())]

pipeline = Pipeline(steps)

from sklearn.model_selection import train_test_split

X = df[["Country","Flower","Fruits"]]
Y = df["Result"]
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3, random_state=30, shuffle =True)

print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape)
print("y_test.shape:", y_test.shape)

pipeline.fit(X_train,y_train)

y_pred = pipeline.predict(X_test)

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

#Mean Squared Error:
MSE = mean_squared_error(y_test,y_pred)

print("MSE", MSE)

#Root Mean Squared Error:
from math import sqrt

RMSE = sqrt(MSE)
print("RMSE", RMSE)

#R-squared score:
R2_score = r2_score(y_test,y_pred)

print("R2_score", R2_score)

In this case, if a new value appears in any of the columns (Country, Fruits, or Flower), the model can still predict an output.

Is there a way to ignore unknown categories for Fruits and Flower, but raise an error for an unknown Country?
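One possible approach, sketched here under assumptions rather than offered as a confirmed answer, is to wrap two separately configured encoders in a ColumnTransformer from sklearn.compose, so that each group of columns gets its own handle_unknown setting. The step names ('strict', 'lenient', 'encode') and the test rows with 'Kiwi' and 'FRA' are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'IND', 'UK', 'UK', 'UK'],
    'Fruits': ['Apple', 'Strawberry', 'Mango', 'Berries', 'Banana', 'Grape'],
    'Flower': ['Rose', 'Lily', 'Orchid', 'Petunia', 'Lotus', 'Dandelion'],
    'Result': [1, 2, 3, 4, 5, 6],
})
X = df[['Country', 'Flower', 'Fruits']]
y = df['Result']

# One encoder per column group, each with its own handle_unknown policy.
preprocess = ColumnTransformer([
    ('strict', OneHotEncoder(handle_unknown='error'), ['Country']),
    ('lenient', OneHotEncoder(handle_unknown='ignore'), ['Fruits', 'Flower']),
])

pipeline = Pipeline([('encode', preprocess), ('LReg', LinearRegression())])
pipeline.fit(X, y)

# Hypothetical row with an unseen fruit: the lenient encoder lets it through.
new_fruit = pd.DataFrame({'Country': ['UK'], 'Flower': ['Rose'], 'Fruits': ['Kiwi']})
pred = pipeline.predict(new_fruit)

# Hypothetical row with an unseen country: the strict encoder raises ValueError.
new_country = pd.DataFrame({'Country': ['FRA'], 'Flower': ['Rose'], 'Fruits': ['Apple']})
try:
    pipeline.predict(new_country)
    country_rejected = False
except ValueError:
    country_rejected = True
```

The sketch fits on the full six rows to keep it short; the same `preprocess` step could replace the single OneHotEncoder in the train/test pipeline above.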


