I have a pandas.Series (s, below) that I want to one-hot encode. I've found through research that the 'b' level is not important for my predictive modeling task. I can exclude it from my analysis like so:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='error')
enc.fit_transform(s)
# array([[1., 0.],
#        [0., 0.],
#        [0., 1.]])
enc.get_feature_names()
# array(['x0_a', 'x0_c'], dtype=object)
But when I go to transform a new series, one containing both 'b' and a new level, 'd', I get an error:
new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
enc.transform(new_s)
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['d'] in column 0 during transform
This is to be expected since I set handle_unknown='error' above. However, I'd like to completely ignore all classes except for ['a', 'c'] in both the fitting and subsequent transforming steps. I tried this:
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='ignore')
enc.fit_transform(s)
enc.transform(new_s)
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 371, in fit_transform
    self._validate_keywords()
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 289, in _validate_keywords
    "`handle_unknown` must be 'error' when the drop parameter is "
ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.
It seems this pattern is not supported in scikit-learn. Does anyone know a scikit-learn-compatible pattern to accomplish this task?
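For reference, one pattern that may fit is to skip drop entirely and instead pass the levels to keep through the categories parameter, combined with handle_unknown='ignore'. The sketch below is only that, a sketch: it assumes the levels of interest (['a', 'c']) are known up front, it builds the input arrays directly with NumPy rather than from a pandas.Series, and it leaves the output sparse and calls .toarray() so that it does not depend on the sparse keyword. Any level not listed in categories, whether seen during fitting ('b') or only at transform time ('d'), should then come out as an all-zero row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same data as in the question, built directly as (n_samples, 1) object arrays.
s = np.array([['a'], ['b'], ['c']], dtype=object)
new_s = np.array([['a'], ['b'], ['c'], ['d']], dtype=object)

# Assumption: the levels to keep are known in advance and listed explicitly.
# With handle_unknown='ignore', values outside this list are encoded as all zeros
# instead of raising, both when fitting and when transforming.
enc = OneHotEncoder(categories=[['a', 'c']], handle_unknown='ignore')

# The default output is a sparse matrix; convert to a dense array for display.
fit_out = enc.fit_transform(s).toarray()
new_out = enc.transform(new_s).toarray()

print(enc.categories_)  # the kept levels, one array per input column
print(fit_out)          # rows for 'a', 'b', 'c'
print(new_out)          # rows for 'a', 'b', 'c', 'd'
```

The trade-off is that the kept levels have to be maintained by hand rather than inferred from the data, so this only suits cases where the list of relevant levels is fixed.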