Sunday, 5 January 2020

Dummify categorical variables for logistic regression with pandas and scikit (OneHotEncoder)

I read this blog post about new features in scikit-learn. OneHotEncoder accepting strings seems like a useful one. Below is my attempt to use it:

import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

train_df = pd.read_csv('../../data/train.csv', usecols=cols)
test_df = pd.read_csv('../../data/test.csv', usecols=[e for e in cols if e != 'Survived'])

train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.copy()

ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False), ['Sex', 'Embarked'])], remainder='passthrough')

X_train_t = ct.fit_transform(train_df)
X_test_t  = ct.fit_transform(test_df)

print(X_train_t[0])
print(X_test_t[0])

# [ 0.    1.    0.    0.    1.    0.    3.   22.    1.    0.    7.25]
# [ 0.      1.      0.      1.      0.      3.     34.5     0.      0.      7.8292]

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train_t, Y_train)
Y_pred = logreg.predict(X_test_t) # ValueError: X has 10 features per sample; expecting 11
acc_log = round(logreg.score(X_train_t, Y_train) * 100, 2)

print(acc_log)

I get a Python error with this code (ValueError: X has 10 features per sample; expecting 11), and I also have some additional concerns.

To start from the beginning: this script is written for the "Titanic" dataset from Kaggle. We have five numerical columns: Pclass, Age, SibSp, Parch and Fare. The columns Sex and Embarked are categorical, with the values male/female and Q/S/C (each an abbreviation of a city name).

My understanding of OneHotEncoder is that it creates dummy variables by adding columns. The output of ct.fit_transform() is no longer a pandas DataFrame but a NumPy array. As the print statements show, there are now more than the original 7 feature columns.
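To make the column arithmetic concrete, here is a minimal sketch on a few made-up rows (the toy frame and all names starting with toy_ are invented for illustration and are not taken from train.csv). It assumes a scikit-learn version that has ColumnTransformer; sparse_threshold=0 is used to ask for a dense array instead of passing a sparse flag to the encoder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# sparse_threshold=0 makes the ColumnTransformer return a dense NumPy array.
toy_ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["Sex", "Embarked"])],
    remainder="passthrough",
    sparse_threshold=0,
)
toy_out = toy_ct.fit_transform(toy)

# Categories learned for each encoded column (sorted by the encoder).
toy_categories = [list(c) for c in toy_ct.named_transformers_["onehot"].categories_]

print(type(toy_out).__name__, toy_out.shape)
print(toy_categories)
print(toy_out[0])
```

The encoded columns come first (2 for Sex, 3 for Embarked), followed by the passthrough columns in their original order, so 4 input columns become 7 output columns here.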

There are three problems I encounter:

  1. For some reason the transformed test.csv has one column fewer. That would indicate to me that there is one less option in one of the categories. To fix that, I would have to find all the available options in the categories over both the train and test data, and then use these options (such as male/female) to transform the train and the test data separately. I have no idea how to do this with the tools I'm working with (pandas, scikit-learn, etc.). On second thought, after inspecting the data I cannot find a missing option in test.csv.

  2. I want to avoid the "dummy variable trap". Right now it seems that too many columns are created. I was expecting 1 column for Sex (2 options minus 1 to avoid the trap) and 2 for Embarked. With the additional 5 numerical columns that would come to 8 in total.

  3. I don't recognize the output of the transform anymore. I would prefer a new DataFrame where the new dummy columns have been given their own names, such as Sex_male (1/0), Embarked_Q (1/0) and Embarked_S (1/0).

I'm only used to gretl, where dummifying a variable and leaving out one option is very natural. I don't know whether I'm doing it wrong in Python or whether this scenario is simply not part of the standard scikit-learn toolkit. Any advice? Maybe I should write a custom encoder for this?
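A custom encoder may not be necessary. As a pandas-only alternative, here is a hedged sketch using pd.get_dummies() with drop_first=True, which is close to the gretl workflow. The helper dummify() and the gd_ frames are hypothetical names with made-up data. The category levels are fixed from the training frame first, so that the train and test frames end up with identical columns and the same level is dropped in both.

```python
import pandas as pd

# Made-up stand-ins for the train and test features (illustration only).
gd_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})
gd_test = pd.DataFrame({
    "Pclass": [3, 2],
    "Sex": ["male", "female"],
    "Embarked": ["Q", "S"],
})

# Levels per categorical column, taken from the training data only.
levels = {col: sorted(gd_train[col].unique()) for col in ["Sex", "Embarked"]}

def dummify(df, levels):
    """Hypothetical helper: dummy-encode df using fixed category levels."""
    out = df.copy()
    for col, cats in levels.items():
        # A fixed Categorical makes get_dummies emit a column per known level,
        # including levels that do not occur in this particular frame.
        out[col] = pd.Categorical(out[col], categories=cats)
    # drop_first=True removes the first level of each column.
    return pd.get_dummies(out, columns=list(levels), drop_first=True, dtype=int)

gd_train_d = dummify(gd_train, levels)
gd_test_d = dummify(gd_test, levels)

print(gd_test_d)
```

One caveat of this approach: a value in the test data that never occurred in the training data becomes missing in the Categorical and is encoded as all zeros, which is indistinguishable from the dropped level.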


