I have the following data:
# Libraries
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.metrics.pairwise import nan_euclidean_distances
# Data set
toy_example = pd.DataFrame(data = {"Color": ["Blue", "Red", "Green", "Blue", np.nan],
"Size": ["S", "M", "L", np.nan, "S"],
"Weight": [10, np.nan, 15, 12, np.nan],
"Age": [2, 4, np.nan, 3, 1]})
toy_example
I want to impute the variables Color (nominal), Size (ordinal), Weight (numerical) and Age (numerical) with KNNImputer from sklearn.impute, using the nan_euclidean distance metric.
I know that I need to pre-process the data first, so I came up with the following two solutions:
a. One-hot encoding for the nominal variable, where the NaN values are encoded as a separate category
# Preprocessing the data
color_encoder = OneHotEncoder()
color_encoder.fit(X=toy_example[["Color"]])
## Checking categories and names
### A na dummy is included by default
color_encoder.categories_
color_encoder.get_feature_names_out()
# Create a new DataFrame with the one-hot encoded "Color" column
color_encoded = pd.DataFrame(color_encoder.transform(toy_example[["Color"]]).toarray(),
columns=color_encoder.get_feature_names_out(["Color"]))
color_encoded
# Create a dictionary to map the ordinal values of the "Size" column to numerical values
size_map = {"S": 1, "M": 2, "L": 3}
size_map
toy_example["Size"] = toy_example["Size"].map(size_map)
# Concatenate encoded variables with numerical variables
preprocessed_data = pd.concat([color_encoded, toy_example[["Size", "Weight", "Age"]]],
axis=1)
preprocessed_data
## Matrix of euclidean distances
matrix_nan_euclidean = nan_euclidean_distances(X=preprocessed_data)
matrix_nan_euclidean
# Perform nearest neighbors imputation
imputer = KNNImputer(n_neighbors=2)
imputed_df = pd.DataFrame(imputer.fit_transform(preprocessed_data),
columns=preprocessed_data.columns)
## Here I have a problem where the NaN value in the variable
## "Color" in relation to the 5th row is not imputed
### I was expecting a 0 in the Color_nan and a positive value
### in any of the columns Color_Blue, Color_Green, Color_Red
imputed_df
As I mention in the code comments, this solution is not feasible for the nominal variable: in the following result the nominal variable is not imputed at all.
   Color_Blue  Color_Green  Color_Red  Color_nan  Size  Weight  Age
0         1.0          0.0        0.0        0.0   1.0    10.0  2.0
1         0.0          0.0        1.0        0.0   2.0    13.5  4.0
2         0.0          1.0        0.0        0.0   3.0    15.0  2.5
3         1.0          0.0        0.0        0.0   1.5    12.0  3.0
4         0.0          0.0        0.0        1.0   1.0    12.5  1.0
For the ordinal variable, at least the value is imputed, although I still need to decide on the appropriate rounding method (classical rounding, ceiling or floor).
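To make the rounding choice concrete, here is a small sketch that applies the three options to the imputed Size values from the table above and maps them back to labels. The helper decode_size and the inverse mapping are hypothetical names introduced only for this illustration; note that np.rint rounds halves to the nearest even integer, which may differ from what "classical rounding" is expected to do.

```python
import numpy as np
import pandas as pd

# Imputed "Size" codes copied from the result table above
size_imputed = pd.Series([1.0, 2.0, 3.0, 1.5, 1.0], name="Size")

size_map = {"S": 1, "M": 2, "L": 3}
# Reverse lookup: numeric code -> original label
inverse_size_map = {code: label for label, code in size_map.items()}


def decode_size(values, rounder):
    """Round imputed codes, clip them to the valid range, map back to labels."""
    lowest, highest = min(inverse_size_map), max(inverse_size_map)
    codes = rounder(values).clip(lowest, highest).astype(int)
    return codes.map(inverse_size_map)


size_floor = decode_size(size_imputed, np.floor)
size_ceil = decode_size(size_imputed, np.ceil)
size_round = decode_size(size_imputed, np.rint)  # halves go to the even integer

print(pd.DataFrame({"floor": size_floor, "ceil": size_ceil, "round": size_round}))
```

Only the 4th row (imputed value 1.5) depends on the choice; the other rows already hold valid codes.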
b. One-hot encoding for the nominal variable, where the NaN values are not encoded as a category and the remaining dummy variables are set to NaN
# Preprocessing the data
color_encoder = OneHotEncoder()
color_encoder.fit(X=toy_example[["Color"]])
## Checking categories and names
### A na dummy is included by default
color_encoder.categories_
color_encoder.get_feature_names_out()
# Create a new DataFrame with the one-hot encoded "Color" column
color_encoded = pd.DataFrame(color_encoder.transform(toy_example[["Color"]]).toarray(),
columns=color_encoder.get_feature_names_out(["Color"]))
color_encoded
## Don't take into account the nan values as a separate category
color_encoded = color_encoded.loc[:, "Color_Blue":"Color_Red"]
## Because I don't know in advance the values of the dummy variables
## I will replace them with NaN values which is a logical solution taking
## into account that I don't know the value of this observation in relation
## to the "Color" variable
color_encoded.iloc[4, :] = np.nan
color_encoded
# Create a dictionary to map the ordinal values of the "Size" column to numerical values
size_map = {"S": 1, "M": 2, "L": 3}
size_map
toy_example["Size"] = toy_example["Size"].map(size_map)
# Concatenate encoded variables with numerical variables
preprocessed_data = pd.concat([color_encoded, toy_example[["Size", "Weight", "Age"]]],
axis=1)
preprocessed_data
## Matrix of euclidean distances
matrix_nan_euclidean = nan_euclidean_distances(X=preprocessed_data)
matrix_nan_euclidean
# Perform nearest neighbors imputation
imputer = KNNImputer(n_neighbors=2)
imputed_df = pd.DataFrame(imputer.fit_transform(preprocessed_data),
columns=preprocessed_data.columns)
## Here I have a problem because I need to decide
## how to round the values of the 5th row (classical
## rounding, ceiling or floor). However, all of these
## methods can give inconsistent results: an observation
## cannot be Blue and Green at the same time, but it
## has to be exactly one of Blue, Green or Red
imputed_df
As I mention in the code comments, this solution is not feasible for the nominal variable either: in the following result, depending on the rounding, the nominal variable takes two values at once or no value at all.
   Color_Blue  Color_Green  Color_Red  Size  Weight  Age
0         1.0          0.0        0.0   1.0    10.0  2.0
1         0.0          0.0        1.0   2.0    13.5  4.0
2         0.0          1.0        0.0   3.0    15.0  3.5
3         1.0          0.0        0.0   1.5    12.0  3.0
4         0.5          0.5        0.0   1.0    12.5  1.0
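One way to force a single category out of the fractional dummies, instead of rounding each column separately, is to pick the column with the largest imputed value per row. This is only a sketch of that idea, using the dummy values from the table above; color_decoded and is_tie are illustrative names. It does not resolve ties in a principled way: idxmax simply returns the first of the tied columns, so the 5th row becomes Blue only because Color_Blue comes first.

```python
import pandas as pd

# Imputed dummy columns copied from the result table above
color_imputed = pd.DataFrame({
    "Color_Blue":  [1.0, 0.0, 0.0, 1.0, 0.5],
    "Color_Green": [0.0, 0.0, 1.0, 0.0, 0.5],
    "Color_Red":   [0.0, 1.0, 0.0, 0.0, 0.0],
})

# Per row, take the label of the column with the largest value,
# then strip the "Color_" prefix to recover the category name
color_decoded = color_imputed.idxmax(axis=1).str.replace("Color_", "", regex=False)
color_decoded.name = "Color"

# Flag rows where the largest value is shared by several columns
row_max = color_imputed.max(axis=1)
is_tie = color_imputed.eq(row_max, axis=0).sum(axis=1) > 1

print(pd.DataFrame({"Color": color_decoded, "tie": is_tie}))
```

With n_neighbors=2 such ties are easy to produce; an odd number of neighbors or weights="distance" in KNNImputer may make them less frequent, but that is an assumption to check on real data.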
Given that neither a. nor b. works, does anyone know how to impute a nominal variable in a consistent way using multivariate imputation?
In particular, how can I impute the Color variable of toy_example?
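For reference, one direction that keeps the nominal variable consistent is to skip the dummies altogether: compute nan_euclidean distances on the other columns only, and give each missing Color the most frequent Color among its nearest neighbors that have one. The sketch below is not what KNNImputer does internally; the function impute_nominal_by_knn_mode is hypothetical, the features are left unscaled (so Weight can dominate the distance), and ties in the mode are broken alphabetically by pandas.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

# toy_example with "Size" already mapped to 1, 2, 3
toy = pd.DataFrame({
    "Color": ["Blue", "Red", "Green", "Blue", np.nan],
    "Size": [1, 2, 3, np.nan, 1],
    "Weight": [10, np.nan, 15, 12, np.nan],
    "Age": [2, 4, np.nan, 3, 1],
})


def impute_nominal_by_knn_mode(df, target, features, n_neighbors=2):
    """Fill missing `target` with the mode of its nearest donors' values."""
    filled = df[target].copy()
    # Pairwise distances that ignore coordinates missing in either row
    dist = nan_euclidean_distances(df[features].to_numpy(dtype=float))
    donor_pos = np.flatnonzero(df[target].notna().to_numpy())
    for pos in np.flatnonzero(df[target].isna().to_numpy()):
        # Distances to rows with an observed target; NaN distances sort last
        order = np.argsort(dist[pos, donor_pos], kind="stable")[:n_neighbors]
        nearest = df[target].iloc[donor_pos[order]]
        filled.iloc[pos] = nearest.mode().iloc[0]
    return filled


color_filled = impute_nominal_by_knn_mode(toy, "Color", ["Size", "Weight", "Age"])
print(color_filled.tolist())
```

Every imputed row ends up with exactly one existing category, which is the consistency the two encodings above could not guarantee.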
from KNN imputer with nominal, ordinal and numerical variables