Monday, 12 April 2021

Correspondence between R's caret knnImpute and Python's sklearn.impute KNNImputer results

I am trying to replicate a machine learning process across different programming platforms, but the imputed values I get from the preProcess function in R's caret package differ from those produced by the equivalent scikit-learn process in Python. The example dataset and process are shown below:

library(caret)

set.seed(197)
simulated.ds <- data.frame(
  var1 = rbinom(n=10000, size=1, prob=0.05),
  var2 = rbinom(n=10000, size=1, prob=0.4),
  var3 = rbinom(n=10000, size=1, prob=0.2),
  var4 = rbinom(n=10000, size=1, prob=0.03),
  var5 = rbinom(n=10000, size=1, prob=0.7),
  var6 = rbinom(n=10000, size=1, prob=0.1),
  var7 = rbinom(n=10000, size=1, prob=0.2)
)

# Set 1,250 randomly chosen values to NA in each of var1, var2, var5 and var6
set.seed(50)
ind1 <- sample(c(1:10000), 1250)
simulated.ds$var1[ind1] <- NA

set.seed(150)
ind2 <- sample(c(1:10000), 1250)
simulated.ds$var2[ind2] <- NA

set.seed(1000)
ind5 <- sample(c(1:10000), 1250)
simulated.ds$var5[ind5] <- NA

set.seed(500)
ind6 <- sample(c(1:10000), 1250)
simulated.ds$var6[ind6] <- NA

write.csv(simulated.ds, "rawDataR.csv", row.names = FALSE)

# Fit the preprocessing routine; method = "knnImpute" also centers and scales the data
prepRoutine <- caret::preProcess(simulated.ds, method = "knnImpute", k=5)
imputed_dataset <- predict(prepRoutine, simulated.ds)

write.csv(imputed_dataset, "imputedDataR.csv", row.names = FALSE)

When I try to replicate the imputation in Python 3 on the same simulated dataset, using the code below, the imputed values differ from those produced in R.

import random
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

rawDataR = pd.read_csv("C:/Users/AfrikanaScholar/Documents/rawDataR.csv")
imputedDataR = pd.read_csv("C:/Users/AfrikanaScholar/Documents/imputedDataR.csv")

random.seed(197)  # seeds Python's random module only; StandardScaler and KNNImputer are deterministic
scaler = StandardScaler()
imputer = KNNImputer(n_neighbors=5, weights="distance")

data_pipeline = Pipeline([
    ('scaler', scaler),
    ('imputer', imputer)
])

imputedDataPy = data_pipeline.fit_transform(rawDataR)

dataPy = pd.DataFrame(imputedDataPy).round(3)
dataR = imputedDataR.round(3) 

np.array_equal(dataR.values, dataPy.values)
# >> False
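
np.array_equal only reports that the two arrays differ somewhere. A per-column count of mismatched cells shows where they differ, which narrows the question to the imputed columns. The following is a minimal sketch of such a check; the helper name mismatch_summary and the toy frames are hypothetical stand-ins for dataR and dataPy, not part of the original code.

```python
import numpy as np
import pandas as pd

def mismatch_summary(left, right, atol=1e-3):
    """Count, per column, the cells where two equally shaped frames differ by more than atol."""
    a = np.asarray(left, dtype=float)
    b = np.asarray(right, dtype=float)
    # equal_nan=True treats a NaN in the same position of both frames as a match.
    differs = ~np.isclose(a, b, atol=atol, rtol=0.0, equal_nan=True)
    return pd.Series(differs.sum(axis=0), index=list(left.columns), name="n_mismatched")

# Hypothetical toy frames standing in for dataR and dataPy.
toy_r = pd.DataFrame({"var1": [0.1, 0.2, 0.3], "var2": [1.0, 2.0, 3.0]})
toy_py = pd.DataFrame({"var1": [0.1, 0.2, 0.3], "var2": [1.0, 2.5, 3.0]})

summary = mismatch_summary(toy_r, toy_py)
print(summary)
```

With the real data, columns that had no missing values (var3, var4 and var7) would be expected to show zero or near-zero mismatches if the scaling agrees, leaving any remaining differences to the imputation step.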

My questions are:

  1. Why would the two platforms, taking the same approach and using the same data, produce such differences? For example, the frequencies of the values in the 5th column (var5) after imputation, rounded to 3 decimal places, are:

    From python 3:

     Counter({
       -1.522: 2644,
       -1.086: 55,
       -0.65: 193,
       -0.215: 485,
        0.221: 478,
        0.398: 1,
        0.657: 6144        
        })
    

    From R:

         -1.522 -1.086  -0.65 -0.215  0.221  0.657 
          2638      3    222    272    582   6283
    
  2. What could be done to ensure that the values between the different platforms are as similar as possible?
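
One thing the two tables above already show can be checked with simple arithmetic. Assuming that -1.522 and 0.657 are the scaled forms of the observed 0 and 1 in var5, an unweighted mean of five such neighbours can only land on one of six evenly spaced values. The sketch below tests which reported values fall off that grid; the function name off_grid and the tolerance are illustrative choices.

```python
# Rounded values reported in the question for the 5th column (var5).
python_values = [-1.522, -1.086, -0.65, -0.215, 0.221, 0.398, 0.657]
r_values = [-1.522, -1.086, -0.65, -0.215, 0.221, 0.657]

def off_grid(values, low=-1.522, high=0.657, k=5, tol=1e-3):
    """Return the values that are not low + j/k * (high - low) for some j in 0..k."""
    # Assumption: low and high are the scaled versions of the binary values 0 and 1.
    grid = [low + j * (high - low) / k for j in range(k + 1)]
    return [v for v in values if min(abs(v - g) for g in grid) > tol]

print(off_grid(python_values))  # [0.398]
print(off_grid(r_values))       # []
```

Every R value sits on the grid, while the Python output contains one value (0.398) that does not. That is consistent with the Python pipeline using weights="distance", since a distance-weighted mean is not restricted to multiples of one fifth, but it does not by itself account for the differences in the on-grid counts.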

Such discrepancies would make the modelling process produce varying findings across different platforms.
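
As a starting point for the second question, here is a rough NumPy sketch of an imputer that follows what I understand caret's knnImpute to do. The following points are assumptions taken from my reading of the caret documentation, not verified against its source: columns are centered and scaled with the sample standard deviation (n - 1 in the denominator, whereas StandardScaler divides by n), neighbours are drawn only from rows with no missing values, distances use only the columns observed in the row being imputed, and the imputed value is the unweighted mean of the k neighbours. The function name caret_like_knn_impute is hypothetical.

```python
import numpy as np

def caret_like_knn_impute(X, k=5):
    """Scale, then impute NaNs with the unweighted mean of the k nearest complete rows."""
    X = np.asarray(X, dtype=float)
    # Center and scale using the sample standard deviation (ddof=1), ignoring NaNs.
    mu = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0, ddof=1)
    Z = (X - mu) / sd
    # Assumption: only rows without any missing value may serve as neighbours.
    incomplete = np.isnan(Z).any(axis=1)
    donors = Z[~incomplete]
    out = Z.copy()
    for i in np.flatnonzero(incomplete):
        miss = np.isnan(Z[i])
        # Euclidean distance over the columns observed in this row only.
        d = np.sqrt(((donors[:, ~miss] - Z[i, ~miss]) ** 2).sum(axis=1))
        # Stable sort: tied distances are resolved by row order (an arbitrary choice).
        nearest = np.argsort(d, kind="stable")[:k]
        # Unweighted mean of the neighbours, in the scaled space.
        out[i, miss] = donors[nearest][:, miss].mean(axis=0)
    return out

# Small illustrative input: the last row is missing its first value.
toy = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [np.nan, 1.0]])
result = caret_like_knn_impute(toy, k=2)
print(result)
```

Even with these choices aligned, exact agreement is not guaranteed. With binary predictors many candidate neighbours are tied at the same distance, and different implementations may break those ties differently, so the set of five neighbours can differ from row to row.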
