Hemant Vishwakarma: Running two dask-ml imputers simultaneously instead of sequentially

Sunday, 27 December 2020

Running two dask-ml imputers simultaneously instead of sequentially

I can impute the mean and most frequent value using dask-ml like so, this works fine:

mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')
data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns = ['Weight', 'Age', 'Height']) 
df.iloc[:, [0,1]] = mean_imputer.fit_transform(df.iloc[:,[0,1]])
df.iloc[:, [2]] = most_frequent_imputer.fit_transform(df.iloc[:,[2]])
print(df)


    Weight  Age   Height
0   100.0   2.0   5.0
1   85.0    4.5   5.0
2   70.0    7.0   5.0

But what if I have 100 million rows of data it seems that dask would do two loops when it could have done only one, is it possible to run both imputers simultaneously and/or in parallel instead of sequentially? What would be a sample code to achieve that?

from Running two dask-ml imputers simultaneously instead of sequentially

Hemant Vishwakarma

Sunday, 27 December 2020

Running two dask-ml imputers simultaneously instead of sequentially

No comments:

Post a Comment