I have the following function that randomly shuffles the values of one column of a dataframe, then runs a fitted RandomForestClassifier on the whole dataframe (including the shuffled column) to get the accuracy score.
I would like to run this function concurrently for each column of the dataframe, because the dataframe is pretty large: 500k rows and 1k columns. The key is to shuffle only one column at a time.
However, I am struggling to understand why ProcessPoolExecutor is much slower than ThreadPoolExecutor. I thought ThreadPoolExecutor was only supposed to be faster for I/O-bound tasks, and this case doesn't involve reading from or writing to any files.
Or have I done something wrong here? Is there a more efficient way to structure this code so that it runs concurrently and faster?
import concurrent.futures

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold


def randomShuffle(colname, X, y, fit):
    out = {'col_name': colname}
    X_ = X.copy(deep=True)
    # permute a single column; assign the result back instead of shuffling
    # .values in place, which may act on a copy or a read-only array
    X_[colname] = np.random.permutation(X_[colname].values)
    pred = fit.predict(X_)
    out['acc_scr'] = accuracy_score(y, pred)
    return out


def runConcurrent(classifier, X, y):
    skf = KFold(n_splits=5, shuffle=False)
    acc_scr0, acc_scr1 = pd.Series(dtype=float), pd.DataFrame(columns=X.columns)
    # split data into training and validation sets
    for i, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        X_train, y_train = X.iloc[train_idx, :], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx, :], y.iloc[val_idx]
        fit = classifier.fit(X=X_train, y=y_train)
        # baseline accuracy score on the unshuffled validation fold
        pred = fit.predict(X_val)
        acc_scr0.loc[i] = accuracy_score(y_val, pred)
        # with concurrent.futures.ProcessPoolExecutor() as executor:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            # one task per column; each task shuffles only that column
            # (randomShuffle takes no `labels` argument, so none is passed)
            results = [executor.submit(randomShuffle, colname=j, X=X_val, y=y_val, fit=fit)
                       for j in X.columns]
            for res in concurrent.futures.as_completed(results):
                r = res.result()
                acc_scr1.loc[i, r['col_name']] = r['acc_scr']
    return None
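One thing that may be worth checking: ProcessPoolExecutor pickles the arguments of every submitted task and sends them to a worker process, while threads share the parent's memory and copy nothing. The sketch below is only a rough way to estimate that per-task payload. It uses a small synthetic dataframe whose sizes and names (`n_rows`, `n_cols`, `bytes_per_task`) are assumptions for illustration, not the real 500k x 1k data, and it leaves out the fitted classifier, which would add to the payload.

```python
import pickle

import numpy as np
import pandas as pd

# Small synthetic stand-in for one validation fold (sizes are assumptions).
n_rows, n_cols = 1000, 20
rng = np.random.default_rng(0)
X_val = pd.DataFrame(rng.normal(size=(n_rows, n_cols)),
                     columns=[f"c{j}" for j in range(n_cols)])
y_val = pd.Series(rng.integers(0, 2, size=n_rows))

# With a process pool, the arguments of each submit() call are pickled,
# so this approximates what is serialized for ONE column's task.
bytes_per_task = len(pickle.dumps((X_val, y_val)))

# One task is submitted per column, so the payload is sent n_cols times.
total_bytes = bytes_per_task * n_cols

print(f"per task: {bytes_per_task / 1e6:.2f} MB, "
      f"all columns: {total_bytes / 1e6:.2f} MB")
```

If the same estimate on the real fold comes out large, serialization could account for part of the gap between the two executors; it does not rule out other causes.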
from Python: ProcessPoolExecutor vs ThreadPoolExecutor