Hemant Vishwakarma: Python: ProcessPoolExecutor vs ThreadPoolExecutor

Friday, 7 July 2023

Python: ProcessPoolExecutor vs ThreadPoolExecutor

I have the following function, which randomly shuffles the values of one column of a dataframe and then runs a fitted RandomForestClassifier on the whole dataframe, including the shuffled column, to get the accuracy score.

I would like to run this function concurrently for each column of the dataframe, because the dataframe is large: 500k rows and 1k columns. The key is that only one column is shuffled at a time.

However, I am struggling to understand why ProcessPoolExecutor is much slower than ThreadPoolExecutor here. I thought ThreadPoolExecutor was only supposed to be faster for I/O-bound tasks, and this task does not involve reading from or writing to any files.

Or have I done something wrong here? Is there a more efficient way to write this code so that it runs concurrently and faster?
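One thing worth checking is how the arguments reach the workers. ProcessPoolExecutor pickles the arguments of every submitted task and sends them to another process, while threads read the same objects in memory. A rough way to gauge that cost is to pickle one task's arguments directly. The sketch below uses a hypothetical `payload_bytes` helper and a small list-of-lists `table` standing in for the dataframe; both are illustrative assumptions, not part of the code in this post.

```python
import pickle


def payload_bytes(*args, **kwargs):
    """Approximate bytes pickled to ship one task's arguments to a worker process."""
    return len(pickle.dumps((args, kwargs), protocol=pickle.HIGHEST_PROTOCOL))


# Hypothetical stand-in for the validation frame: 1,000 rows x 100 float columns.
table = [[float(r * c) for c in range(100)] for r in range(1_000)]

per_task = payload_bytes("col_0", table)  # arguments of a single submit() call
n_tasks = 100                             # one task per column
print(f"per task: {per_task:,} bytes, all tasks: {per_task * n_tasks:,} bytes")
```

In the code in this post, the validation frame, the labels and the fitted model are passed to every `submit` call, so with a process pool a cost of this kind would be paid once per column.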

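import concurrent.futures

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def randomShuffle(colname, X, y, fit):
    # score the fitted model on a copy of X with a single column permuted
    X_ = X.copy(deep=True)
    # assign the permutation back instead of shuffling .values in place,
    # which only works when .values happens to be a writable view
    X_[colname] = np.random.permutation(X_[colname].to_numpy())
    pred = fit.predict(X_)
    return {'col_name': colname, 'acc_scr': accuracy_score(y, pred)}

def runConcurrent(classifier, X, y):
    kf = KFold(n_splits=5, shuffle=False)
    acc_scr0 = pd.Series(dtype=float)  # baseline accuracy per fold
    acc_scr1 = pd.DataFrame(columns=X.columns, dtype=float)  # accuracy per fold and shuffled column
    # split data into training and validation folds
    for i, (train_idx, val_idx) in enumerate(kf.split(X, y)):
        X_train, y_train = X.iloc[train_idx, :], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx, :], y.iloc[val_idx]

        fit = classifier.fit(X=X_train, y=y_train)
        # baseline accuracy score on the untouched validation fold
        pred = fit.predict(X_val)
        acc_scr0.loc[i] = accuracy_score(y_val, pred)

        # with concurrent.futures.ProcessPoolExecutor() as executor:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            # randomShuffle takes no labels argument, so none is passed
            results = [executor.submit(randomShuffle, colname=j, X=X_val, y=y_val, fit=fit) for j in X.columns]
            for res in concurrent.futures.as_completed(results):
                out = res.result()
                acc_scr1.loc[i, out['col_name']] = out['acc_scr']
    # return the collected scores instead of discarding them
    return acc_scr0, acc_scr1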
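ProcessPoolExecutor pickles the arguments of every submitted task, so passing the full validation data with each task can be expensive. If that turns out to be the bottleneck, one option is to hand the large read-only inputs to each worker once through the executor's `initializer` and `initargs` arguments, so that each task carries only a column index. This is a minimal standard-library sketch, not a drop-in replacement: `init_worker`, `score_shuffled`, `shuffled_scores`, `toy_predict` and the list-of-lists data are hypothetical stand-ins for `fit.predict`, `X_val` and `y_val`.

```python
import concurrent.futures
import random

_shared = {}  # filled once per worker by init_worker


def init_worker(rows, labels, predict):
    """Runs once in each worker; stores the large read-only inputs."""
    _shared.update(rows=rows, labels=labels, predict=predict)


def score_shuffled(col, seed=0):
    """Shuffle column `col` in a private copy and return (col, accuracy)."""
    rows = [list(r) for r in _shared["rows"]]  # copy, so the shared data stays intact
    values = [r[col] for r in rows]
    random.Random(seed).shuffle(values)
    for r, v in zip(rows, values):
        r[col] = v
    predict, labels = _shared["predict"], _shared["labels"]
    hits = sum(predict(r) == t for r, t in zip(rows, labels))
    return col, hits / len(rows)


def toy_predict(row):
    """Hypothetical stand-in for fit.predict: only column 0 matters."""
    return int(row[0] > 0.5)


def shuffled_scores(executor_cls, rows, labels, predict):
    """Score every column; each task sends only a column index to the worker."""
    with executor_cls(initializer=init_worker, initargs=(rows, labels, predict)) as ex:
        return dict(ex.map(score_shuffled, range(len(rows[0]))))
```

Calling `shuffled_scores(concurrent.futures.ProcessPoolExecutor, rows, labels, toy_predict)` under an `if __name__ == "__main__":` guard runs the tasks in worker processes, and passing `concurrent.futures.ThreadPoolExecutor` instead gives a like-for-like comparison. Which one is faster depends on how much of the prediction step runs outside the GIL, so it is worth timing both.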
