Wednesday, 22 September 2021

Multiprocessing Pool gets successively slower after multiple calls

I want to iteratively train 1000 random forests on a dataset. To speed things up, I'm trying to use multiple cores inside the training loop. A working example is below:

from sklearn.ensemble import RandomForestClassifier
from multiprocessing import Pool,cpu_count
import numpy as np
import pandas as pd
from time import time

n = 2000
ndims = 5000

X = pd.DataFrame(np.random.normal(0,1,n*ndims).reshape((n,ndims)))
y = pd.Series(np.random.choice([0, 1], size=(n)))


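def draw_batches(n,size=100):
    # Yield batch sizes that sum to n: full batches of `size`,
    # followed by one smaller batch if n is not a multiple of size.
    steps = np.arange(0,n,size)
    if not n%size == 0:
        # drop the leading 0 and append the remainder as the last entry
        steps = np.append(steps,n%size)[1:]
    for step in steps:
        if not step%size == 0:
            yield step  # the remainder batch
        else:
            yield size  # a full batch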


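def pool(method,iters):
    # Run `method` over `iters` in a fresh 4-process pool, then tear the pool down.
    output = []
    p = Pool(4)
    try:
        output = p.map(method,iters)
    except Exception as e:
        # on failure, report the error and return the empty list
        print(e)
    finally:
        # stop accepting work, wait for the workers to exit, drop the reference
        p.close()
        p.join()
        del p
    return output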


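def importances(args):
    # i is only the task index; it is not used by the fit itself
    model, i = args
    # X and y are the module-level globals defined above
    y_ = y.copy()
    model.fit(X,y_)
    return model.feature_importances_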

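n_iters = 100
model_cls = RandomForestClassifier

# The guard is needed when worker processes are started with "spawn"
# (the default on Windows and macOS), which re-imports this module.
if __name__ == "__main__":
    for batch in draw_batches(n_iters,4):
        print(batch)
        t = time()
        train_args = [(model_cls(n_estimators=50),i) for i in np.arange(batch)]
        imps = pool(importances,train_args)
        # average wall-clock seconds per model in this batch
        print((time()-t)/batch)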

Though not as pronounced as in the code I'm actually working on, the example above shows that the processing time per model gradually increases as more batches are run. I wouldn't expect this, since each pool is self-contained and is closed, joined, and deleted at the end of every call.
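
To narrow it down, a minimal sketch (not the code above) that times pool start-up, the map call, and teardown separately for each batch might look like this. `busy_work`, `timed_pool_map`, and the `pool_factory` parameter are made-up names for illustration, and the cheap loop only stands in for the random forest fit:

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool
from time import perf_counter


def busy_work(k):
    # Hypothetical stand-in for model.fit: a small CPU-bound loop.
    return sum(i * i for i in range(k))


def timed_pool_map(method, iters, processes=4, pool_factory=Pool):
    """Run one batch in a fresh pool and time each phase separately.

    pool_factory is an assumed parameter: Pool gives worker processes,
    ThreadPool gives threads in the same process as a baseline.
    """
    t0 = perf_counter()
    p = pool_factory(processes)
    t1 = perf_counter()
    try:
        output = p.map(method, iters)
        t2 = perf_counter()
    finally:
        # same teardown as in the original pool() helper
        p.close()
        p.join()
    t3 = perf_counter()
    return output, {"startup": t1 - t0, "map": t2 - t1, "teardown": t3 - t2}


if __name__ == "__main__":
    for batch_no in range(5):
        _, timings = timed_pool_map(busy_work, [200_000] * 4)
        print(batch_no, {k: round(v, 4) for k, v in timings.items()})
```

If only one of the three numbers grows from batch to batch, that is the part to look at. If none of them grows with the stand-in task, the effect more likely comes from the work done inside `importances` than from the pool itself.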

What is causing the slowdown?


