Thursday, 3 January 2019

How to use shared memory instead of passing objects via pickling between multiple processes

I am working on a CPU-intensive ML problem centered around an additive model. Since addition is the main operation, I can split the input data into pieces, build a partial model for each piece, and then merge the partial models with the overridden __add__ method.
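To illustrate the structure, a stripped-down, hypothetical sketch of the additive pattern would look roughly like this (the real FragmentModel holds far more state than a single counter):

# Hypothetical, stripped-down sketch of the additive pattern; the real
# FragmentModel stores far more state than a single Counter.
from collections import Counter

class FragmentModel:
    def __init__(self, order, indata=None, shuffle=False):
        self.order = order
        self.counts = Counter()
        if indata is not None:                       # shuffle is ignored in this sketch
            for line in indata:
                self.counts.update(line.split())     # placeholder "training"

    def __add__(self, other):
        merged = FragmentModel(self.order)
        merged.counts = self.counts + other.counts   # additive merge
        return merged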

The multiprocessing-related code looks like this:

import copy
import logging
from functools import partial
from multiprocessing.pool import ThreadPool

logger = logging.getLogger(__name__)

# `args` (parsed command-line options) and FragmentModel are defined elsewhere.

def pool_worker(filename, doshuffle):
    print(f"Processing file: {filename}")
    with open(filename, 'r') as f:
        # Build a partial model from a single input file.
        return FragmentModel(order=args.order, indata=f, shuffle=doshuffle)

def generateModel(is_mock=False, save=True):
    model = None
    with ThreadPool(args.nthreads) as pool:
        partial_models = pool.imap_unordered(
            partial(pool_worker, doshuffle=is_mock), args.input)
        for i, m in enumerate(partial_models):
            logger.info(f'Starting to merge model {i}')
            if model is None:
                model = copy.deepcopy(m)   # the first fragment becomes the base model
            else:
                model += m                 # additive merge via the overridden __add__
            logger.info('Done merging...')

    return model

The issue is that memory consumption scales exponentially with the model order, so at order 4 each model instance is about 4-5 GB. This causes the thread pool to crash, because intermediate model objects of that size can no longer be pickled.

I have read up on this a bit, and it appears that even when pickling itself is not a problem, passing data this way is still extremely inefficient, as noted in the comments on this answer.

There is very little guidance, however, on how to use shared memory for this purpose. Is it possible to avoid this problem without having to change the internals of the model object?
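For reference, the closest built-in mechanism I have found is multiprocessing.Array (shared ctypes memory). A minimal sketch of the direction I am considering, assuming, purely hypothetically, that each partial model could be flattened into a fixed-size float64 array and merged by elementwise addition, would be something like this (STATE_SIZE, worker and merge_files are made-up names, not part of my actual code):

# Minimal sketch: each worker writes its flattened partial model into its own
# row of a shared buffer, so nothing large is ever pickled back to the parent.
import ctypes
import numpy as np
from multiprocessing import Array, Pool

STATE_SIZE = 1_000_000   # hypothetical size of a flattened partial model

shared_buf = None        # set in each worker process by the pool initializer

def init_worker(buf):
    global shared_buf
    shared_buf = buf

def worker(task):
    row, filename = task
    # View the shared buffer as a 2D array without copying.
    table = np.frombuffer(shared_buf.get_obj(), dtype=np.float64)
    table = table.reshape(-1, STATE_SIZE)
    # ... train a partial model from `filename` and flatten it into table[row] ...
    table[row, :] = 0.0   # placeholder: rows are disjoint, so no locking is needed
    return filename

def merge_files(filenames, nprocs=4):
    n = len(filenames)
    buf = Array(ctypes.c_double, n * STATE_SIZE)   # one row per input file
    with Pool(nprocs, initializer=init_worker, initargs=(buf,)) as pool:
        pool.map(worker, list(enumerate(filenames)))
    table = np.frombuffer(buf.get_obj(), dtype=np.float64).reshape(n, STATE_SIZE)
    return table.sum(axis=0)   # additive merge happens in the parent

The obvious catch is that this presumes the model state can be mapped onto a flat numeric buffer up front, which may be exactly the kind of change to the model internals I was hoping to avoid.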



from How to use shared memory instead of passing objects via pickling between multiple processes
