What factors determine an optimal `chunksize` argument to methods like `multiprocessing.Pool.map()`?
From the docs:

> The `chunksize` argument is the same as the one used by the `map()` method. For very long iterables, using a large value for `chunksize` can make the job complete much faster than using the default value of 1.
Example - say that I am:

- Passing an `iterable` to `.map()` that has ~15 million elements;
- Working on a machine with 24 cores and using the default `processes = os.cpu_count()` within `multiprocessing.Pool()` (a minimal sketch of this setup follows).
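For concreteness, here is roughly what I mean (`transform` and the element count are placeholders for my real workload):

```python
import multiprocessing

def transform(x):
    # stand-in for the real per-element work
    return x * x

if __name__ == "__main__":
    data = range(15_000_000)  # the ~15 million elements
    # processes defaults to os.cpu_count(), i.e. 24 on this machine
    with multiprocessing.Pool() as pool:
        results = pool.map(transform, data)  # chunksize left at its default
```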
My naive thinking is to give each of the 24 workers an equally-sized chunk, i.e. `15_000_000 / 24` or 625,000. Large chunks should reduce turnover/overhead while fully utilizing all workers. But it seems that this is missing some potential downsides of giving large batches to each worker. Is this an incomplete picture, and what am I missing?
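That naive arithmetic, spelled out:

```python
import math

n_items = 15_000_000
n_workers = 24

naive_chunksize = math.ceil(n_items / n_workers)
print(naive_chunksize)  # 625000 -- exactly one chunk per worker
```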
Part of my question stems from the default logic when `chunksize=None`: both `.map()` and `.starmap()` call `.map_async()`, which looks like this:
```python
def _map_async(self, func, iterable, mapper, chunksize=None, callback=None,
        error_callback=None):
    # ... (materialize `iterable` to a list if it's an iterator)
    if chunksize is None:
        chunksize, extra = divmod(len(iterable), len(self._pool) * 4)  # ????
        if extra:
            chunksize += 1
    if len(iterable) == 0:
        chunksize = 0
```
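Plugging my numbers into that default (the same `divmod` lifted out of the method, just to check the arithmetic):

```python
n_items = 15_000_000
n_pool = 24  # len(self._pool) with the default processes

chunksize, extra = divmod(n_items, n_pool * 4)
if extra:
    chunksize += 1
print(chunksize)  # 156250 -- i.e. four chunks per worker
```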
What's the logic behind `divmod(len(iterable), len(self._pool) * 4)`? This implies that the chunksize will be closer to `15_000_000 / (24 * 4) == 156_250`. What's the intention in multiplying `len(self._pool)` by 4?
This makes the resulting chunksize a factor of 4 smaller than my "naive logic" from above, which consists of just dividing the length of the iterable by the number of workers in `pool._pool`.
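To probe the downside I suspect, here is a toy scheduling simulation (entirely my own construction: the cost distribution is made up and it ignores per-chunk IPC overhead). With uneven per-element costs, one big chunk per worker means the whole job waits on whichever worker drew the most expensive chunk, while smaller chunks let early finishers pick up extra work:

```python
import heapq
import random

def makespan(costs, chunksize, n_workers):
    """Greedy pool simulation: each chunk goes to the worker that frees up first."""
    chunk_costs = [sum(costs[i:i + chunksize])
                   for i in range(0, len(costs), chunksize)]
    free_at = [0.0] * n_workers          # time at which each worker goes idle
    heapq.heapify(free_at)
    for c in chunk_costs:                # chunks are handed out in order
        heapq.heappush(free_at, heapq.heappop(free_at) + c)
    return max(free_at)                  # job finishes when the last worker does

random.seed(0)
# Made-up heavy-tailed workload: most elements cheap, a few very expensive.
costs = [random.expovariate(1.0) ** 3 for _ in range(15_000)]

workers = 24
for label, cs in [("1 chunk/worker ", -(-len(costs) // workers)),
                  ("4 chunks/worker", -(-len(costs) // (workers * 4)))]:
    print(f"{label} (chunksize={cs}): makespan {makespan(costs, cs, workers):.1f}")
```

In runs like this the smaller chunks tend to finish sooner, which would explain using *some* multiplier - but it doesn't tell me why 4 specifically rather than 2 or 8.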
A related answer that is helpful but a bit too high-level: Python multiprocessing: why are large chunksizes slower?