Tuesday, 19 October 2021

Dask dataframe: Can `set_index` put a single index into multiple partitions?

Empirically, it seems that whenever you call `set_index` on a Dask dataframe, Dask always puts all rows with the same index value into a single partition, even if this results in wildly imbalanced partitions.

Here is a demonstration:

import pandas as pd
import dask.dataframe as dd

# Three distinct user ids, 1000 rows each.
users = [1]*1000 + [2]*1000 + [3]*1000

df = pd.DataFrame({'user': users})
ddf = dd.from_pandas(df, npartitions=1000)

ddf = ddf.set_index('user')

# Count the rows in each partition; only two partitions are non-empty.
counts = ddf.map_partitions(lambda x: len(x)).compute()
counts.loc[counts > 0]
# 500    1000
# 999    2000
# dtype: int64

However, I have found no documented guarantee of this behaviour anywhere.

I have tried to sift through the code myself but gave up. I believe the answer probably lies in one of the inter-related functions involved in the `set_index` implementation.

When you set_index, is it the case that a single index can never be in two different partitions? If not, then under what conditions does this property hold?


Bounty: I will award the bounty to an answer that draws from a reputable source. For example, referring to the implementation to show that this property has to hold.



from Dask dataframe: Can `set_index` put a single index into multiple partitions?
