I am trying to merge multiple pandas dataframes onto a large Dask dataframe with the fields ["a_id", "b_id", "c_id"]. Each pandas dataframe ("A", "B", and "C") has a key field ("a_id", "b_id", and "c_id", respectively) that joins it to the Dask dataframe; "B" and "C" also share a field "b_Field1":
import pandas as pd
import dask.dataframe as dd

# B and C share the column "b_Field1"
A = pd.DataFrame({'a_id': [1, 2, 3], 'a_Field1': [0, 0, 0]})
B = pd.DataFrame({'b_id': [3, 4, 5], 'b_Field1': [7, 8, 9]})
C = pd.DataFrame({'c_id': [4, 5, 6], 'b_Field1': [6, 7, 8], 'c_Field1': [10, 11, 12]})

# The pure-pandas version of the three merges works as expected
pdf = pd.DataFrame({'a_id': [1, 2], 'b_id': [3, 4], 'c_id': [5, 6]})
pdf = pdf.merge(A, how="left", on="a_id")
pdf = pdf.merge(B, how="left", on="b_id")
pdf = pdf.merge(C, how="left", on=["c_id", "b_Field1"])
print(pdf)
"""
Result:
a_id b_id c_id a_Field1 b_Field1 c_Field1
0 1 3 5 0 7 11
1 2 4 6 0 8 12
"""
dA = dd.from_pandas(A, npartitions=1)
dB = dd.from_pandas(B, npartitions=1)
dC = dd.from_pandas(C, npartitions=1)
ddf = dd.from_pandas(pdf, npartitions=1)

# Same sequence of merges, but on Dask dataframes
ddf = ddf.merge(dA, how="left", on="a_id")
ddf = ddf.merge(dB, how="left", on="b_id")
ddf = ddf.merge(dC, how="left", on=["c_id", "b_Field1"])
This fails, saying that there is no field "b_Field1" in ddf. My thinking was that I needed to execute a .compute() between merging B and C, but that causes Dask to hang endlessly at 40% on the progress bar (it eventually dies with a MemoryError).
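That attempt replaced the last two merges above with something like this (a sketch; "tmp" is just an illustrative name):

# materialize the B merge as a pandas frame, then re-wrap it for the final merge
tmp = ddf.merge(dB, how="left", on="b_id").compute()
ddf = dd.from_pandas(tmp, npartitions=1)
ddf = ddf.merge(dC, how="left", on=["c_id", "b_Field1"])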
Is the compute necessary before doing the second join? And if so, is there any reason it would hang? These datasets are just barely small enough to merge in pure pandas, and that merge happens quickly, but I'm trying to make this deployable on machines with less RAM.