I am trying to merge multiple pandas dataframes onto a large Dask dataframe with the fields ["a_id", "b_id", "c_id"]. Each pandas dataframe ("A", "B", and "C") has a key field ("a_id", "b_id", and "c_id", respectively) that joins it to the Dask dataframe; "B" and "C" also share a field "b_Field1":
import pandas as pd
import dask.dataframe as dd

# B and C share the column "b_Field1"
A = pd.DataFrame({'a_id': [1, 2, 3], 'a_Field1': [0, 0, 0]})
B = pd.DataFrame({'b_id': [3, 4, 5], 'b_Field1': [7, 8, 9]})
C = pd.DataFrame({'c_id': [4, 5, 6], 'b_Field1': [6, 7, 8], 'c_Field1': [10, 11, 12]})

# The pure-pandas version of the three merges works as expected
pdf = pd.DataFrame({'a_id': [1, 2], 'b_id': [3, 4], 'c_id': [5, 6]})
pdf = pdf.merge(A, how="left", on="a_id")
pdf = pdf.merge(B, how="left", on="b_id")
pdf = pdf.merge(C, how="left", on=["c_id", "b_Field1"])
print(pdf)
"""
Result:
a_id b_id c_id a_Field1 b_Field1 c_Field1
0 1 3 5 0 7 11
1 2 4 6 0 8 12
"""
dA = dd.from_pandas(A, npartitions=1)
dB = dd.from_pandas(B, npartitions=1)
dC = dd.from_pandas(C, npartitions=1)
ddf = dd.from_pandas(pdf, npartitions=1)

# Same sequence of merges, but on Dask dataframes
ddf = ddf.merge(dA, how="left", on="a_id")
ddf = ddf.merge(dB, how="left", on="b_id")
ddf = ddf.merge(dC, how="left", on=["c_id", "b_Field1"])
This fails, saying that there is no field "b_Field1" in ddf. My thinking was that I needed to execute a .compute() between merging B and C, but that causes Dask to hang endlessly at 40% on the progress bar (it eventually dies with a MemoryError).
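That attempt replaced the last two merges above with something like this (a sketch; "tmp" is just an illustrative name):

# materialize the B merge as a pandas frame, then re-wrap it for the final merge
tmp = ddf.merge(dB, how="left", on="b_id").compute()
ddf = dd.from_pandas(tmp, npartitions=1)
ddf = ddf.merge(dC, how="left", on=["c_id", "b_Field1"])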
Is the compute necessary before doing the second join? And if so, is there any reason it would hang? These datasets are just barely small enough to merge in pure pandas, and that merge happens quickly, but I'm trying to make this deployable on machines with less RAM.