Wednesday, 20 February 2019

Concatenating a dask dataframe and a pandas dataframe

I have a dask dataframe (df) with around 250 million rows, read from a 10 GB CSV file. I have another pandas dataframe (ndf) of 25,000 rows. I would like to add the first column of the pandas dataframe to the dask dataframe, repeating each item 10,000 times so the lengths match (25,000 × 10,000 = 250 million).

Here's the code I tried, with the problem reduced to a smaller size (2,500 values repeated 10 times each over 25,000 rows).

import dask.dataframe as dd
import pandas as pd
import numpy as np

# Reduced-size stand-in for the 10 GB CSV
pd.DataFrame(np.random.rand(25000, 2)).to_csv("tempfile.csv")
df = dd.read_csv("tempfile.csv")
ndf = pd.DataFrame(np.random.randint(1000, 3500, size=2500))
# Attempt to attach ndf's first column, each value repeated 10 times
df['Node'] = np.repeat(ndf[0], 10)

This code fails with the following error:

ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.

I can perform a reset_index() followed by a set_index() to make df.known_divisions True for the dask dataframe, but that is a time-consuming operation. Is there a better, faster way to do what I am trying to do? Can I do this using pandas itself?

The end goal is to find the rows of ndf for which any of the corresponding rows in df match some criteria.
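As a hedged sketch of that end goal on the reduced example, assuming a made-up placeholder criterion (column 0 exceeding 0.9), the lookup could be done entirely in pandas: attach the repeated Node column, filter, and map the surviving Node values back to rows of ndf.

```python
import pandas as pd
import numpy as np

# Reduced-size data, as above
values = pd.DataFrame(np.random.rand(25000, 2))
ndf = pd.DataFrame(np.random.randint(1000, 3500, size=2500))

# Each ndf value labels 10 consecutive rows of the big frame.
values["Node"] = np.repeat(ndf[0].to_numpy(), 10)

# Node values for which at least one of the 10 corresponding rows
# matches the (placeholder) criterion:
hits = values.loc[values[0] > 0.9, "Node"].unique()
matching_ndf_rows = ndf[ndf[0].isin(hits)]
```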



from Concatenating a dask dataframe and a pandas dataframe
