I have a dask dataframe (df) with around 250 million rows (from a 10Gb CSV file). I have another pandas dataframe (ndf) of 25,000 rows. I would like to add the first column of the pandas dataframe to the dask dataframe, repeating each item 10,000 times.
Here's the code that I tried. I have reduced the problem to a smaller size.
import dask.dataframe as dd
import pandas as pd
import numpy as np

# stand-in for the 10Gb file: 25,000 rows, two columns
pd.DataFrame(np.random.rand(25000, 2)).to_csv("tempfile.csv")
df = dd.read_csv("tempfile.csv")

# 2,500 values, each to be repeated 10 times -> 25,000 rows
ndf = pd.DataFrame(np.random.randint(1000, 3500, size=2500))

# this assignment raises the error below
df['Node'] = np.repeat(ndf[0], 10)
With this code, I end up with an error:

ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
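Checking the divisions confirms where this comes from:

df.known_divisions
# False -- dd.read_csv does not track where each partition's index starts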
I can perform a reset_index() followed by a set_index() to make df.known_divisions True for the dask dataframe, but it is a time-consuming operation. Is there a better, faster way to do what I am trying to do? Can I do this using pandas itself?
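For reference, the slow workaround I mean looks roughly like this:

# works, but set_index shuffles and sorts the whole frame
df = df.reset_index().set_index('index')
df.known_divisions
# True

One direction I have been experimenting with, to avoid the shuffle entirely: since the new column depends only on row position, attach it partition by partition with to_delayed / from_delayed. This is only a sketch, not tested at full scale; note that from_delayed will compute the first partition to infer the metadata when meta is not given.

import numpy as np
import dask.dataframe as dd
from dask import delayed

full = np.repeat(ndf[0].values, 10)          # one Node value per df row
lengths = df.map_partitions(len).compute()   # row count of each partition
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))

def attach(part, vals):
    part = part.copy()
    part['Node'] = vals                      # plain ndarray: no index alignment
    return part

parts = [delayed(attach)(p, full[o:o + n])
         for p, o, n in zip(df.to_delayed(), offsets, lengths)]
df2 = dd.from_delayed(parts)

Whether this actually beats the set_index route probably depends on partition sizes. And if everything fits in memory, plain pandas (pd.read_csv plus the same np.repeat) sidesteps the alignment problem entirely, but at 10Gb that may not be an option.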
The end goal is to find rows from ndf where any of the corresponding rows from df matches some criteria.
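To make that concrete, the kind of filter I mean would look roughly like this (the column name '1' and the threshold are just placeholders):

# True for each Node value where any of its df rows passes the test
hit = (df2.assign(hit=df2['1'] > 0.99)
          .groupby('Node')['hit'].max().compute())
matches = ndf[ndf[0].map(hit).fillna(False).astype(bool)]

This sketch assumes Node values identify ndf rows uniquely; since my values can repeat, I would more likely tag each df row with the ndf row number (np.repeat(np.arange(len(ndf)), 10)) and group by that instead.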
from Concatenating a dask dataframe and a pandas dataframe