Monday, 5 August 2019

Drop rows efficiently in Dask

I'm trying to drop null values from a Dask dataframe. The example in the documentation works well for columns:

import dask.dataframe as dd
df = dd.read_csv("test.csv", assume_missing=True)
df.dropna(how='all', subset=None, thresh=None).compute()

But if I specify axis=0 in order to filter by rows, I get an error:

import dask.dataframe as dd
df = dd.read_csv("test.csv", assume_missing=True)
df.dropna(how='all', subset=None, thresh=None, axis=0).compute()

The documentation also says:

axis:{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

So I wrote this as a workaround:

import dask.dataframe as dd
df = dd.read_csv("test.csv", assume_missing=True)
filter_ = ~(df.isnull().all(axis=1).reset_index()[0])
df.loc[filter_].compute()

But it does not look Pythonic. Also, I'm resetting the index, and as far as I know that is an inefficient operation in Dask.
