Monday, 27 September 2021

Explosion of memory when using pandas .loc with unmatching indices + assignment giving duplicate axis error

This is an observation from "Most pythonic way to concatenate pandas cells with conditions". I am not able to understand why the third solution takes so much more memory than the first one.

  • If I don't sample, the third solution does not raise a runtime error, so clearly something is odd

  • To emulate a large dataframe I tried resampling, but I never expected to run into this kind of error

Background

Pretty self-explanatory: one line, looks pythonic

df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import io
import numpy as np  # needed for the np.where timing in the Speeds section
import pandas as pd
df = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True)
df

Speeds

%%timeit
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
# 877 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['final_target'] = np.where(df['city'].eq('paris'), 
                              df['city'] + '_' + df['arr'].astype(str), 
                              df['city'])
# 874 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I don't sample, there is no error and the outputs match exactly
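A minimal sketch of that check, assuming the five-row frame from the Background section without .sample(...), so the index is unique. The names expected, via_where and via_loc are mine, not from the original post.

```python
import io

import numpy as np
import pandas as pd

s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""

# No .sample() here, so the index is the unique RangeIndex 0..4.
df = pd.read_csv(io.StringIO(s))
expected = df['final_target'].tolist()

is_paris = df['city'] == 'paris'
suffix = '_' + df['arr'].astype(str)

# Second solution: np.where picks the suffixed name only for paris rows.
via_where = np.where(is_paris, df['city'] + suffix, df['city'])

# Third solution: in-place += on a .loc selection, on a copy of the frame.
out = df.copy()
out['final_target'] = out['city']
out.loc[is_paris, 'final_target'] += suffix
via_loc = out['final_target']

print(via_where.tolist() == expected, via_loc.tolist() == expected)
```

With unique labels, the right-hand Series can be aligned back to the selected rows one-to-one, which is why the .loc version behaves here.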

Error (updated; only happens when I sample from the dataframe)

%%timeit
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] +=  '_' + df['arr'].astype(str)
MemoryError: Unable to allocate 892. GiB for an array with shape (119671145392,) and data type int64

For a smaller input (sample size 100) we get a different error, which points to a problem with mismatched sizes. But what is going on with the memory allocation and the sampling?

ValueError: cannot reindex from a duplicate axis
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-57c5b10090b2> in <module>
      1 df['final_target'] = df['city']
----> 2 df.loc[df['city'] == 'paris', 'final_target'] +=  '_' + df['arr'].astype(str)

~/anaconda3/lib/python3.8/site-packages/pandas/core/ops/methods.py in f(self, other)
     99             # we are updating inplace so we want to ignore is_copy
    100             self._update_inplace(
--> 101                 result.reindex_like(self, copy=False), verify_is_copy=False
    102             )
    103 

I rerun them from scratch each time

Update

This is part of what I figured

s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(s)).sample(10, replace=True)
df
    city    arr final_target
1   paris   12  paris_12
0   paris   11  paris_11
2   dallas  22  dallas
2   dallas  22  dallas
3   miami   15  miami
3   miami   15  miami
2   dallas  22  dallas
1   paris   12  paris_12
0   paris   11  paris_11
3   miami   15  miami
  • Indices are repeated when sampled with replacement

  • Resetting the index resolves the problem, even though df['arr'] and the df.loc selection still have different sizes. Replacing the right-hand side with df.loc[df['city'] == 'paris', 'arr'].astype(str) also solves it, just as 2e0byo pointed out.

  • Still, can someone explain how .loc works here, and why memory explodes when the indices contain duplicates and don't match?
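A small sketch of what seems to be going on, under the assumption that the blow-up comes from index alignment in the addition rather than from .loc itself. When two Series with different indexes are added, pandas joins them on their labels, and a label that occurs m times on one side and n times on the other yields m * n rows. The toy Series left and right and the frame toy are hypothetical, chosen only to keep the numbers small.

```python
import pandas as pd

# The label 0 appears 3 times on the left and 2 times on the right.
left = pd.Series(['a', 'b', 'c'], index=[0, 0, 0])
right = pd.Series(['x', 'y'], index=[0, 0])

# The indexes differ, so the addition aligns on labels first:
# every left row with label 0 pairs with every right row with label 0.
combined = left + right
print(len(combined))  # 3 * 2 rows

# A frame with a repeated label, similar to a sample with replacement.
toy = pd.DataFrame(
    {'city': ['paris', 'paris', 'dallas', 'paris'], 'arr': [11, 12, 22, 11]},
    index=[0, 1, 2, 0],
)

# Workaround from the bullets above: make the labels unique first,
# and restrict the right-hand side to the same rows as the left.
fixed = toy.reset_index(drop=True)
fixed['final_target'] = fixed['city']
mask = fixed['city'] == 'paris'
fixed.loc[mask, 'final_target'] += '_' + fixed.loc[mask, 'arr'].astype(str)
print(fixed['final_target'].tolist())
```

This would be consistent with the reported MemoryError: with 1,000,000 rows drawn from 5 labels, each of the three paris labels occurs roughly 200,000 times on both sides, and 3 * 200,000 * 200,000 is about 1.2e11, the same order as the shape (119671145392,) in the message.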



