This is an observation from "Most pythonic way to concatenate pandas cells with conditions". I am not able to understand why the third solution takes so much more memory than the first one.
- If I don't sample, the third solution does not give a runtime error, so clearly something is weird.
- To emulate a large dataframe I tried to resample, but never expected to run into this kind of error.
Background
Pretty self-explanatory, one line, looks pythonic:
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
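A minimal sketch of why the multiplication in that one-liner works, using plain Python values instead of pandas columns. It assumes pandas applies the same element-wise semantics to string columns; the `rows` list is a hypothetical stand-in for two rows of the sample data.

```python
# (city, arr) pairs mirroring two rows of the post's sample data
rows = [("paris", 11), ("dallas", 22)]

results = []
for city, arr in rows:
    is_paris = city == "paris"
    # bool is a subclass of int, so True * "_11" is "_11" and False * "_22" is ""
    suffix = is_paris * ("_" + str(arr))
    results.append(city + suffix)

print(results)
```

The boolean acts as a 0-or-1 repeat count on the suffix string, which is why non-paris rows keep the bare city name.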
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import pandas as pd
import numpy as np  # needed for np.where in the timing below
import io
df = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True)
df
Speeds
%%timeit
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
# 877 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['final_target'] = np.where(df['city'].eq('paris'),
                              df['city'] + '_' + df['arr'].astype(str),
                              df['city'])
# 874 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I don't sample, there is no error and the outputs also match exactly.
Error (updated; only happens when I sample from the dataframe)
%%timeit
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
MemoryError: Unable to allocate 892. GiB for an array with shape (119671145392,) and data type int64
For a smaller input (sample size 100) we get a different error, which points to a problem with mismatched sizes. But what is going on with the memory allocation and the sampling?
ValueError: cannot reindex from a duplicate axis
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-57c5b10090b2> in <module>
1 df['final_target'] = df['city']
----> 2 df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
~/anaconda3/lib/python3.8/site-packages/pandas/core/ops/methods.py in f(self, other)
99 # we are updating inplace so we want to ignore is_copy
100 self._update_inplace(
--> 101 result.reindex_like(self, copy=False), verify_is_copy=False
102 )
103
I reran them from scratch each time.
Update
This is part of what I figured out:
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(s)).sample(10, replace=True)
df
     city  arr  final_target
1   paris   12      paris_12
0   paris   11      paris_11
2  dallas   22        dallas
2  dallas   22        dallas
3   miami   15         miami
3   miami   15         miami
2  dallas   22        dallas
1   paris   12      paris_12
0   paris   11      paris_11
3   miami   15         miami
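The repeated row labels in the output above can be checked directly. This is a small sketch that only inspects the index; `random_state=0` is an arbitrary seed added here for repeatability and is not part of the original post.

```python
import io

import pandas as pd

csv_text = """city,arr
paris,11
paris,12
dallas,22
miami,15
paris,16"""

base = pd.read_csv(io.StringIO(csv_text))

# Sampling with replacement keeps the original row labels, so they repeat.
sampled = base.sample(10, replace=True, random_state=0)
# 10 draws from only 5 labels must repeat at least one label.
print(sampled.index.is_unique)

# reset_index(drop=True) replaces the labels with a fresh 0..n-1 range.
clean = sampled.reset_index(drop=True)
print(clean.index.is_unique)
```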
- Indices are repeated when sampled with replacement.
- So resetting the indices resolved the problem, even though df.arr and the df.loc selection still have different sizes. Replacing the right-hand side with df.loc[df['city'] == 'paris', 'arr'].astype(str) also solves it, just as 2e0byo pointed out.
- Still, can someone explain how .loc works, and why memory explodes when the indices contain duplicates and don't match?
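One possible reading of the memory blow-up, offered as a sketch and not as a confirmed diagnosis: when two Series with different indexes are added, pandas aligns them on labels, and if a label repeats on both sides each left occurrence appears to be paired with each right occurrence. The helper `aligned_rows` below is hypothetical and simply counts rows under that assumption. The label set (0, 1, 4) for the paris rows comes from the five-row sample data.

```python
import random
from collections import Counter

import pandas as pd

# Two tiny Series whose indexes differ and both repeat the label 0.
left = pd.Series([1, 2], index=[0, 0])
right = pd.Series([10, 20, 30], index=[0, 0, 1])

# Arithmetic aligns on labels first. Label 0 occurs twice on each side, so it
# yields 2 * 2 rows; label 1 exists only on the right and adds one NaN row.
aligned = left + right
print(len(left), len(right), len(aligned))


def aligned_rows(left_labels, right_labels):
    """Estimate the row count of an outer alignment, assuming a per-label cross product."""
    left_counts, right_counts = Counter(left_labels), Counter(right_labels)
    total = 0
    for label in set(left_counts) | set(right_counts):
        # A label missing on one side still keeps every occurrence from the other side.
        total += max(left_counts[label], 1) * max(right_counts[label], 1)
    return total


# Mimic the post: 1,000,000 draws from row labels 0..4, where the 'paris'
# rows of the original five-row frame carry labels 0, 1 and 4.
rng = random.Random(0)  # arbitrary seed, only for repeatability
labels = rng.choices(range(5), k=1_000_000)
paris_labels = [label for label in labels if label in (0, 1, 4)]

estimate = aligned_rows(paris_labels, labels)
print(f"{estimate:,} rows, about {estimate * 8 / 2**30:.0f} GiB as int64")
```

Under this assumption the estimate comes out around 1.2 × 10^11 rows, the same order of magnitude as the shape (119671145392,) in the MemoryError, which would be consistent with the explosion coming from label alignment on duplicated indices and not from .loc itself.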
from Explosion of memory when using pandas .loc with umatching indices + assignment giving duplicate axis error