Tuesday, 7 August 2018

Curious memory consumption of pandas.unique()

While profiling the memory consumption of my algorithm, I was surprised to see that smaller inputs sometimes needed more memory than larger ones.

It all boils down to the following usage of pandas.unique():

import numpy as np
import pandas as pd
import sys

N = int(sys.argv[1])  # input size taken from the command line

a = np.arange(N, dtype=np.int64)  # N distinct int64 values
b = pd.unique(a)                  # all values are distinct, so len(b) == N

With N=6*10^7 it needs 3.7 GB of peak memory, but with N=8*10^7 "only" 3 GB.
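These numbers are peak resident set sizes as reported by /usr/bin/time -f%M (the full measurement scripts follow below). As a quick sanity check, the same quantity can also be read from inside the process; a minimal sketch using Python's standard resource module, whose ru_maxrss field is the peak RSS in KB on Linux:

import resource
import sys

import numpy as np
import pandas as pd

N = int(sys.argv[1])

a = np.arange(N, dtype=np.int64)
b = pd.unique(a)

# on Linux, ru_maxrss is the peak resident set size in KB,
# the same quantity /usr/bin/time -f%M reports
print(N, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)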

Scanning different input sizes yields the following graph:

[Figure: peak memory in KB as a function of the input size n]

Out of curiosity and for self-education: how can this counterintuitive behavior (i.e. more memory for a smaller input size) around N=5*10^7 and N=1.3*10^7 be explained?
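One plausible mechanism (an assumption on my part, not verified against the pandas source): pd.unique() fills a khash-based hash table that is preallocated from the input length; khash rounds bucket counts up to the next power of two and resizes once a 0.77 load factor is exceeded, and such a resize briefly holds the old and the new key array at the same time. Under these assumptions, an N just above 0.77 * next_pow2(N) forces a mid-insert rehash and therefore peaks higher than a somewhat larger N that is preallocated into the next size class from the start. A toy check in Python:

def next_pow2(n):
    # smallest power of two >= n (khash's rounding behaves like this)
    p = 1
    while p < n:
        p *= 2
    return p

# the constants (power-of-two buckets, 0.77 load factor) mirror khash;
# that pandas preallocates exactly next_pow2(N) buckets is an assumption
for n in (13 * 10**6, 5 * 10**7, 6 * 10**7, 8 * 10**7):
    buckets = next_pow2(n)
    rehash = n > 0.77 * buckets  # load factor exceeded during insert?
    print(n, buckets, "mid-insert rehash:", rehash)

Under this model the "expensive" windows are 0.77*2^k < N <= 2^k, i.e. roughly 1.29*10^7 to 1.68*10^7 and 5.17*10^7 to 6.71*10^7, which at least lines up with the regions above; the vector collecting the uniques grows by doubling as well and would add similar jumps.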


Here are the scripts for producing the memory consumption graph on Linux:

pandas_unique_test.py:

import numpy as np
import pandas as pd
import sys

N = int(sys.argv[1])

a = np.arange(N, dtype=np.int64)
b = pd.unique(a)

show_memory.py:

import sys

import matplotlib.pyplot as plt

ns = []
mems = []
for line in sys.stdin.readlines():
    # each input line is "<n> <peak memory in KB>"
    n, mem = map(int, line.strip().split(" "))
    ns.append(n)
    mems.append(mem)

plt.plot(ns, mems, label='peak-memory')
plt.xlabel('n')
plt.ylabel('peak memory in KB')
ymin, ymax = plt.ylim()
plt.ylim(0, ymax)  # start the y-axis at zero
plt.legend()
plt.show()

run_perf_test.sh:

WRAPPER="/usr/bin/time -f%M" # peak memory in KB
N=1000000
while [ $N -lt 100000000 ]
do
   printf "$N "
   $WRAPPER python pandas_unique_test.py $N
   N=`expr $N + 1000000`
done 

And now:

sh run_perf_test.sh 2>&1 | python show_memory.py



