I have an awkward CSV file which has multiple delimiters: the delimiter for the non-numeric part is ',', for the numeric part ';'. I want to construct a dataframe only out of the numeric part as efficiently as possible.
I have made 2 attempts utilising the converters argument of pd.read_csv, and 1 attempt using regex with engine='python'. They are all more than 2x slower than reading the entire CSV file with no conversions. This is prohibitively slow for my use case.
I understand the comparison isn't like-for-like, but it does demonstrate the overall poor performance is not driven by I/O. Is there a more efficient way to read in the data into a numeric Pandas dataframe? Or the equivalent NumPy array?
The below string can be used for benchmarking purposes.
# Python 3.6.5, Pandas 0.19.2
from io import StringIO
import pandas as pd
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6
def csv_reader_1(x):
df = pd.read_csv(x, usecols=[3], header=None, delimiter=',',
converters={3: lambda x: x.split(';')})
return df.join(pd.DataFrame(df.pop(3).values.tolist(), dtype=float))
def csv_reader_2(x):
df = pd.read_csv(x, header=None, delimiter=';',
converters={0: lambda x: x.rsplit(',')[-1]}, dtype=float)
return df.astype(float)
def csv_reader_3(x):
return pd.read_csv(x, usecols=[3, 4, 5], header=None, sep=',|;', engine='python')
Checks:
res1 = csv_reader_1(StringIO(x))
res2 = csv_reader_2(StringIO(x))
res2 = csv_reader_2(StringIO(x))
print(res1.head(3))
# 0 1 2
# 0 34.23 562.45 213.5432
# 1 56.23 63.45 625.2340
# 2 34.23 562.45 213.5432
assert res1.equals(res2)
assert res1.equals(res3)
Benchmarking results:
%timeit csv_reader_1(StringIO(x)) # 5.28 s per loop
%timeit csv_reader_2(StringIO(x)) # 3.98 s per loop
%timeit csv_reader_3(StringIO(x)) # 12.2 s per loop
%timeit pd.read_csv(StringIO(x)) # 1.69 s per loop
from Reading data from CSV into dataframe with multiple delimiters efficiently
No comments:
Post a Comment