Monday, 7 January 2019

Reading data from CSV into dataframe with multiple delimiters efficiently

I have an awkward CSV file which has multiple delimiters: the delimiter for the non-numeric part is ',', for the numeric part ';'. I want to construct a dataframe only out of the numeric part as efficiently as possible.

I have made 2 attempts utilising the converters argument of pd.read_csv, and 1 attempt using regex with engine='python'. They are all more than 2x slower than reading the entire CSV file with no conversions. This is prohibitively slow for my use case.

I understand the comparison isn't like-for-like, but it does demonstrate the overall poor performance is not driven by I/O. Is there a more efficient way to read in the data into a numeric Pandas dataframe? Or the equivalent NumPy array?

The below string can be used for benchmarking purposes.

# Python 3.6.5, Pandas 0.19.2

from io import StringIO
import pandas as pd

x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6

def csv_reader_1(x):
    df = pd.read_csv(x, usecols=[3], header=None, delimiter=',',
                     converters={3: lambda x: x.split(';')})
    return df.join(pd.DataFrame(df.pop(3).values.tolist(), dtype=float))

def csv_reader_2(x):
    df = pd.read_csv(x, header=None, delimiter=';',
                     converters={0: lambda x: x.rsplit(',')[-1]}, dtype=float)
    return df.astype(float)

def csv_reader_3(x):
    return pd.read_csv(x, usecols=[3, 4, 5], header=None, sep=',|;', engine='python')

Checks:

res1 = csv_reader_1(StringIO(x))
res2 = csv_reader_2(StringIO(x))
res2 = csv_reader_2(StringIO(x))

print(res1.head(3))

#        0       1         2
# 0  34.23  562.45  213.5432
# 1  56.23   63.45  625.2340
# 2  34.23  562.45  213.5432

assert res1.equals(res2)
assert res1.equals(res3)

Benchmarking results:

%timeit csv_reader_1(StringIO(x))  # 5.28 s per loop
%timeit csv_reader_2(StringIO(x))  # 3.98 s per loop
%timeit csv_reader_3(StringIO(x))  # 12.2 s per loop
%timeit pd.read_csv(StringIO(x))   # 1.69 s per loop



from Reading data from CSV into dataframe with multiple delimiters efficiently

No comments:

Post a Comment