Wednesday, 17 August 2022

Efficient way to create DataFrame with different column types

I need to read data from a numeric Postgres table and create a DataFrame from it.

The default way Pandas does this is with DataFrame.from_records:

from pandas import DataFrame

df = DataFrame.from_records(data,
                            columns=columns,
                            coerce_float=coerce_float)

When data looks like:

[(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)...] 

And columns looks like:

['a', 'b']

The problem is that the generated DataFrame ignores the original Postgres types: double precision and real.
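
For illustration (a minimal sketch using the sample rows above), both columns come back as float64:

import pandas as pd

data = [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)]

df = pd.DataFrame.from_records(data, columns=['a', 'b'])
print(df.dtypes)  # 'a' and 'b' are both float64, so the real/float32
                  # distinction from Postgres is lost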

Since I work with huge DataFrames and my data is mostly real, I'd like to specify the column types explicitly.

So I tried:

import numpy as np

# build a structured array whose field dtypes carry the column types
df = DataFrame.from_records(np.array(data, dtype=columns),
                            coerce_float=coerce_float)

Here data is the same as before, but columns looks like:

[('a', 'float64'), ('b', 'float32')]

(the types are extracted from Postgres as part of the query and converted to NumPy dtypes)
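
As a quick sanity check (a sketch with the sample rows above), this route does preserve the intended dtypes:

import numpy as np
import pandas as pd

data = [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)]

# the structured array carries one dtype per field...
arr = np.array(data, dtype=[('a', 'float64'), ('b', 'float32')])

# ...and from_records keeps those field dtypes as column dtypes
df = pd.DataFrame.from_records(arr)
print(df.dtypes)  # a: float64, b: float32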

This approach works, but DataFrame construction becomes 2-3 times slower (several seconds for a 2M-row DataFrame), because the np.array call that builds the structured array is for some reason very slow. In real life I have 10-200 columns, mostly float32.
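
One alternative I considered (a sketch only, with a hypothetical helper name, and not benchmarked) is to skip the structured array entirely: transpose the row tuples, convert each column to a plain ndarray of the right dtype, and build the DataFrame from a dict:

import numpy as np
import pandas as pd

def frame_from_rows(data, dtypes):
    # dtypes is the same [('a', 'float64'), ('b', 'float32')] list as above
    columns = zip(*data)  # transpose: row tuples -> per-column tuples
    return pd.DataFrame({name: np.array(col, dtype=dt)
                         for (name, dt), col in zip(dtypes, columns)})

But I don't know whether per-column conversion like this is actually the fastest option.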

What is the fastest way to construct a DataFrame with specified column types?



