Wednesday, 17 August 2022

Efficient way to create DataFrame with different column types

I need to read data from a numeric Postgres table and create a DataFrame from it.

The default way Pandas does this is with DataFrame.from_records:

from pandas import DataFrame

df = DataFrame.from_records(data,
                            columns=columns,
                            coerce_float=coerce_float)

When data looks like:

[(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)...] 

And columns looks like:

['a', 'b']

The problem is that the generated DataFrame ignores the original Postgres types: double precision and real.
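
For illustration (a minimal sketch using the sample rows above), both columns come back as float64:

import pandas as pd

data = [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)]

df = pd.DataFrame.from_records(data, columns=['a', 'b'])
print(df.dtypes)  # 'a' and 'b' are both float64, so the real/float32
                  # distinction from Postgres is lost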

Since I work with huge DataFrames and my data is mostly real, I'd like to specify the column types explicitly.

So I tried:

import numpy as np

# build a structured array whose field dtypes carry the column types
df = DataFrame.from_records(np.array(data, dtype=columns),
                            coerce_float=coerce_float)

Here data is the same as before, but columns looks like:

[('a', 'float64'), ('b', 'float32')]

(the types are extracted from Postgres as part of the query and converted to NumPy dtypes)
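
As a quick sanity check (a sketch with the sample rows above), this route does preserve the intended dtypes:

import numpy as np
import pandas as pd

data = [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)]

# the structured array carries one dtype per field...
arr = np.array(data, dtype=[('a', 'float64'), ('b', 'float32')])

# ...and from_records keeps those field dtypes as column dtypes
df = pd.DataFrame.from_records(arr)
print(df.dtypes)  # a: float64, b: float32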

This approach works, but DataFrame construction becomes 2-3 times slower (several seconds for a 2M-row DataFrame), because the np.array call that builds the structured array is for some reason very slow. In real life I have 10-200 columns, mostly float32.
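
One alternative I considered (a sketch only, with a hypothetical helper name, and not benchmarked) is to skip the structured array entirely: transpose the row tuples, convert each column to a plain ndarray of the right dtype, and build the DataFrame from a dict:

import numpy as np
import pandas as pd

def frame_from_rows(data, dtypes):
    # dtypes is the same [('a', 'float64'), ('b', 'float32')] list as above
    columns = zip(*data)  # transpose: row tuples -> per-column tuples
    return pd.DataFrame({name: np.array(col, dtype=dt)
                         for (name, dt), col in zip(dtypes, columns)})

But I don't know whether per-column conversion like this is actually the fastest option.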

What is the fastest way to construct a DataFrame with specified column types?



