I need to read data from a numeric Postgres table and create a DataFrame accordingly.
The default way pandas does this is with DataFrame.from_records:

    df = DataFrame.from_records(data,
                                columns=columns,
                                coerce_float=coerce_float)

where data looks like:

    [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329), ...]
and columns looks like:

    ['a', 'b']

The problem is that the generated DataFrame ignores the original Postgres types, double precision and real.
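For illustration, here is a minimal, self-contained sketch of that default path, using the sample values and column names above (everything else is assumed). With plain Python floats in the tuples, every column comes out as float64, regardless of the original Postgres types:

    import pandas as pd

    # sample of the records returned by the Postgres driver (double precision, real)
    data = [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)]
    columns = ['a', 'b']

    df = pd.DataFrame.from_records(data, columns=columns, coerce_float=True)
    print(df.dtypes)  # both 'a' and 'b' are reported as float64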
Since I work with huge DataFrames and my data is mostly real, I'd like to explicitly specify the column types.
So I tried:

    df = DataFrame.from_records(np.array(data, dtype=columns),
                                coerce_float=coerce_float)

where data is the same, but columns looks like:

    [('a', 'float64'), ('b', 'float32')]

(the types are extracted from Postgres as part of the query and converted to NumPy dtypes).
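Here is a runnable sketch of that structured-array variant, again using the sample values above (the rest is assumed for illustration); the per-field dtypes survive into the DataFrame:

    import numpy as np
    import pandas as pd

    data = [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)]
    columns = [('a', 'float64'), ('b', 'float32')]

    # building a structured array first lets each field carry its own dtype
    records = np.array(data, dtype=columns)
    df = pd.DataFrame.from_records(records, coerce_float=True)
    print(df.dtypes)  # 'a' is float64, 'b' is float32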
This approach works, but the DataFrame construction is 2-3 times slower (for DataFrames with 2M rows it takes several seconds), because the np.array generation is for some reason very slow. In real life I have 10-200 columns, mostly float32.
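To make the comparison concrete, a small timing harness along these lines can be used; the row count and the two-column layout are made up for illustration, and actual timings will depend on the machine and the data:

    import time
    import numpy as np
    import pandas as pd

    n_rows = 2_000_000
    # fake rows standing in for the driver's output: one float64 and one float32 column
    data = [(float(i), float(i)) for i in range(n_rows)]
    dtypes = [('a', 'float64'), ('b', 'float32')]

    t0 = time.perf_counter()
    df_plain = pd.DataFrame.from_records(data, columns=['a', 'b'], coerce_float=True)
    t1 = time.perf_counter()
    df_typed = pd.DataFrame.from_records(np.array(data, dtype=dtypes), coerce_float=True)
    t2 = time.perf_counter()

    print(f"plain from_records:   {t1 - t0:.2f} s")
    print(f"via structured array: {t2 - t1:.2f} s")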
What is the fastest way to construct a DataFrame with specified column types?
from Efficient way to create DataFrame with different column types