Saturday, 25 May 2019

Collapsing row-groups in Parquet efficiently

I have a large Parquet file with many small row groups. I'd like to produce a new Parquet file with a single (bigger) row group, and I'm working in Python. I could do something like:

import pyarrow.parquet as pq

# Read the whole file into memory, then write it back out;
# write_table packs the table into one big row group.
table = pq.read_table('many_tiny_row_groups.parquet')
pq.write_table(table, 'one_big_row_group.parquet')

# Lots of row groups...
print(pq.ParquetFile('many_tiny_row_groups.parquet').num_row_groups)
# Now, only 1 row group...
print(pq.ParquetFile('one_big_row_group.parquet').num_row_groups)

However, this requires that I read the entire Parquet file into memory at once. I would like to avoid doing that. Is there some sort of "streaming" approach in which the memory footprint can stay small?
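One way to keep the footprint bounded, as far as I can tell, is to read the source file a few row groups at a time and append each slice to a ParquetWriter. Note that each write_table call on a writer emits its own row group, so copying the groups one by one would just reproduce the original layout; batching several source groups per write is what actually collapses them. Here is a minimal sketch along those lines; the function name coalesce_row_groups and the groups_per_write parameter are my own, and it assumes a pyarrow version recent enough to provide ParquetFile.schema_arrow and ParquetFile.read_row_groups:

import pyarrow.parquet as pq

def coalesce_row_groups(src, dst, groups_per_write=32):
    """Rewrite `src` to `dst`, merging `groups_per_write` source
    row groups into each output row group."""
    pf = pq.ParquetFile(src)
    with pq.ParquetWriter(dst, pf.schema_arrow) as writer:
        for start in range(0, pf.num_row_groups, groups_per_write):
            stop = min(start + groups_per_write, pf.num_row_groups)
            # Only this slice of the file is materialised in memory.
            batch = pf.read_row_groups(list(range(start, stop)))
            # Each write_table call produces a new output row group.
            writer.write_table(batch)

coalesce_row_groups('many_tiny_row_groups.parquet',
                    'fewer_row_groups.parquet')

With groups_per_write tuned so that one batch of source row groups fits comfortably in memory, the peak footprint scales with the batch size rather than with the whole file; the trade-off is that the output ends up with roughly num_row_groups / groups_per_write row groups rather than exactly one.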



