I have a large Parquet file with many small row groups. I'd like to produce a new Parquet file with a single (bigger) row group, and I'm working in Python. I could do something like:
import pyarrow.parquet as pq
table = pq.read_table('many_tiny_row_groups.parquet')
pq.write_table(table, 'one_big_row_group.parquet')
# Lots of row groups...
print(pq.ParquetFile('many_tiny_row_groups.parquet').num_row_groups)
# Now, only 1 row group...
print(pq.ParquetFile('one_big_row_group.parquet').num_row_groups)
However, this requires that I read the entire Parquet file into memory at once. I would like to avoid doing that. Is there some sort of "streaming" approach in which the memory footprint can stay small?
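For reference, the closest bounded-memory approach I can see is sketched below (assuming pyarrow 3.0 or newer for ParquetFile.iter_batches; the output file name is just a placeholder). It streams record batches from the source file into a ParquetWriter, but since each write call flushes at least one row group, it coalesces the tiny row groups into fewer, bigger ones rather than exactly one:

import pyarrow as pa
import pyarrow.parquet as pq

source = pq.ParquetFile('many_tiny_row_groups.parquet')
# Stream one batch at a time, so memory stays roughly bounded by
# batch_size rows instead of the whole table.
with pq.ParquetWriter('fewer_bigger_row_groups.parquet', source.schema_arrow) as writer:
    for batch in source.iter_batches(batch_size=1_000_000):
        # Each write_table() call flushes at least one row group,
        # so this yields fewer, larger row groups, not a single one.
        writer.write_table(pa.Table.from_batches([batch]))

Is there a way to get all the way down to one row group without buffering the entire table?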
from Collapsing row-groups in Parquet efficiently