Saturday, 25 September 2021

How to store custom Parquet Dataset metadata with pyarrow?

How do I attach custom metadata to a ParquetDataset using pyarrow?

For example, suppose I create a Parquet dataset using Dask:

import dask
dask.datasets.timeseries().to_parquet('temp.parq')

I can then read it using pyarrow:

import pyarrow.parquet as pq
dataset = pq.ParquetDataset('temp.parq')

However, the method I would use to write metadata for a single Parquet file (outlined in "How to write Parquet metadata with pyarrow?") does not work for a ParquetDataset, since ParquetDataset has no replace_schema_metadata function or anything similar.

I think I would like to write a custom _common_metadata file, since the metadata I'd like to store pertains to the whole dataset. I imagine the procedure would be something like:

meta = pq.read_metadata('temp.parq/_common_metadata')
custom_metadata = {b'type': b'mydataset'}
merged_metadata = { **custom_metadata, **meta.metadata }
# TODO: construct a FileMetaData object (new_meta) that carries merged_metadata
new_meta.write_metadata_file('temp.parq/_common_metadata')


