Wednesday, 24 May 2023

Most efficient way to parse dataset generated using petastorm from parquet

Versions: Python 3.7.13, TensorFlow 2.9.1, Petastorm 0.12.1

I'm trying to implement a data-loading framework that creates a tf.data.Dataset from Parquet files stored in S3, using Petastorm.

I'm creating the dataset as follows:

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

cols = ['col1_nm', 'col2_nm', ...]  # feature column names

def parse(e):
    # e is a batch namedtuple; collect one tensor per feature column
    x_features = []
    for c in cols:
        x_features.append(getattr(e, c))
    X = tf.stack(x_features, axis=1)  # shape: (batch_size, n_features)
    y = getattr(e, 'target')
    return X, y

with make_batch_reader(s3_paths, schema_fields=cols + ['target']) as reader:
    dataset = make_petastorm_dataset(reader).map(parse)
    for e in dataset.take(3):
        print(e)

All is well, but I'd like to know whether there is a more efficient and maintainable alternative.
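
One variant I'm considering keeps the same parse function but adds the standard tf.data performance options. This is only a sketch based on general tf.data guidance, not something I've benchmarked against a Petastorm reader:

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_batch_reader(s3_paths, schema_fields=cols + ['target']) as reader:
    dataset = (
        make_petastorm_dataset(reader)
        .map(parse, num_parallel_calls=tf.data.AUTOTUNE)  # parse batches in parallel
        .prefetch(tf.data.AUTOTUNE)                       # overlap parsing with consumption
    )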

Before parsing, the dataset is of type DatasetV1Adapter, and each element e (obtained via dataset.take(1)) is of type inferred_schema_view, which consists of one EagerTensor per feature.

I've also tried splitting X and y by index, but reading the last element via [-1] does not return the target's EagerTensor.
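
My guess (an assumption on my part, not verified against the Petastorm source) is that the element is a plain namedtuple, so positional indexing follows the view's own field order rather than the order of the schema_fields list, meaning 'target' isn't necessarily last. Splitting by field name sidesteps that; parse_by_name below is a hypothetical helper, not part of either API:

def parse_by_name(e):
    # e._asdict() / e._fields are standard namedtuple methods,
    # so the split no longer depends on positional order
    d = e._asdict()                             # field name -> tensor
    y = d.pop('target')                         # pull the label out by name
    X = tf.stack([d[c] for c in cols], axis=1)  # keep feature order explicit
    return X, y

Printing e._fields on one element shows the actual field order of the inferred_schema_view.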



