Versions: Python 3.7.13, TensorFlow 2.9.1, Petastorm 0.12.1
I'm trying to implement a data-loading framework that creates a tf.data.Dataset from Parquet files stored in S3 using Petastorm.
I create the dataset as follows:
import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

cols = ['col1_nm', 'col2_nm', ...]  # feature column names

def parse(e):
    # e is a batch of columns; collect each feature column by name
    x_features = []
    for c in cols:
        x_features.append(getattr(e, c))
    # stack the per-column tensors into a (batch_size, n_features) matrix
    X = tf.stack(x_features, axis=1)
    y = getattr(e, 'target')
    return X, y

with make_batch_reader(s3_paths, schema_fields=cols + ['target']) as reader:
    dataset = make_petastorm_dataset(reader).map(parse)
    for e in dataset.take(3):
        print(e)
This all works, but I'd like to know whether there is a more efficient and maintainable alternative.
Before parsing, the dataset is of type DatasetV1Adapter, and each element e (obtained via dataset.take(1)) is of type inferred_schema_view, which contains one EagerTensor per feature.
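For reference, this is roughly how I observed those types (a minimal sketch; it reuses the s3_paths and cols placeholders from above):

with make_batch_reader(s3_paths, schema_fields=cols + ['target']) as reader:
    dataset = make_petastorm_dataset(reader)
    print(type(dataset))       # DatasetV1Adapter
    for e in dataset.take(1):
        print(type(e))         # inferred_schema_view (a namedtuple-like view)
        print(type(e.target))  # EagerTensor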
I've also tried splitting X and y by position; however, reading the last element via [-1] does not return the target's eager tensor.
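Concretely, the attempt looked roughly like this (a sketch; since e behaves like a namedtuple, e[-1] returns whichever field is last in the inferred schema's field order, which is apparently not 'target'):

def parse_by_index(e):
    # attempted positional split: all fields but the last as X, the last as y
    X = tf.stack(list(e[:-1]), axis=1)
    y = e[-1]  # not guaranteed to be 'target'; order follows the inferred schema
    return X, y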
Source: "Most efficient way to parse dataset generated using petastorm from parquet"