I have used the spark-tensorflow-connector package to write a tfrecord dataset.
When attempting to read the tfrecord files back in as a TFRecordDataset, I lose any field that contains an array of arrays (in my example this is "cold").
Steps to reproduce
- Create a Spark session with the spark-tensorflow-connector package on the classpath.
- Create an example data frame containing a nested array column.
from pyspark.sql.types import *

data = [("A", 1, [1.1, 2.0, 3.0], [[0.0, 1.0], [1.0, 0.0]]),
        ("B", 2, [11.0, 12.3, 13.0], [[0.0, 1.0], [1.0, 0.0]]),
        ("C", 3, [21.0, 22.0, 23.5], [[1.0, 0.0], [1.0, 0.0]])]

schema = StructType([
    StructField("colA", StringType(), True),
    StructField("colb", IntegerType(), True),
    StructField("colc", ArrayType(FloatType()), True),
    StructField("cold", ArrayType(ArrayType(FloatType())), True),
])

df = spark.createDataFrame(data=data, schema=schema)
- Write df to disk as tfrecord.
write_path = "/home/ec2-user/SageMaker/testwrite/test.tfrecord"
df.write.format("tfrecords").option("recordType", "SequenceExample").mode("overwrite").save(write_path)
- Attempt to read it back using tf.data.TFRecordDataset.
import tensorflow as tf
import os

files = [f"{write_path}/{x}" for x in os.listdir(write_path) if x.startswith("part")]
dataset = tf.data.TFRecordDataset(files)

for i in dataset.take(1):
    print(repr(i))
    example = tf.train.Example()
    example.ParseFromString(i.numpy())
    print(example)
The output of print(repr(i)) shows that "cold" does exist in the raw byte string:
<tf.Tensor: id=55, shape=(), dtype=string, numpy=b'\n8\n\r\n\x04colA\x12\x05\n\x03\n\x01C\n\r\n\x04colb\x12\x05\x1a\x03\n\x01\x03\n\x18\n\x04colc\x12\x10\x12\x0e\n\x0c\x00\x00\xa8A\x00\x00\xb0A\x00\x00\xbcA\x12&\n$\n\x04cold\x12\x1c\n\x0c\x12\n\n\x08\x00\x00\x80?\x00\x00\x00\x00\n\x0c\x12\n\n\x08\x00\x00\x80?\x00\x00\x00\x00'>
example.ParseFromString, however, does not recover the "cold" field :(
features {
  feature {
    key: "colA"
    value {
      bytes_list {
        value: "C"
      }
    }
  }
  feature {
    key: "colb"
    value {
      int64_list {
        value: 3
      }
    }
  }
  feature {
    key: "colc"
    value {
      float_list {
        value: 21.0
        value: 22.0
        value: 23.5
      }
    }
  }
}
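Here is a minimal sketch of what I think is going on, using the exact record bytes dumped above. The assumption (not confirmed anywhere in the connector docs I've read) is that with recordType "SequenceExample" the nested "cold" column lands in the record's feature_lists, which tf.train.Example silently skips as an unknown field:

```python
import tensorflow as tf

# The exact record bytes from the dump above, written with recordType "SequenceExample".
raw = (b'\n8\n\r\n\x04colA\x12\x05\n\x03\n\x01C'
       b'\n\r\n\x04colb\x12\x05\x1a\x03\n\x01\x03'
       b'\n\x18\n\x04colc\x12\x10\x12\x0e\n\x0c\x00\x00\xa8A\x00\x00\xb0A\x00\x00\xbcA'
       b'\x12&\n$\n\x04cold\x12\x1c'
       b'\n\x0c\x12\n\n\x08\x00\x00\x80?\x00\x00\x00\x00'
       b'\n\x0c\x12\n\n\x08\x00\x00\x80?\x00\x00\x00\x00')

# tf.train.Example only has a `features` field; a SequenceExample's
# `feature_lists` (where "cold" presumably lives) is dropped as an
# unknown field when the same bytes are parsed as an Example.
ex = tf.train.Example()
ex.ParseFromString(raw)
print("cold" in ex.features.feature)  # False

# Parsing the same bytes as a SequenceExample keeps both parts.
seq = tf.train.SequenceExample()
seq.ParseFromString(raw)
print("cold" in seq.feature_lists.feature_list)  # True
print([list(f.float_list.value)
       for f in seq.feature_lists.feature_list["cold"].feature])
# [[1.0, 0.0], [1.0, 0.0]]
```

So the data does not seem to be lost on disk; it is just invisible to tf.train.Example.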
I have had no success with tf.io.parse_single_example either. I could break all the nested arrays out into their own fields, but I would prefer not to, since I would only merge them back together again in TensorFlow.
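If the records really are SequenceExamples, tf.io.parse_single_sequence_example should be able to split them into context and sequence parts. A sketch; the feature names, shapes, and dtypes below are taken from my toy schema above and are my assumptions about what the connector writes, not verified behaviour:

```python
import tensorflow as tf

# Hypothetical parse spec matching the toy schema above: flat columns as
# context features, the nested "cold" column as a fixed-length sequence feature.
context_spec = {
    "colA": tf.io.FixedLenFeature([], tf.string),
    "colb": tf.io.FixedLenFeature([], tf.int64),
    "colc": tf.io.FixedLenFeature([3], tf.float32),
}
sequence_spec = {
    "cold": tf.io.FixedLenSequenceFeature([2], tf.float32),
}

def parse(raw):
    # Returns (context dict, sequence dict); "cold" comes back as a
    # dense (rows, 2) float32 tensor.
    return tf.io.parse_single_sequence_example(
        raw, context_features=context_spec, sequence_features=sequence_spec)

# dataset = tf.data.TFRecordDataset(files).map(parse)
```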
from Why can't I parse all my data when parsing byte string from TFRecrord file written from spark dataframe?