Tuesday, 27 April 2021

Why can't I parse all my data when parsing a byte string from a TFRecord file written from a Spark dataframe?

I have used the spark-tensorflow-connector package to write a TFRecord dataset from a Spark data frame.

When attempting to read the TFRecord files back in as a tf.data.TFRecordDataset, I lose every field that contains an array of arrays (in my example, "cold").

Steps to recreate

  1. Create a Spark session that includes the spark-tensorflow-connector package (a sketch is shown below).
  2. Create an example data frame containing a nested array column.
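
For step 1, a session along these lines should work. The Maven coordinate below is an assumption; adjust it to match your Spark and Scala versions.

from pyspark.sql import SparkSession

# Pull the connector jar onto the classpath when the session starts.
# The coordinate/version here is illustrative -- substitute the one that
# matches your cluster.
spark = (
    SparkSession.builder
    .appName("tfrecord-nested-array-example")
    .config("spark.jars.packages",
            "org.tensorflow:spark-tensorflow-connector_2.11:1.15.0")
    .getOrCreate()
)

The data frame for step 2 is then built as follows: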

from pyspark.sql.types import *

data = [("A", 1, [1.1, 2.0, 3.0], [[0.0, 1.0], [1.0, 0.0]]),
        ("B", 2, [11.0, 12.3, 13.0], [[0.0, 1.0], [1.0, 0.0]]),
        ("C", 3, [21.0, 22.0, 23.5], [[1.0, 0.0], [1.0, 0.0]]),
]

schema = StructType([
    StructField("colA", StringType(), True),
    StructField("colb", IntegerType(), True),
    StructField("colc", ArrayType(FloatType()), True),
    StructField("cold", ArrayType(ArrayType(FloatType())), True),
])
 
df = spark.createDataFrame(data=data,schema=schema)
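
For reference, df.printSchema() should confirm that "cold" goes in as an array of arrays before anything is written out:

# Sanity check: "cold" should appear as array<array<float>> in the Spark schema.
df.printSchema()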
  3. Write df to disk as a TFRecord dataset.
write_path = "/home/ec2-user/SageMaker/testwrite/test.tfrecord"

df.write.format("tfrecords").option("recordType", "SequenceExample").mode("overwrite").save(write_path)
  4. Attempt to read it back in using tf.data.TFRecordDataset.
import tensorflow as tf
import os 

files = [f"{write_path}/{x}" for x in os.listdir(write_path) if x.startswith("part")]
dataset = tf.data.TFRecordDataset(files)

for i in dataset.take(1):
    print(repr(i))
    example = tf.train.Example()
    example.ParseFromString(i.numpy())
    print(example)

The output of print(repr(i)) shows that "cold" is present in the serialized byte string:

<tf.Tensor: id=55, shape=(), dtype=string, numpy=b'\n8\n\r\n\x04colA\x12\x05\n\x03\n\x01C\n\r\n\x04colb\x12\x05\x1a\x03\n\x01\x03\n\x18\n\x04colc\x12\x10\x12\x0e\n\x0c\x00\x00\xa8A\x00\x00\xb0A\x00\x00\xbcA\x12&\n$\n\x04cold\x12\x1c\n\x0c\x12\n\n\x08\x00\x00\x80?\x00\x00\x00\x00\n\x0c\x12\n\n\x08\x00\x00\x80?\x00\x00\x00\x00'>

However, example.ParseFromString does not recover the "cold" field:

features {
  feature {
    key: "colA"
    value {
      bytes_list {
        value: "C"
      }
    }
  }
  feature {
    key: "colb"
    value {
      int64_list {
        value: 3
      }
    }
  }
  feature {
    key: "colc"
    value {
      float_list {
        value: 21.0
        value: 22.0
        value: 23.5
      }
    }
  }
}

I have had no success using tf.io.parse_single_example either. I could break all of my nested arrays out into their own fields, but I would prefer not to, since I would only be merging them back together again in TensorFlow.
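
For reference, my parse_single_example attempt looked roughly like the sketch below (the feature spec is illustrative, not the exact one I used); the nested "cold" column still does not come back:

# Illustrative feature spec -- treating every column as a flat feature.
feature_spec = {
    "colA": tf.io.FixedLenFeature([], tf.string),
    "colb": tf.io.FixedLenFeature([], tf.int64),
    "colc": tf.io.VarLenFeature(tf.float32),
    "cold": tf.io.VarLenFeature(tf.float32),  # comes back empty
}

def parse(serialized):
    return tf.io.parse_single_example(serialized, feature_spec)

parsed = dataset.map(parse)
for record in parsed.take(1):
    print(record)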



from Why can't I parse all my data when parsing byte string from TFRecord file written from spark dataframe?
