Wednesday 17 March 2021

How to efficiently feed data into TensorFlow 2.x

I am working on a data preprocessing task for a large amount of text data and want to load the preprocessed data into TensorFlow 2.x. The preprocessed records contain arrays of integer values, since the preprocessing step generates (a sample record is sketched after this list):

  • a one-hot encoded array as the label column
  • a list of tokens per data row, produced by the tokenizer
  • an activation mask for use in transformers
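
A single preprocessed record would then look something like this (a hypothetical sketch in Python; the field names tokens and mask are illustrative, while label4 matches the column shown further down):

# Hypothetical sample record; 'tokens' and 'mask' are illustrative names.
record = {
    'label4': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # one-hot label (19 classes)
    'tokens': [101, 7592, 2088, 102],  # token ids from the tokenizer
    'mask':   [1, 1, 1, 1],            # activation mask, one entry per token
}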

So, I've been thinking of using PySpark to preprocess the data and dump the result into a JSON file (since CSV cannot store structured data). So far, everything works out OK. But I am having trouble processing the JSON file with tf.data.Dataset (or anything else that scales as efficiently and can interface with TensorFlow 2.x).
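
For reference, the dump step on the PySpark side is a one-liner, sketched here under the assumption that a DataFrame df already holds the array-valued columns (note that Spark writes a directory of JSON-lines part files rather than a single file; df and the path are illustrative):

# Minimal sketch: `df` and the output path are assumptions, not actual code.
# Spark emits a directory of part-*.json files in JSON-lines format.
df.write.format('json').mode('overwrite').save('/output_json')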

I do not want to use/install an additional library (e.g. TensorFlowOnSpark) besides TensorFlow and PySpark, so I am wondering whether it's possible to link the two efficiently via JSON files, since there seems to be no other way of saving/loading records that contain a list of data. The JSON test file looks like this:

# Note: 'header' and 'sep' are CSV reader options; the JSON reader ignores them.
readDF = spark.read.format('json').load('/output.json')
readDF.select('label4').show(15, False)

+---------------------------------------------------------+
|label4                                                   |
+---------------------------------------------------------+
|[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]|
|[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
+---------------------------------------------------------+

So, the label4 column has already been one-hot encoded, and the tokenized text column will look similar once the tokenizer has been applied to it. My question is: can a JSON file be loaded efficiently (maybe via a generator function) with tf.data.Dataset, or should I go down a different road (with an additional library) for this one?
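
For concreteness, the generator-based approach I have in mind would look roughly like the sketch below. It assumes TensorFlow >= 2.4 (for the output_signature argument), a single consolidated JSON-lines file, and the illustrative field names tokens and mask next to the label4 column shown above:

import json
import tensorflow as tf

def record_generator(path):
    # Stream one record at a time so the whole file never sits in memory.
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            yield rec['tokens'], rec['mask'], rec['label4']

dataset = tf.data.Dataset.from_generator(
    lambda: record_generator('/output.json'),  # path is illustrative
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # token ids, variable length
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # activation mask
        tf.TensorSpec(shape=(19,), dtype=tf.int32),    # one-hot label (19 classes)
    ),
)

Variable-length sequences would still need something like dataset.padded_batch(...) before training, which is part of what makes me unsure this approach scales well.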



