I am looking at a data preprocessing task on a large amount of text data and want to load the preprocessed data into TensorFlow 2.x. The preprocessed data contains arrays of integer values, since the preprocessing step generates (a sample record is sketched after this list):
- a one-hot-encoded array as the label column
- a tokenized text column, i.e. a list of token IDs per data row
- an attention mask for use in transformers
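For concreteness, one preprocessed record would then look roughly like this (a minimal sketch with shortened arrays; label4 is the real label column shown further below, while tokens and mask are placeholder names for the tokenizer outputs):

{"label4": [0, 0, 0, 1, 0], "tokens": [101, 2023, 2003, 102], "mask": [1, 1, 1, 1]}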
So, I've been thinking I'll use PySpark to preprocess the data and dump the result into a JSON file (since CSV cannot store structured data). So far, everything works out OK, but I am having trouble processing the JSON file with tf.data.Dataset (or anything else that scales as efficiently and can interface with TensorFlow 2.x). I do not want to use or install an additional library (e.g. TensorFlowOnSpark) besides TensorFlow and PySpark, so I am wondering whether it's possible to link the two in an efficient way using JSON files, since there seems to be no other way to save/load records containing a list of data(?). The JSON test file looks like this:
# 'header' and 'sep' are CSV reader options with no effect on the JSON reader, so they are dropped here
readDF = spark.read.format('json').load('/output.csv')  # contains JSON despite the .csv extension
readDF.select('label4').show(15, False)
+---------------------------------------------------------+
|label4 |
+---------------------------------------------------------+
|[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]|
|[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
+---------------------------------------------------------+
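For reference, the writing side is nothing special; the preprocessed DataFrame is simply dumped as JSON (a minimal sketch, where processedDF stands in for my actual preprocessing result):

processedDF.write.format('json').save('/output.csv')  # Spark writes a directory of JSON part files, one object per line

Note that Spark produces a directory of part files rather than a single file, which is part of what makes the TensorFlow side awkward.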
So, the label4 column has already been one-hot encoded, and the tokenized text column will look similar once the tokenizer has been applied to it. So, my question is: can a JSON file be loaded efficiently (maybe via a generator function) with tf.data.Dataset, or should I go down a different road (with an additional library) for this one?
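To make the question concrete, this is the kind of generator-based loading I have in mind (a minimal sketch assuming the output is JSON Lines, i.e. one object per line as Spark writes it; the tokens/mask field names and the part-file path are placeholders):

import json
import tensorflow as tf

def json_generator(path):
    # Stream one JSON object per line so the file never has to fit in memory
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield record['tokens'], record['mask'], record['label4']

dataset = tf.data.Dataset.from_generator(
    lambda: json_generator('/output.csv/part-00000.json'),  # placeholder part-file name
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # token IDs, variable length
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # attention mask
        tf.TensorSpec(shape=(19,), dtype=tf.int32),    # one-hot label, 19 classes as shown above
    ),
)

(output_signature requires TF >= 2.4; earlier 2.x versions would use output_types/output_shapes instead.) This works, but it parses records one at a time in Python, which is exactly the scalability concern behind the question.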