Tuesday, 4 June 2019

TensorFlow 1.10+: preprocessing TFRecordDataset with subprocess?

Note: I now sort of have a mini-series on TFRecords on Stack Overflow with associated colab notebooks. Relevant to this post are:

In addition, this colab provides template code for a custom tf.estimator, which may be of use in providing an answer.

Suppose you have a delimiter-separated file (e.g. a TSV) in which each line has been converted to a record. A sample of such a file might look like this:

// my_file.tsv
...
banana  2  true  false  false  3
...

where, for the TFRecord, each field has been converted to an appropriate value (int / float / byte string):

// my_file.tsv
...
b'banana'  2.0  1  0  0  3.0
...
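
The conversion above can be sketched in plain Python. This is a minimal illustration, not the actual conversion code: the function name `convert_line` and the meaning assigned to each column are assumptions, and the sample is split on any whitespace so that it works for both the spaced sample and a real tab-separated line.

```python
def convert_line(line):
    """Convert one delimiter-separated line into typed values.

    Assumed (hypothetical) column order:
    fruit, number, ripe, for-smoothie, for-ice-cream, days-old.
    """
    fruit, number, ripe, smoothie, ice_cream, days_old = line.split()

    def as_flag(text):
        # 'true' (any case) becomes 1, everything else becomes 0
        return 1 if text.lower() == 'true' else 0

    return (
        fruit.encode('utf-8'),  # byte string
        float(number),          # float
        as_flag(ripe),          # int
        as_flag(smoothie),      # int
        as_flag(ice_cream),     # int
        float(days_old),        # float
    )


print(convert_line('banana\t2\ttrue\tfalse\tfalse\t3'))
# (b'banana', 2.0, 1, 0, 0, 3.0)
```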

Further, the contents of a line (record) are arguments for a command line process:

# this is all a bad on-the-spot example
some-bash-command --fruit=banana --number=2 --ripe=true --for-smoothie=false --for-ice-cream=false --days-old=3

This command might transform this condensed way of storing data into what the model needs as input.

A Python interface to this command exists using subprocess.Popen, e.g.:

import subprocess
import sys

# clean_command, POPEN_DEFAULTS and error_message are assumed to be
# defined elsewhere (not shown here).

def process(command: list, stdin: str, popen_options: dict = None):
    '''
    Arguments:
        command (list): a list of strings indicating the command and its
            arguments to spawn as a subprocess.

        stdin (str): passed as stdin to the subprocess. Assumed to be utf-8
            encoded.

        popen_options (dict): used to configure the subprocess.Popen command

    Returns:
        stdout, stderr
    '''
    command = clean_command(command)
    popen_config = POPEN_DEFAULTS.copy()
    # None means "no overrides"; this avoids a mutable default argument
    popen_config.update(popen_options or {})
    try:
        pid = subprocess.Popen(args=command, **popen_config)
        stdout_data, stderr_data = pid.communicate(stdin)
    except OSError as err:
        error_message(command, err)
        sys.exit(1)
    if pid.returncode != 0:
        error_message(command, 'pid code {}'.format(pid.returncode), stdout_data, stderr_data)
    return stdout_data, stderr_data

where

command = [
    'some-bash-command',
    '--fruit=banana',
    '--number=2',
    '--ripe=true',
    # ...
]
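
Rather than writing the list by hand, it could be built from a line of the TSV. The following is only a sketch: `FLAG_NAMES` and `build_command` are hypothetical names, and the flag order is assumed to match the column order of the sample line.

```python
# Flag names taken from the example command; the order is an assumption.
FLAG_NAMES = ['fruit', 'number', 'ripe', 'for-smoothie', 'for-ice-cream', 'days-old']


def build_command(line, executable='some-bash-command'):
    """Turn one tab-separated line into an argument list for subprocess."""
    values = line.rstrip('\n').split('\t')
    if len(values) != len(FLAG_NAMES):
        raise ValueError('expected {} fields, got {}'.format(len(FLAG_NAMES), len(values)))
    # One '--name=value' argument per column, in column order
    return [executable] + ['--{}={}'.format(name, value)
                           for name, value in zip(FLAG_NAMES, values)]


command = build_command('banana\t2\ttrue\tfalse\tfalse\t3')
print(command)
```

Passing a list (rather than one shell string) means no shell quoting is needed for the individual values.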

Additionally, some-bash-command has a batched counterpart, some-batched-command: if the lines of the TSV are passed in as a single string, a batched output is provided:

some-batched-command --stringified-input='line\t1\targs\nline\t2\targs\n'
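
As a rough sketch, the argument list for the batched call could be assembled as below. `build_batched_command` is a hypothetical helper, and it is an assumption that the command accepts real tab and newline characters inside `--stringified-input`; the resulting list could then be passed as `command` to the `process` function above.

```python
def build_batched_command(lines, executable='some-batched-command'):
    """Join TSV lines into one newline-terminated string argument."""
    # Normalise each line so the payload ends every record with exactly one '\n'
    payload = ''.join(line.rstrip('\n') + '\n' for line in lines)
    return [executable, '--stringified-input=' + payload]


batched = build_batched_command(['line\t1\targs', 'line\t2\targs'])
print(repr(batched[1]))
# '--stringified-input=line\t1\targs\nline\t2\targs\n'
```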

How would one call this process either in the parse_fn used to read in the TFRecords or after the records have been parsed?

# import stuff, etc

def from_record(record):
    # see previous posts on recovering TFRecords
    return as_tf

# command as described above

def parse_fn(record):
    parsed = from_record(record)

    # I want this to work but it doesn't
    values_i_want = process(command, stdin, popen_options)

    return values_i_want, parsed['labels']


sess = tf.InteractiveSession()
DATASET_FILENAMES = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(DATASET_FILENAMES).map(lambda r:parse_fn(r)).repeat().batch(2)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

OR

sess = tf.InteractiveSession()
DATASET_FILENAMES = tf.placeholder(tf.string, shape=[None])
dataset = (
    tf.data.TFRecordDataset(DATASET_FILENAMES)
    .map(lambda r: parse_fn(r))
    .apply(
        # SOMETHING HERE
    )
    .repeat()
    .batch(2)
)

Thoughts?

To summarize:

  • There is a TSV file in which each line provides the arguments (as well as some of the input) for a function that produces part of the input for a model.
  • This function needs to be called at runtime.
  • The function is a bash command with a Python interface via subprocess.
  • The returned values of the function are Python values (one line for each line of the TSV file).
  • The returned value of the function needs to be joined with the corresponding line of the TSV file.

For example, given the lines
a  1  2
b  3  4

some-batched-command --input="a\t1\t2\nb\t3\t4\n"

might return (changes at runtime)

x  y  3
w  v  8

as a string (so "x\ty\t3\nw\tv\t8\n", which then needs to be parsed, with strings converted to ints, etc.), and then the completed records for the model are:

a  1  2  x  y  3
b  3  4  w  v  8
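
The parse-and-join step can be sketched in plain Python, outside of any TensorFlow graph. The helper names (`coerce`, `parse_rows`, `join_rows`) are hypothetical, and the int conversion is a simple guess at what "convert strings to ints" would involve.

```python
def coerce(text):
    """Return an int when the text looks like one, else the original string."""
    try:
        return int(text)
    except ValueError:
        return text


def parse_rows(text):
    """Split a '\\n'-separated, '\\t'-delimited string into typed rows."""
    return [[coerce(cell) for cell in row.split('\t')]
            for row in text.splitlines() if row]


def join_rows(input_text, output_text):
    """Append each output row to the input row at the same position."""
    inputs, outputs = parse_rows(input_text), parse_rows(output_text)
    if len(inputs) != len(outputs):
        raise ValueError('input and output have different numbers of rows')
    return [left + right for left, right in zip(inputs, outputs)]


print(join_rows('a\t1\t2\nb\t3\t4\n', 'x\ty\t3\nw\tv\t8\n'))
# [['a', 1, 2, 'x', 'y', 3], ['b', 3, 4, 'w', 'v', 8]]
```

This relies on the batched command returning rows in the same order as its input, which is an assumption.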

So the dataset might return a (features, labels) tuple like:

features = {
  "c1": [a, b],
  "c2": [1, 3],
  "c3": [2, 4],
  "c4": [x, w]
}
labels = {
  "l1": [y, v],
  "l2": [3, 8]
}
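
For illustration, completed rows can be regrouped into that column-wise layout as follows. This is a plain-Python sketch of the target structure only, not of how the dataset would produce it; `split_features_labels` is a hypothetical helper, and the column names are taken from the example.

```python
FEATURE_COLUMNS = ['c1', 'c2', 'c3', 'c4']
LABEL_COLUMNS = ['l1', 'l2']


def split_features_labels(rows):
    """Turn row-wise records into column-wise feature and label dicts."""
    names = FEATURE_COLUMNS + LABEL_COLUMNS
    # Column i of every row is collected under the i-th name
    columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}
    features = {name: columns[name] for name in FEATURE_COLUMNS}
    labels = {name: columns[name] for name in LABEL_COLUMNS}
    return features, labels


rows = [['a', 1, 2, 'x', 'y', 3], ['b', 3, 4, 'w', 'v', 8]]
features, labels = split_features_labels(rows)
print(features)
print(labels)
```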



from TensorFlow 1.10+: preprocessing TFRecordDataset with subprocess?
