Hemant Vishwakarma: TensorFlow 1.10+: preprocessing TFRecordDataset with subprocess?

Note: I now sort of have a mini-series on TFRecords on Stack Overflow with associated colab notebooks. Relevant to this post are:

In addition, this colab provides template code for a custom tf.estimator, which may be of use in providing an answer.

So suppose you have some delimiter-separated file (e.g. a TSV) whereby each line has been converted to a record. A sample of such a file might look like this:

// my_file.tsv
...
banana  2  true  false  false  3
...

whereby, for the TFRecord, it has been converted to appropriate values (int / float / byte string):

// my_file.tsv
...
b'banana'  2.0  1  0  0  3.0
...

Further, the contents of a line (record) are arguments for a command line process:

# this is a bad, on-the-spot example
some-bash-command --fruit=banana --number=2 --ripe=true --for-smoothie=false --for-ice-cream=false --days-old=3

This command might transform the condensed representation stored in the record into what the model needs as input.

A Python interface to this command exists using subprocess.Popen, e.g.

import subprocess
import sys

def process(command: list, stdin: str, popen_options: dict = None):
    '''
    Arguments:
        command (list): a list of strings indicating the command and its
            arguments to spawn as a subprocess.

        stdin (str): passed as stdin to the subprocess. Assumed to be utf-8
            encoded.

        popen_options (dict): used to configure the subprocess.Popen command

    Returns:
        stdout, stderr
    '''
    # clean_command, POPEN_DEFAULTS and error_message are helpers not shown in this post
    command = clean_command(command)
    popen_config = POPEN_DEFAULTS.copy()
    # a None default avoids sharing one mutable dict between calls
    popen_config.update(popen_options or {})
    try:
        proc = subprocess.Popen(args=command, **popen_config)
        stdout_data, stderr_data = proc.communicate(stdin)
    except OSError as err:
        error_message(command, err)
        sys.exit(1)
    if proc.returncode != 0:
        error_message(command, 'pid code {}'.format(proc.returncode), stdout_data, stderr_data)
    return stdout_data, stderr_data

where

command = [
    'some-bash-command',
    '--fruit=banana',
    '--number=2',
    '--ripe=true',
    # ...
]
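
A list like the one above could be built from the fields of a parsed line rather than written by hand. The following is only a minimal sketch: build_command is a hypothetical helper, and the flag names in the example call are taken from the sample command earlier in the post.

```python
def build_command(fields, executable="some-bash-command"):
    """Turn a mapping of flag name -> value into an argv-style list.

    `fields` is assumed to preserve insertion order (Python 3.7+ dicts do).
    """
    command = [executable]
    for name, value in fields.items():
        # Booleans are written as lowercase true/false, as in the sample command.
        if isinstance(value, bool):
            value = "true" if value else "false"
        command.append("--{}={}".format(name, value))
    return command


command = build_command({
    "fruit": "banana",
    "number": 2,
    "ripe": True,
    "for-smoothie": False,
    "for-ice-cream": False,
    "days-old": 3,
})
```

The resulting list has the same shape as the command list that process expects.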

Additionally, some-bash-command has a batched counterpart, some-batched-command: if the lines of the TSV are passed in via stdin as a string, a batched output is provided:

some-batched-command --stringified-input='line\t1\targs\nline\t2\targs\n'
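
To make the batched call concrete, here is a rough sketch of serializing rows into such a string and piping it through a subprocess. Since some-batched-command is not available here, a hypothetical stand-in (a Python child process that upper-cases its stdin) takes its place; rows_to_text and run_batched are illustrative names, not part of any existing interface.

```python
import subprocess
import sys


def rows_to_text(rows):
    """Serialize rows as tab-separated lines, each terminated by a newline."""
    return "".join("\t".join(str(v) for v in row) + "\n" for row in rows)


# Hypothetical stand-in for some-batched-command: echoes stdin upper-cased,
# producing one output line per input line.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def run_batched(rows, command=STAND_IN):
    """Send all rows to the command via stdin and return its output lines."""
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # exchange text rather than bytes
    )
    stdout_data, stderr_data = proc.communicate(rows_to_text(rows))
    if proc.returncode != 0:
        raise RuntimeError("command failed: {}".format(stderr_data))
    return stdout_data.splitlines()


rows = [["line", 1, "args"], ["line", 2, "args"]]
batch_input = rows_to_text(rows)
batch_output = run_batched(rows)
```

One subprocess is spawned per batch rather than per record, which is the main appeal of the batched variant.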

How would one call this process either in the parse_fn used to read in the TFRecords or after the records have been parsed?

# import stuff, etc

def from_record(record):
    # see previous posts on recovering TFRecords
    return as_tf

# command as described above

def parse_fn(record):
    parsed = from_record(record)

    # I want this to work but it doesn't
    values_i_want = process(command, stdin, popen_options)

    return values_i_want, parsed['labels']


sess = tf.InteractiveSession()
DATASET_FILENAMES = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(DATASET_FILENAMES).map(lambda r:parse_fn(r)).repeat().batch(2)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# or, after the records have been parsed:
sess = tf.InteractiveSession()
DATASET_FILENAMES = tf.placeholder(tf.string, shape=[None])
dataset = (
    tf.data.TFRecordDataset(DATASET_FILENAMES)
    .map(lambda r: parse_fn(r))
    .apply(
        # SOMETHING HERE
    )
    .repeat()
    .batch(2)
)

Thoughts?

To summarize:

- There is a TSV file in which each line provides the arguments (as well as some of the input) for a function that produces part of the input for a model.
- This function needs to be called at runtime.
- The function is a bash command with a Python interface via subprocess.
- The returned values of the function are Pythonic values (one line for each line of the TSV file).
- The returned values of the function need to be joined with the corresponding line of the TSV file.

For example, given the lines

a  1  2
b  3  4

some-batched-command --input="a\t1\t2\nb\t3\t4\n"

might return (changes at runtime)

x  y  3
w  v  8

as a string (so "x\ty\t3\nw\tv\t8\n", which then needs to be parsed, with strings converted to ints, etc.), and then the completed records for the model are:

a  1  2  x  y  3
b  3  4  w  v  8
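
The join described here can be sketched in plain Python, independent of TensorFlow. The helper names parse_value and join_with_output are hypothetical, and the sketch assumes the command returns exactly one output line per input row, in the same order.

```python
def parse_value(text):
    """Convert a field to int when it looks like one, else keep the string."""
    try:
        return int(text)
    except ValueError:
        return text


def join_with_output(input_rows, output_text):
    """Append the parsed fields of each output line to the matching input row."""
    output_rows = [
        [parse_value(field) for field in line.split("\t")]
        for line in output_text.splitlines()
    ]
    # Assumption: the batched command preserves order and line count.
    if len(output_rows) != len(input_rows):
        raise ValueError("expected one output line per input row")
    return [list(row) + extra for row, extra in zip(input_rows, output_rows)]


completed = join_with_output(
    [["a", 1, 2], ["b", 3, 4]],
    "x\ty\t3\nw\tv\t8\n",
)
```

With the sample values from this post, completed holds the two joined records shown above.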

so the dataset might return a tuple of (features, labels) like

features = {
  "c1": [a, b],
  "c2": [1, 3],
  "c3": [2, 4],
  "c4": [x, w]
}
labels = {
  "l1": [y, v],
  "l2": [3, 8]
}
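
Going from the completed row-oriented records to column-oriented dicts like these is a transpose. A minimal sketch follows, assuming the first four columns are features and the last two are labels, as in the example; to_features_and_labels and the two column tuples are illustrative names.

```python
FEATURE_COLUMNS = ("c1", "c2", "c3", "c4")
LABEL_COLUMNS = ("l1", "l2")


def to_features_and_labels(rows):
    """Transpose row-oriented records into column-oriented dicts."""
    columns = list(zip(*rows))  # one tuple per column
    names = FEATURE_COLUMNS + LABEL_COLUMNS
    by_name = {name: list(column) for name, column in zip(names, columns)}
    features = {name: by_name[name] for name in FEATURE_COLUMNS}
    labels = {name: by_name[name] for name in LABEL_COLUMNS}
    return features, labels


features, labels = to_features_and_labels([
    ["a", 1, 2, "x", "y", 3],
    ["b", 3, 4, "w", "v", 8],
])
```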

Hemant Vishwakarma

Tuesday, 4 June 2019
