Note: I now have a mini-series of Stack Overflow posts on TFRecords, each with an associated Colab notebook, and some of them are relevant to this post.
In addition, this Colab provides template code for a custom tf.estimator, which may be useful when writing an answer.
Suppose you have a delimiter-separated file (e.g. a TSV) in which each line has been converted to a record. A sample of such a file might look like this:
// my_file.tsv
...
banana 2 true false false 3
...
where, for the TFRecord, each field has been converted to an appropriate type (int / float / byte string):
// my_file.tsv
...
b'banana' 2.0 1 0 0 3.0
...
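As a minimal sketch of the conversion described above, assuming the six columns always appear in the order shown in the sample line and that booleans are spelled `true` / `false`. The function name `to_typed_values` is hypothetical and not part of the original code:

```python
def to_typed_values(line, sep="\t"):
    """Convert one TSV line into (bytes, float, int, int, int, float).

    Hypothetical helper: the column order is assumed from the sample line.
    """
    fruit, number, ripe, smoothie, ice_cream, days_old = line.rstrip("\n").split(sep)

    def as_int(flag):
        # "true" becomes 1, anything else becomes 0
        return 1 if flag.strip().lower() == "true" else 0

    return (
        fruit.encode("utf-8"),  # byte string
        float(number),          # float
        as_int(ripe),           # int
        as_int(smoothie),       # int
        as_int(ice_cream),      # int
        float(days_old),        # float
    )


print(to_typed_values("banana\t2\ttrue\tfalse\tfalse\t3"))
# (b'banana', 2.0, 1, 0, 0, 3.0)
```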
Further, the contents of a line (record) are arguments for a command line process:
# this is a deliberately bad, on-the-spot example
some-bash-command --fruit=banana --number=2 --ripe=true --for-smoothie=false --for-ice-cream=false --days-old=3
This function (the command above) might transform the condensed form in which the data is stored into what is needed as model input.
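To make the mapping from a line to the command's arguments concrete, here is a rough sketch. It assumes the TSV columns line up one-to-one with the flags in the example invocation. `FLAG_NAMES` and `build_command` are hypothetical names introduced only for illustration:

```python
# Flag names taken from the example invocation; their order relative to the
# TSV columns is an assumption.
FLAG_NAMES = ["fruit", "number", "ripe", "for-smoothie", "for-ice-cream", "days-old"]


def build_command(line, executable="some-bash-command", sep="\t"):
    """Turn one TSV line into an argument list suitable for subprocess.Popen."""
    values = line.rstrip("\n").split(sep)
    if len(values) != len(FLAG_NAMES):
        raise ValueError(
            "expected {} fields, got {}".format(len(FLAG_NAMES), len(values))
        )
    return [executable] + [
        "--{}={}".format(name, value) for name, value in zip(FLAG_NAMES, values)
    ]


print(build_command("banana\t2\ttrue\tfalse\tfalse\t3"))
```

Building a list rather than a single shell string avoids any quoting issues when the list is handed to `subprocess.Popen`.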
A Python interface to this command exists, built on subprocess.Popen, e.g.
import subprocess
import sys

# clean_command, POPEN_DEFAULTS and error_message are helpers defined elsewhere (not shown).
def process(command: list, stdin: str, popen_options=None):
    '''
    Arguments:
        command (list): a list of strings indicating the command and its
            arguments to spawn as a subprocess.
        stdin (str): passed as stdin to the subprocess. Assumed to be utf-8
            encoded.
        popen_options (dict): used to configure the subprocess.Popen command
    Returns:
        stdout, stderr
    '''
    command = clean_command(command)
    popen_config = POPEN_DEFAULTS.copy()
    # avoid a mutable default argument; None means "no extra options"
    popen_config.update(popen_options or {})
    try:
        pid = subprocess.Popen(args=command, **popen_config)
        stdout_data, stderr_data = pid.communicate(stdin)
    except OSError as err:
        error_message(command, err)
        sys.exit(1)
    if pid.returncode != 0:
        error_message(command, 'pid code {}'.format(pid.returncode), stdout_data, stderr_data)
    return stdout_data, stderr_data
where
command = [
    'some-bash-command',
    '--fruit=banana',
    '--number=2',
    '--ripe=true',
    # ...
]
Additionally, some-bash-command has a batched counterpart, some-batched-command: if the lines of the TSV are passed in as a single string, a batched output is provided:
some-batched-command --stringified-input='line\t1\targs\nline\t2\targs\n'
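The stringified input for the batched call could be assembled along these lines. This is only a sketch: `stringify_batch` and `batched_command` are hypothetical helpers, and it assumes the command expects real tab and newline characters rather than the two-character escape sequences:

```python
def stringify_batch(rows, sep="\t"):
    """Join rows of fields into one string, one newline-terminated line per row."""
    return "".join(sep.join(str(value) for value in row) + "\n" for row in rows)


def batched_command(rows, executable="some-batched-command"):
    """Build the argument list for the batched call (flag name from the example)."""
    return [executable, "--stringified-input=" + stringify_batch(rows)]


rows = [["line", 1, "args"], ["line", 2, "args"]]
print(repr(stringify_batch(rows)))
# 'line\t1\targs\nline\t2\targs\n'
```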
How would one call this process either in the parse_fn used to read in the TFRecords or after the records have been parsed?
# import stuff, etc

def from_record(record):
    # see previous posts on recovering TFRecords
    return as_tf

# command as described above

def parse_fn(record):
    parsed = from_record(record)
    # I want this to work but it doesn't
    values_i_want = process(command, stdin, popen_options)
    return values_i_want, parsed['labels']

sess = tf.InteractiveSession()
DATASET_FILENAMES = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(DATASET_FILENAMES).map(lambda r: parse_fn(r)).repeat().batch(2)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
OR
sess = tf.InteractiveSession()
DATASET_FILENAMES = tf.placeholder(tf.string, shape=[None])
dataset = (
    tf.data.TFRecordDataset(DATASET_FILENAMES)
    .map(lambda r: parse_fn(r))
    .apply(
        # SOMETHING HERE
    )
    .repeat()
    .batch(2)
)
Thoughts?
To summarize:
- There is a TSV file in which each line provides the arguments (as well as some of the input) for a function that produces part of the input for a model.
- This function needs to be called at runtime.
- The function is a bash command with a Python interface via subprocess.
- The function returns Python values (one line for each line of the TSV file).
- The returned value of the function needs to be joined with the corresponding line of the TSV file.
For example, given the TSV lines
a 1 2
b 3 4
the call
some-batched-command --input="a\t1\t2\nb\t3\t4\n"
might return (the output changes at runtime)
x y 3
w v 8
as a string (so "x\ty\t3\nw\tv\t8\n", which then needs to be parsed, with strings converted to ints, etc.), and then the completed records for the model are:
a 1 2 x y 3
b 3 4 w v 8
so the dataset might return a tuple of (features, labels) like
features = {
    "c1": ["a", "b"],
    "c2": [1, 3],
    "c3": [2, 4],
    "c4": ["x", "w"]
}
labels = {
    "l1": ["y", "v"],
    "l2": [3, 8]
}
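Outside of TensorFlow, the parse-and-join step from the summary could look roughly like the sketch below. It assumes the batched command returns exactly one tab-separated line per input line, in the same order, and that the third output column is an integer. `parse_output` and `join_records` are hypothetical names, and this does not address how to call the step from inside a tf.data pipeline:

```python
def parse_output(stdout, sep="\t"):
    """Parse the batched command's stdout into typed rows (str, str, int)."""
    rows = []
    for line in stdout.splitlines():
        if not line:
            continue  # skip blank lines
        x, y, n = line.split(sep)
        rows.append((x, y, int(n)))
    return rows


def join_records(inputs, outputs):
    """Pair each input row with its output row and split into features/labels."""
    if len(inputs) != len(outputs):
        raise ValueError("expected one output row per input row")
    features = {"c1": [], "c2": [], "c3": [], "c4": []}
    labels = {"l1": [], "l2": []}
    for (c1, c2, c3), (c4, l1, l2) in zip(inputs, outputs):
        features["c1"].append(c1)
        features["c2"].append(c2)
        features["c3"].append(c3)
        features["c4"].append(c4)
        labels["l1"].append(l1)
        labels["l2"].append(l2)
    return features, labels


inputs = [("a", 1, 2), ("b", 3, 4)]
features, labels = join_records(inputs, parse_output("x\ty\t3\nw\tv\t8\n"))
print(features)
print(labels)
```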
from TensorFlow 1.10+: preprocessing TFRecordDataset with subprocess?