Friday, 4 September 2020

Batching in tf.data.Dataset in time-series analysis

I'm looking at creating a pipeline for a time-series LSTM model. I have two input feeds; let's call them series1 and series2.

I initialize the tf.data object by calling from_tensor_slices:

ds = tf.data.Dataset.from_tensor_slices((series1, series2))

I then group them into windows of a fixed window size, with a shift of 1 between windows:

ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
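To make the result of this step concrete, here is a minimal sketch that inspects what window yields. It assumes TensorFlow 2.x with eager execution and window_size = 2 (my choice for illustration), and it uses the example series from later in this post. Each element is a pair of nested datasets, one per series, not a pair of tensors.

```python
import tensorflow as tf

# Illustrative values; window_size = 2 is an assumption made for this sketch.
series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]
window_size = 2

ds = tf.data.Dataset.from_tensor_slices((series1, series2))
windows = ds.window(window_size + 1, shift=1, drop_remainder=True)

# Each element of `windows` is a tuple of two small datasets,
# so the contents have to be read by iterating over them.
window_contents = []
for w1, w2 in windows:
    first = [int(x) for x in w1.as_numpy_iterator()]
    second = [int(x) for x in w2.as_numpy_iterator()]
    window_contents.append((first, second))

print(window_contents)
```

With five elements per series, a window length of 3 and a shift of 1, this gives three windows.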

At this point I want to control how the windows are batched together. For example, I want to produce input like the following:

series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]

batch 1: [1, 2, 100, 200]
batch 2: [2, 3, 200, 300]
batch 3: [3, 4, 300, 400]

So each batch will return two elements of series1 and then two elements of series2. This code snippet does not work to batch them separately:

ds = ds.map(lambda s1, s2: (s1.batch(window_size + 1), s2.batch(window_size + 1)))

This fails because the map returns a pair of dataset objects rather than tensors. Dataset objects are not subscriptable, so this does not work either:

ds = ds.map(lambda s1, s2: (s1[:2], s2[:2]))

I'm sure the solution is some utilization of .apply with a custom lambda function. Any help is much appreciated.
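One possible approach, sketched here under assumptions rather than offered as a definitive answer: flatten each pair of nested window datasets into a pair of tensors with flat_map and Dataset.zip, then slice and concatenate them in a regular map. The sketch assumes TensorFlow 2.x and window_size = 2; the names w1 and w2 are mine.

```python
import tensorflow as tf

# Illustrative values; window_size = 2 is an assumption made for this sketch.
series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]
window_size = 2

ds = tf.data.Dataset.from_tensor_slices((series1, series2))
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)

# Batching each nested dataset by the full window length turns it into a
# single tensor; zip pairs the two tensors, and flat_map unwraps the result.
ds = ds.flat_map(
    lambda s1, s2: tf.data.Dataset.zip((
        s1.batch(window_size + 1, drop_remainder=True),
        s2.batch(window_size + 1, drop_remainder=True),
    ))
)

# Now w1 and w2 are tensors, so slicing works. Keep the first window_size
# values of each series and join them into one input vector.
ds = ds.map(lambda w1, w2: tf.concat([w1[:window_size], w2[:window_size]], axis=0))

batches = [b.tolist() for b in ds.as_numpy_iterator()]
print(batches)
# [[1, 2, 100, 200], [2, 3, 200, 300], [3, 4, 300, 400]]
```

The key point is that slicing only becomes possible after flat_map, because that is the step that replaces the nested datasets with tensors.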

Edit

I am also looking at producing a label that represents the next element of the series. So for example, the batches will produce the following:

batch 1: (tf.Tensor([1, 2, 100, 200]), tf.Tensor([3]))
batch 2: (tf.Tensor([2, 3, 200, 300]), tf.Tensor([4]))
batch 3: (tf.Tensor([3, 4, 300, 400]), tf.Tensor([5]))

Where [3], [4] and [5] represent the next elements of series1 to be predicted.
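A minimal sketch of how the labelled version might be produced, assuming TensorFlow 2.x and window_size = 2. The helper name split_inputs_and_label is hypothetical. Each window holds window_size + 1 values per series, so the last value of the series1 window can serve as the label.

```python
import tensorflow as tf

# Illustrative values; window_size = 2 is an assumption made for this sketch.
series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]
window_size = 2

def split_inputs_and_label(w1, w2):
    # Hypothetical helper: the first window_size values of both series form
    # the input, and the last value of the series1 window is the label.
    inputs = tf.concat([w1[:-1], w2[:-1]], axis=0)
    label = w1[-1:]
    return inputs, label

ds = tf.data.Dataset.from_tensor_slices((series1, series2))
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)

# Convert each pair of nested window datasets into a pair of tensors.
ds = ds.flat_map(
    lambda s1, s2: tf.data.Dataset.zip((
        s1.batch(window_size + 1, drop_remainder=True),
        s2.batch(window_size + 1, drop_remainder=True),
    ))
)
ds = ds.map(split_inputs_and_label)

pairs = [(x.tolist(), y.tolist()) for x, y in ds.as_numpy_iterator()]
print(pairs)
# [([1, 2, 100, 200], [3]), ([2, 3, 200, 300], [4]), ([3, 4, 300, 400], [5])]
```

A further call such as ds.batch(32) could then group these (input, label) pairs into training batches; the batch size there is only an example.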
