Saturday, 25 March 2023

Keras: time per step increases with a filter on the number of samples, epoch time continues the same

I'm implementing a simple sanity-check model in Keras for some data I have. My training dataset comprises about 550 files, each contributing about 150 samples. Each training sample has the following signature:

({'input_a': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None),
  'input_b': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None)},
   TensorSpec(shape=(None, 1), dtype=tf.int64, name=None)
)

Essentially, each training sample is made up of two inputs with shape (900, 1), and the target is a single (binary) label. The first step of my model is a concatenation of inputs into a (900, 2) Tensor.
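For context, a minimal sketch of the kind of two-input model this corresponds to might look like the following (the input names 'input_a'/'input_b' match the signature above; the Flatten/Dense head is just a placeholder, not my actual architecture):

import tensorflow as tf
from tensorflow import keras

# Two inputs of shape (900, 1), concatenated along the channel axis into (900, 2).
input_a = keras.Input(shape=(900, 1), name='input_a')
input_b = keras.Input(shape=(900, 1), name='input_b')

x = keras.layers.Concatenate(axis=-1)([input_a, input_b])

# Placeholder head: flatten and predict a single binary label.
x = keras.layers.Flatten()(x)
output = keras.layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs={'input_a': input_a, 'input_b': input_b}, outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')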

The total number of training samples is about 70000.

As input to the model, I'm creating a tf.data.Dataset and applying a few preparation steps (a rough sketch of the pipeline follows the list):

  1. tf.Dataset.filter: to filter some samples with invalid labels
  2. tf.Dataset.shuffle
  3. tf.Dataset.filter: to undersample my training dataset
  4. tf.Dataset.batch
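Roughly, the pipeline looks like this (raw_dataset, the label-validity check and the shuffle buffer size are placeholders for my actual code; the undersampling function is shown below):

# Rough sketch of the input pipeline; placeholders marked in the comments.
dataset = (
    raw_dataset                                  # elements: ({'input_a': ..., 'input_b': ...}, y)
    .filter(lambda x, y: tf.reduce_all(y >= 0))  # 1. drop samples with invalid labels (placeholder check)
    .shuffle(buffer_size=10000)                  # 2. shuffle
)
dataset = undersampling(dataset, drop_proba=[0.9, 0.9])  # 3. undersample (function defined below)
dataset = dataset.batch(20000)                           # 4. batch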

Step 3 is the most important one for my question. To undersample my dataset I apply a simple function:

from typing import Iterable

import tensorflow as tf


def undersampling(dataset: tf.data.Dataset, drop_proba: Iterable[float]) -> tf.data.Dataset:
    def undersample_function(x, y):
        # drop_proba[l] is the probability of dropping a sample with label l
        drop_prob_ = tf.constant(list(drop_proba), dtype=tf.float32)

        # y has shape (1,) before batching, so y[0] is the scalar label
        idx = y[0]

        p = drop_prob_[idx]
        v = tf.random.uniform(shape=(), dtype=tf.float32)

        # keep the sample only if the random draw is >= its drop probability
        return tf.math.greater_equal(v, p)

    return dataset.filter(undersample_function)

Essentially, the function accepts a vector of probabilities drop_proba such that drop_proba[l] is the probability of dropping a sample with label l (the function is a bit convoluted, but it's the way I found to implement it as a Dataset.filter). Using equal probabilities, say drop_proba=[0.9, 0.9], I'll be dropping about 90% of my samples.
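As a quick sanity check of that retention rate, here is a toy sketch with dummy data (not part of the real pipeline):

# Toy check: 10000 dummy samples, half with label 0 and half with label 1.
labels = tf.constant([[0]] * 5000 + [[1]] * 5000, dtype=tf.int64)
toy = tf.data.Dataset.from_tensor_slices((tf.zeros((10000, 1)), labels))

kept = undersampling(toy, drop_proba=[0.9, 0.9]).reduce(0, lambda n, _: n + 1)
print(kept.numpy())  # roughly 1000, i.e. about 10% of the samples survive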

Now, the thing is, I've been experimenting with different undersampling rates for my dataset, in order to find a sweet spot between performance and training time, but when I undersample, the epoch duration stays the same, with the time per step increasing instead.

Keeping my batch_size fixed at 20000, for the complete dataset I have a total of 4 batches, and the following time for an average epoch:

Epoch 4/1000
1/4 [======>.......................] - ETA: 9s
2/4 [==============>...............] - ETA: 5s
3/4 [=====================>........] - ETA: 2s
4/4 [==============================] - ETA: 0s
4/4 [==============================] - 21s 6s/step

Whereas if I undersample my dataset with drop_proba = [0.9, 0.9] (that is, getting rid of about 90% of the dataset), keeping the same batch_size of 20000, I have 1 batch, and the following time for an average epoch:

Epoch 4/1000
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 22s 22s/step 

Notice that while the number of batches is only 1, the epoch time is the same! It just takes longer to process the batch.
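For reference, one way to check how much of that time is spent in the input pipeline itself, rather than in the model, would be to time a plain iteration over the batched dataset, without calling fit at all (diagnostic sketch only):

import time

# Iterate over the batched dataset once, with no model involved,
# to measure the cost of the input pipeline alone.
start = time.perf_counter()
n_batches = sum(1 for _ in dataset)
print(f"{n_batches} batches in {time.perf_counter() - start:.1f}s")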

Now, as a sanity check, I tried a different way of undersampling: filtering the files instead. I selected about 55 of the training files (10%), to get a similar number of samples in a single batch, and removed the undersampling filter from the tf.Dataset. The epoch time decreases as expected:

Epoch 4/1000
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 2s 2s/step 
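The file-level selection amounts to something like the following sketch (all_files and make_dataset are placeholders for my actual loading code, and the random sampling is just one way of picking the 10%):

import random

# Sketch: keep ~10% of the ~550 training files and build the dataset from those,
# with no undersampling filter in the pipeline this time.
subset_files = random.sample(all_files, k=len(all_files) // 10)  # ~55 files
dataset = make_dataset(subset_files).batch(20000)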

Note that the original dataset has 70014 training samples, while the dataset undersampled via tf.Dataset.filter had 6995 samples and the dataset undersampled via file filtering had 7018 samples, so the numbers are comparable.
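(For reference, Dataset.filter leaves the dataset with unknown cardinality, so counts like these have to be obtained by iterating over the unbatched dataset, e.g.:)

# After .filter() the cardinality is unknown, so count samples by iterating.
print(dataset.cardinality())         # equals tf.data.UNKNOWN_CARDINALITY after a filter
n_samples = sum(1 for _ in dataset)  # count over the unbatched dataset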

Much faster: in fact, it takes about 10% of the time an epoch takes with the full dataset. So there seems to be an issue with the way I'm performing undersampling (using tf.data.Dataset.filter) when creating the tf.Dataset, and I would like some help figuring out what the issue is. Thanks.


