Saturday, 28 July 2018

Why is TensorFlow's `tf.data` package slowing down my code?

I'm just learning to use TensorFlow's tf.data API, and I've found that it is slowing my code down a lot, measured in time per epoch. This is the opposite of what it's supposed to do, I thought. I wrote a simple linear regression program to test it out.

Tl;Dr: With 100,000 training data, tf.data slows time per epoch down by about a factor of ten, if you're using full batch training. Worse if you use smaller batches. The opposite is true with 500 training data.

My question: What is going on? Is my implementation flawed? Other sources I've read have tf.data improving speeds by about 30%.

import tensorflow as tf 
import numpy as np
import timeit

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.logging.set_verbosity(tf.logging.ERROR)

n_epochs = 10
input_dimensions_list = [10]

def function_to_approximate(x):
    return np.dot(x, random_covector).astype(np.float32) + np.float32(.01) * np.random.randn(1,1).astype(np.float32)

def regress_without_tfData(n_epochs, input_dimension, training_inputs, training_labels):
    tf.reset_default_graph()
    weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))

    X = tf.placeholder(tf.float32, shape=(None, input_dimension), name='X')
    Y = tf.placeholder(tf.float32, shape=(None, 1), name='Y')
    prediction = tf.matmul(X,weights)
    loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
    loss_op = tf.train.AdamOptimizer(.01).minimize(loss)

    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        sess.run(init)
        for _ in range(n_epochs):
            sess.run(loss_op, feed_dict={X: training_inputs, Y:training_labels})

def regress_with_tfData(n_epochs, input_dimension, training_inputs, training_labels, batch_size):
    tf.reset_default_graph()
    weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))

    data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
    data_set = data_set.repeat(n_epochs)
    data_set = data_set.batch(batch_size)
    X,Y = data_set.make_one_shot_iterator().get_next()

    prediction = tf.matmul(X, weights)
    loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
    loss_op = tf.train.AdamOptimizer(.01).minimize(loss)

    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        sess.run(init)
        while True:
            try: 
                sess.run(loss_op)
            except tf.errors.OutOfRangeError:
                break

for input_dimension in input_dimensions_list:
    for data_size in [500, 100000]:

        training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
        random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
        training_labels = function_to_approximate(training_inputs)

        print("Not using tf.data, with data size "
        "{}, input dimension {} and training with "
        "a full batch, it took an average of "
        "{} seconds to run {} epochs.\n".
            format(
                data_size,
                input_dimension,
                timeit.timeit(
                    lambda: regress_without_tfData(
                        n_epochs, input_dimension, 
                        training_inputs, training_labels
                    ), 
                    number=3
                ),
                n_epochs))

for input_dimension in input_dimensions_list:
    for data_size, batch_size in [(500, 50), (500, 500), (100000, 50), (100000, 100000)]:

        training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
        random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
        training_labels = function_to_approximate(training_inputs)

        print("Using tf.data, with data size "
        "{}, and input dimension {}, and training with "
        "batch size {}, it took an average of {} seconds "
        "to run {} epochs.\n".
            format(
                data_size,
                input_dimension,
                batch_size,
                timeit.timeit(
                    lambda: regress_with_tfData(
                        n_epochs, input_dimension, 
                        training_inputs, training_labels, 
                        batch_size
                    ),
                    number=3
                )/3,
                n_epochs
            ))

This outputs for me:

Not using tf.data, with data size 500, input dimension 10 and training with a full batch, it took an average of 0.2110292259894777 seconds to run 10 epochs.

Not using tf.data, with data size 100000, input dimension 10 and training with a full batch, it took an average of 0.29116983599669766 seconds to run 10 epochs.

Using tf.data, with data size 500, and input dimension 10, and training with batch size 50, it took an average of 0.08823349800271292 seconds to run 10 epochs.

Using tf.data, with data size 500, and input dimension 10, and training with batch size 500, it took an average of 0.07817019966993637 seconds to run 10 epochs.

Using tf.data, with data size 100000, and input dimension 10, and training with batch size 50, it took an average of 4.450352532333151 seconds to run 10 epochs.

Using tf.data, with data size 100000, and input dimension 10, and training with batch size 100000, it took an average of 2.0855311449995497 seconds to run 10 epochs.



from Why is TensorFlow's `tf.data` package slowing down my code?

No comments:

Post a Comment