Friday 13 November 2020

How to correctly configure dataset pipelines with tensorflow.data.experimental.make_csv_dataset

I have a structured dataset (CSV feature files) of around 200 GB. I'm using tf.data.experimental.make_csv_dataset to build the input pipeline. Here is my code:

import tensorflow as tf
from tensorflow.keras import layers

def pack_features_vector(features, labels):
    """Pack the features into a single array."""
    features = tf.stack(list(features.values()), axis=1)
    return features, labels

def main():
    defaults = [float()] * len(selected_columns)
    data_set = tf.data.experimental.make_csv_dataset(
        file_pattern="./../path-to-dataset/Train_DS/*/*.csv",
        column_names=all_columns,          # all_columns = ["col1", "col2", ...]
        select_columns=selected_columns,   # selected_columns = a subset of all_columns
        column_defaults=defaults,
        label_name="Target",
        batch_size=1000,
        num_epochs=20,
        num_parallel_reads=50,
    #   shuffle_buffer_size=10000,
        ignore_errors=True)

    data_set = data_set.map(pack_features_vector)

    N_VALIDATION = int(1e3)
    N_TRAIN= int(1e4)
    BUFFER_SIZE = int(1e4)
    BATCH_SIZE = 1000
    STEPS_PER_EPOCH = N_TRAIN//BATCH_SIZE

    validate_ds = data_set.take(N_VALIDATION).cache().repeat()
    train_ds = data_set.skip(N_VALIDATION).take(N_TRAIN).cache().repeat()

    # validate_ds = validate_ds.batch(BATCH_SIZE)
    # train_ds = train_ds.batch(BATCH_SIZE)

    model = tf.keras.Sequential([
        layers.Flatten(),
        layers.Dense(256, activation='elu'),
        layers.Dense(256, activation='elu'),
        layers.Dense(128, activation='elu'),
        layers.Dense(64, activation='elu'),
        layers.Dense(32, activation='elu'),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    model.fit(train_ds,
              validation_data=validate_ds,
              validation_steps=1,
              steps_per_epoch=1,
              epochs=20,
              verbose=1)
if __name__ == "__main__":
    main()

print('Training completed!')


Now, when I execute this code, it completes within a few minutes (I suspect it is not going through the whole training data) and emits the following warning:

W tensorflow/core/kernels/data/cache_dataset_ops.cc:798] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.

Going by this warning, and by the fact that training finishes in a few minutes, the input pipeline does not seem to be configured correctly. Can anyone please guide me on how to correct this? The GPU on my system is an NVIDIA Quadro RTX 6000 (compute capability 7.5).
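For context, here is a tiny toy sketch (using tf.data.Dataset.range, not my actual data) of the ordering that the warning recommends:

    import tensorflow as tf

    # Toy illustration of the ordering the warning suggests: take(k) first,
    # then cache(), then repeat(), so the cached subset is read in full
    # before being repeated.
    ds = tf.data.Dataset.range(10)
    good = ds.take(3).cache().repeat(2)    # caches exactly the 3 taken elements
    # bad = ds.cache().take(3).repeat(2)   # partially read cache gets discarded -> the warning above
    print(list(good.as_numpy_iterator()))  # [0, 1, 2, 0, 1, 2]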

A solution based on some other function, like tf.data.experimental.CsvDataset, would work as well; I sketch roughly what I have in mind below. Thanks!
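Here is that rough sketch. It is only an illustration: selected_col_indices is a hypothetical list of integer column indices, and it assumes every selected column is numeric, with the label as the last selected column.

    import tensorflow as tf

    # Rough sketch of an alternative reader based on tf.data.experimental.CsvDataset.
    files = tf.io.gfile.glob("./../path-to-dataset/Train_DS/*/*.csv")

    csv_ds = tf.data.experimental.CsvDataset(
        files,
        record_defaults=[tf.float32] * len(selected_col_indices),  # hypothetical column indices
        header=True,
        select_cols=selected_col_indices)

    def to_features_and_label(*columns):
        # CsvDataset yields one scalar tensor per selected column;
        # stack all but the last into the feature vector.
        return tf.stack(columns[:-1]), columns[-1]

    csv_ds = (csv_ds
              .map(to_features_and_label,
                   num_parallel_calls=tf.data.experimental.AUTOTUNE)
              .batch(1000)
              .prefetch(tf.data.experimental.AUTOTUNE))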

Edit

That warning goes away after changing the code to avoid any caching:

    validate_ds = data_set.take(N_VALIDATION).repeat()
    train_ds = data_set.skip(N_VALIDATION).take(N_TRAIN).repeat()

But now the problem is that I'm getting zero accuracy, even on the training data, which I think is a problem with the input pipeline. Here is the output:

[Screenshot of the training output showing 0.0000 accuracy]
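To show what I mean by an input-pipeline problem, a quick way to inspect what the model receives would be something like this (a sketch only, assuming train_ds as defined above):

    # Sanity-check sketch: take a single batch from the pipeline and print
    # the feature/label shapes and the distinct label values.
    for features, labels in train_ds.take(1):
        print("features:", features.shape, features.dtype)
        print("labels:  ", labels.shape, labels.dtype)
        print("distinct labels:", tf.unique(tf.reshape(labels, [-1]))[0].numpy())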



