Tuesday, 19 September 2023

GPU ran out of memory. How can I invoke the garbage collector to clean the GPU memory at each combination of hyperparameters in GridSearchCV?

I am training my model on a remote server using the GridSearchCV API.

Unfortunately, while tuning the hyperparameters (after a few iterations), I get the following error:

Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0
to /job:localhost/replica:0/task:0/device:GPU:0
in order to run _EagerConst: Dst tensor is not initialized.

It seems that the server's GPU memory is not enough, and the error is raised when the GPU memory is full.

For this reason I tried reducing the batch_size to 2, 4, 8 and 16, but the error persists and I now get:

W tensorflow/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran 
out of memory trying to allocate 1.17GiB (rounded to 1258291200) requested 
by op _EagerConst
If the cause is memory fragmentation maybe the environment variable
'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation

Thus I set os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async' as suggested, but the problem persists.
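
One doubt I have about this: as far as I know, TF_GPU_ALLOCATOR is only read when TensorFlow initializes the GPU devices, so setting it after TensorFlow has already been imported (and the GPUs touched) may simply have no effect. This is where I believe it has to go, at the very top of the entry script:

import os

# The allocator choice is read when TensorFlow initializes the GPU,
# so it has to be set before the first `import tensorflow` runs.
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

import tensorflow as tf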

Nevertheless, the issue does seem to be solved if I reduce the data set size, but I need to train on the whole data set. Looking at the logs again, the failing op is _EagerConst and the requested allocation (1.17 GiB) is far larger than any single batch, so I suspect the whole training array is being materialized on the GPU as one constant, which would explain why the batch size makes no difference while the data set size does.
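
If that guess is right, keeping the arrays on the CPU and letting fit stream batches to the GPU might avoid the giant constant. A sketch of what I have in mind (untested; assumes the arrays fit in host memory):

import tensorflow as tf

def make_dataset(x, batch_size):
    # Build the dataset on the CPU so the full array is not copied
    # to the GPU as one big constant; batches are transferred one
    # at a time during training.
    with tf.device('/CPU:0'):
        dataset = tf.data.Dataset.from_tensor_slices(x)
    return (dataset.map(lambda img: (img, img))
                   .batch(batch_size)
                   .prefetch(tf.data.AUTOTUNE))

The call would then become vae.fit(make_dataset(x_train, batch_size), ...) with no batch_size argument, since the dataset is already batched. I have not tried this yet.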

In order to handle this problem, my key ideas are:

  1. Prevent a new model (and the related loss/training objects) from being created for every combination. This would be the optimal solution: always reuse the same model, obviously ensuring that it is "reset" before each new combination of hyperparameters. It is probably also the most complicated option, since I don't know whether the libraries I chose allow it.
  2. Verify that the problem is caused by the data rather than the model, i.e. that the same data is not reallocated for each combination while the old copies stay in GPU memory. The fix for this would be simpler than the previous one, but I see it as a less probable cause; in any case it should be checked (see the diagnostic sketch right after this list).
  3. Free the memory at each combination of hyperparameters by invoking the garbage collector (I don't know whether it works on the GPU too). This is the easiest thing to try, but it won't necessarily work: if the libraries keep references to objects that are no longer used, the garbage collector cannot reclaim them.
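
To check idea 2 (and whether idea 3 helps at all), my plan is to log the GPU memory before and after every fit. A minimal diagnostic, assuming a recent TF 2.x and that everything runs on the first GPU:

import tensorflow as tf

def log_gpu_memory(tag):
    # 'current' is the memory in use right now, 'peak' the high-water
    # mark; both in bytes.
    info = tf.config.experimental.get_memory_info('GPU:0')
    print(f"[{tag}] current={info['current'] / 2 ** 20:.1f} MiB, "
          f"peak={info['peak'] / 2 ** 20:.1f} MiB")

I would call it at the start and end of each fit; if 'current' keeps climbing across combinations even though the previous models should be garbage, that would confirm the leak.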

Also, with the tensorflow backend the current model is not destroyed, so I need to clear the session.
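
So, for idea 3, the cleanup I would run between combinations looks like this (whether GridSearchCV gives me a hook to call it at the right moment is exactly my question):

import gc
from keras import backend as K

def free_memory():
    # Drop Keras' global state (graphs, cached ops), then let Python
    # collect whatever is no longer referenced.
    K.clear_session()
    gc.collect()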

These are the involved functions:

def grid_search_vae(x_train, latent_dimension):
    param_grid = {
        'epochs': [2500],
        'l_rate': [10 ** -4, 10 ** -5, 10 ** -6, 10 ** -7],
        'batch_size': [32, 64],  # [2, 4, 8, 16] won't fix the issue
        'patience': [30]
    }

    ssim_scorer = make_scorer(my_ssim, greater_is_better=True)

    grid = GridSearchCV(
        VAEWrapper(encoder=Encoder(latent_dimension), decoder=Decoder()),
        param_grid, scoring=ssim_scorer, cv=5, refit=False
    )

    grid.fit(x_train, x_train)
    return grid

def refit(fitted_grid, x_train, y_train, latent_dimension):    
    best_epochs = fitted_grid.best_params_["epochs"]
    best_l_rate = fitted_grid.best_params_["l_rate"]
    best_batch_size = fitted_grid.best_params_["batch_size"]
    best_patience = fitted_grid.best_params_["patience"]

    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2)

    encoder = Encoder(latent_dimension)
    decoder = Decoder()
    vae = VAE(encoder, decoder, best_epochs, best_l_rate, best_batch_size)
    vae.compile(Adam(best_l_rate))

    early_stopping = EarlyStopping("val_loss", patience=best_patience)
    history = vae.fit(x_train, x_train, batch_size=best_batch_size, epochs=best_epochs,
                      validation_data=(x_val, x_val), callbacks=[early_stopping])
    return history, vae

While this is the main code:

if __name__ == '__main__':
    x_train, x_test, y_train, y_test = load_data("data", "labels")

    # Reducing data set size will fix the issue 
    # new_size = 200
    # x_train, y_train = reduce_size(x_train, y_train, new_size)
    # x_test, y_test = reduce_size(x_test, y_test, new_size)

    latent_dimension = 25 
    grid = grid_search_vae(x_train, latent_dimension)
    history, vae = refit(grid, x_train, y_train, latent_dimension)
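
A fallback I am considering, if nothing else works: skip GridSearchCV's internal loop and iterate over the parameter grid myself, running each fit in its own spawned process, so that the GPU memory is guaranteed to be released when the process exits. A rough sketch (untested; run_one_fit is a hypothetical helper that would build, fit and score one configuration):

from multiprocessing import get_context

from sklearn.model_selection import ParameterGrid

def run_all(param_grid, x_train):
    ctx = get_context('spawn')  # fresh CUDA context in every child
    for params in ParameterGrid(param_grid):
        # Each child imports TF, trains and scores one configuration,
        # then exits, which releases all of its GPU memory.
        p = ctx.Process(target=run_one_fit, args=(params, x_train))
        p.start()
        p.join()

I would of course have to re-implement the cross-validation and scoring that GridSearchCV does for me.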

Can you help me?

If you need it, these are the GPUs:

2023-09-18 11:21:25.628286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 7347 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1
2023-09-18 11:21:25.629120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 7371 MB memory:  -> device: 1, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1
2023-09-18 11:21:31.911969: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600

and I am using TensorFlow as the Keras backend, that is:

from keras import backend as K
K.backend()  # 'tensorflow'

I also tried to add:

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

at the beginning of the main code (as the very first instructions), but this didn't help.
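
As far as I understand, set_memory_growth only stops TensorFlow from reserving the whole GPU up front; it does not free memory that has already been allocated, so it cannot help if allocations accumulate across combinations. For completeness, the other knob I found is a hard per-process cap, which would presumably just make the OOM happen sooner rather than fix it (the 4096 MB value is an arbitrary choice for my 8 GB cards):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Must run before the GPUs are initialized.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)]
    )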

If you need the code for models, here it is:

import numpy as np
import tensorflow as tf
from keras.initializers import he_uniform
from keras.layers import Conv2DTranspose, BatchNormalization, Reshape, Dense, Conv2D, Flatten
from keras.optimizers.legacy import Adam
from keras.callbacks import EarlyStopping
from skimage.metrics import structural_similarity as ssim
from sklearn.base import BaseEstimator
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import train_test_split, GridSearchCV
from tensorflow import keras

class VAEWrapper:
    def __init__(self, **kwargs):
        self.vae = VAE(**kwargs)
        self.vae.compile(Adam())

    def fit(self, x, y, **kwargs):
        self.vae.fit(x, y, **kwargs)

    def get_config(self):
        return self.vae.get_config()

    def get_params(self, deep=True):
        return self.vae.get_params(deep)

    def set_params(self, **params):
        return self.vae.set_params(**params)
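
Since GridSearchCV re-creates the estimator for every combination through get_params/set_params, the wrapper's fit looks like the only place where I can hook the cleanup from above. A sketch of the change (untested; free_memory is the helper defined earlier, and the Keras docs suggest clearing the session before a model is built, so this may still be too late):

    def fit(self, x, y, **kwargs):
        # Clear whatever the previous combination left behind before
        # this clone starts training.
        free_memory()
        self.vae.fit(x, y, **kwargs)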


class VAE(keras.Model, BaseEstimator):
    def __init__(self, encoder, decoder, epochs=None, l_rate=None, batch_size=None, patience=None, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.epochs = epochs  # For grid search
        self.l_rate = l_rate  # For grid search
        self.batch_size = batch_size  # For grid search
        self.patience = patience  # For grid search
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
        self.reconstruction_loss_tracker = keras.metrics.Mean(name="reconstruction_loss")
        self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")

    def call(self, inputs, training=None, mask=None):
        _, _, z = self.encoder(inputs)
        outputs = self.decoder(z)
        return outputs

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

    def train_step(self, data):
        data, labels = data
        with tf.GradientTape() as tape:
            # Forward pass
            z_mean, z_log_var, z = self.encoder(data)
            reconstruction = self.decoder(z)

            # Compute losses
            reconstruction_loss = tf.reduce_mean(
                tf.reduce_sum(
                    keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
                )
            )
            kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
            kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
            total_loss = reconstruction_loss + kl_loss

        # Compute gradient
        grads = tape.gradient(total_loss, self.trainable_weights)

        # Update weights
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))

        # Update metrics
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)

        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

    def test_step(self, data):
        data, labels = data
        # Forward pass
        z_mean, z_log_var, z = self.encoder(data)
        reconstruction = self.decoder(z)

        # Compute losses
        reconstruction_loss = tf.reduce_mean(
            tf.reduce_sum(
                keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
            )
        )
        kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
        kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
        total_loss = reconstruction_loss + kl_loss

        # Update metrics
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)

        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }


@keras.saving.register_keras_serializable()
class Encoder(keras.layers.Layer):
    def __init__(self, latent_dimension):
        super(Encoder, self).__init__()
        self.latent_dim = latent_dimension

        seed = 42

        self.conv1 = Conv2D(filters=64, kernel_size=3, activation="relu", strides=2, padding="same",
                            kernel_initializer=he_uniform(seed))
        self.bn1 = BatchNormalization()

        self.conv2 = Conv2D(filters=128, kernel_size=3, activation="relu", strides=2, padding="same",
                            kernel_initializer=he_uniform(seed))
        self.bn2 = BatchNormalization()

        self.conv3 = Conv2D(filters=256, kernel_size=3, activation="relu", strides=2, padding="same",
                            kernel_initializer=he_uniform(seed))
        self.bn3 = BatchNormalization()

        self.flatten = Flatten()
        self.dense = Dense(units=100, activation="relu")

        self.z_mean = Dense(latent_dimension, name="z_mean")
        self.z_log_var = Dense(latent_dimension, name="z_log_var")

        self.sampling = sample

    def call(self, inputs, training=None, mask=None):
        x = self.conv1(inputs)
        x = self.bn1(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.conv3(x)
        x = self.bn3(x)
        x = self.flatten(x)
        x = self.dense(x)
        z_mean = self.z_mean(x)
        z_log_var = self.z_log_var(x)
        z = self.sampling(z_mean, z_log_var)
        return z_mean, z_log_var, z


@keras.saving.register_keras_serializable()
class Decoder(keras.layers.Layer):
    def __init__(self):
        super(Decoder, self).__init__()
        self.dense1 = Dense(units=4096, activation="relu")
        self.bn1 = BatchNormalization()

        self.dense2 = Dense(units=1024, activation="relu")
        self.bn2 = BatchNormalization()

        self.dense3 = Dense(units=4096, activation="relu")
        self.bn3 = BatchNormalization()

        seed = 42

        self.reshape = Reshape((4, 4, 256))
        self.deconv1 = Conv2DTranspose(filters=256, kernel_size=3, activation="relu", strides=2, padding="same",
                                       kernel_initializer=he_uniform(seed))
        self.bn4 = BatchNormalization()

        self.deconv2 = Conv2DTranspose(filters=128, kernel_size=3, activation="relu", strides=1, padding="same",
                                       kernel_initializer=he_uniform(seed))
        self.bn5 = BatchNormalization()

        self.deconv3 = Conv2DTranspose(filters=128, kernel_size=3, activation="relu", strides=2, padding="valid",
                                       kernel_initializer=he_uniform(seed))
        self.bn6 = BatchNormalization()

        self.deconv4 = Conv2DTranspose(filters=64, kernel_size=3, activation="relu", strides=1, padding="valid",
                                       kernel_initializer=he_uniform(seed))
        self.bn7 = BatchNormalization()

        self.deconv5 = Conv2DTranspose(filters=64, kernel_size=3, activation="relu", strides=2, padding="valid",
                                       kernel_initializer=he_uniform(seed))
        self.bn8 = BatchNormalization()

        self.deconv6 = Conv2DTranspose(filters=1, kernel_size=2, activation="sigmoid", padding="valid",
                                       kernel_initializer=he_uniform(seed))

    def call(self, inputs, training=None, mask=None):
        x = self.dense1(inputs)
        x = self.bn1(x)
        x = self.dense2(x)
        x = self.bn2(x)
        x = self.dense3(x)
        x = self.bn3(x)
        x = self.reshape(x)
        x = self.deconv1(x)
        x = self.bn4(x)
        x = self.deconv2(x)
        x = self.bn5(x)
        x = self.deconv3(x)
        x = self.bn6(x)
        x = self.deconv4(x)
        x = self.bn7(x)
        x = self.deconv5(x)
        x = self.bn8(x)
        decoder_outputs = self.deconv6(x)
        return decoder_outputs
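
(The sample referenced by the Encoder is omitted above; it is the usual reparameterization trick. A version consistent with how it is called, in case it matters:)

def sample(z_mean, z_log_var):
    # Reparameterization trick: draw z ~ N(z_mean, exp(z_log_var))
    # via a standard-normal epsilon so gradients flow through.
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon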

