I am training my model on a remote server using the GridSearchCV API.
Unfortunately, while tuning the hyperparameters (after a few iterations) I get the following error:
Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0
to /job:localhost/replica:0/task:0/device:GPU:0
in order to run _EagerConst: Dst tensor is not initialized.
It seems that the GPU memory of the server is not sufficient, and this error is raised when the GPU memory fills up.
For this reason I tried reducing the batch_size to 2, 4, 8 and 16, but the error persists, since I then get:
W tensorflow/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran
out of memory trying to allocate 1.17GiB (rounded to 1258291200) requested
by op _EagerConst
If the cause is memory fragmentation maybe the environment variable
'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation
So I set os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async' as suggested, but the problem persists.
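For reference, this is roughly how I set it, as a minimal sketch (I am assuming the variable has to be set before TensorFlow initializes the GPU, otherwise it would be ignored):
import os

# Assumption: the allocator is only read when TensorFlow initializes the GPU,
# so the variable is set before tensorflow is imported.
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

import tensorflow as tf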
Nevertheless, the issue seems to disappear if I reduce the dataset size, but I need to use the whole dataset.
In order to handle this problem, my key ideas are:
- Prevent a new model (and the related loss and training objects) from being recreated at every iteration. This would be the optimal solution, since it would always reuse the same model, obviously ensuring that it is "reset" before each new combination of hyperparameters, together with its loss and training objects. It is perhaps also the most complicated solution, since I don't know whether the libraries I chose allow it.
- Verify that the problem is caused by the data rather than the model (i.e. I would not want the same data to be reallocated for each combination of hyperparameters, leaving the old copies in memory). This could also be a cause, and the fix would probably be simpler than the previous one, but I consider it a less likely cause. In any case it should be checked (see the memory-logging sketch further below).
- Reset the memory at each combination of hyperparameters by invoking the garbage collector (I don't know whether it works on the GPU too). This is the easiest solution and probably the first thing I would try, but it doesn't necessarily work: if the libraries keep references to objects in memory (even if they are no longer used), the garbage collector will not free them. See the sketch just below this list.
Also, with the TensorFlow backend the current model is not destroyed, so I need to clear the session as well.
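For the last idea (and the session clearing just mentioned), this is a minimal sketch of what I have in mind; the wrapper mirrors my VAEWrapper below, and whether keras.backend.clear_session() plus gc.collect() actually releases GPU memory between combinations is exactly what I am unsure about:
import gc
from keras import backend as K

class VAEWrapper:
    def __init__(self, **kwargs):
        self._init_kwargs = kwargs  # keep the constructor arguments to rebuild the model
        self.vae = VAE(**kwargs)
        self.vae.compile(Adam())

    def fit(self, x, y, **kwargs):
        K.clear_session()  # drop the state kept by the TensorFlow backend
        gc.collect()       # release unreferenced Python objects
        # Rebuild a fresh model for this combination; note that this still reuses
        # the encoder/decoder instances passed at construction time.
        self.vae = VAE(**self._init_kwargs)
        self.vae.compile(Adam())
        self.vae.fit(x, y, **kwargs)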
These are the involved functions:
def grid_search_vae(x_train, latent_dimension):
param_grid = {
'epochs': [2500],
'l_rate': [10 ** -4, 10 ** -5, 10 ** -6, 10 ** -7],
'batch_size': [32, 64], # [2, 4, 8, 16] won't fix the issue
'patience': [30]
}
ssim_scorer = make_scorer(my_ssim, greater_is_better=True)
grid = GridSearchCV(
VAEWrapper(encoder=Encoder(latent_dimension), decoder=Decoder()),
param_grid, scoring=ssim_scorer, cv=5, refit=False
)
grid.fit(x_train, x_train)
return grid
def refit(fitted_grid, x_train, y_train, latent_dimension):
best_epochs = fitted_grid.best_params_["epochs"]
best_l_rate = fitted_grid.best_params_["l_rate"]
best_batch_size = fitted_grid.best_params_["batch_size"]
best_patience = fitted_grid.best_params_["patience"]
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2)
encoder = Encoder(latent_dimension)
decoder = Decoder()
vae = VAE(encoder, decoder, best_epochs, best_l_rate, best_batch_size)
vae.compile(Adam(best_l_rate))
early_stopping = EarlyStopping("val_loss", patience=best_patience)
history = vae.fit(x_train, x_train, best_batch_size, best_epochs,
validation_data=(x_val, x_val), callbacks=[early_stopping])
return history, vae
And this is the main code:
if __name__ == '__main__':
x_train, x_test, y_train, y_test = load_data("data", "labels")
# Reducing data set size will fix the issue
# new_size = 200
# x_train, y_train = reduce_size(x_train, y_train, new_size)
# x_test, y_test = reduce_size(x_test, y_test, new_size)
latent_dimension = 25
grid = grid_search_vae(x_train, latent_dimension)
history, vae = refit(grid, x_train, y_train, latent_dimension)
Can you help me?
If you need it, these are the GPUs:
2023-09-18 11:21:25.628286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 7347 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1
2023-09-18 11:21:25.629120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 7371 MB memory: -> device: 1, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1
2023-09-18 11:21:31.911969: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600
and I am using TensorFlow as the Keras backend, that is:
from keras import backend as K
K.backend() # 'tensorflow'
I also tried to add:
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
at the beginning of the main code (as the first instructions), but this didn't help.
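To check whether memory actually accumulates across the hyperparameter combinations (my second idea above), I also thought about logging the GPU memory once per fit, something like this sketch (assuming tf.config.experimental.get_memory_info is available in my TensorFlow version):
import tensorflow as tf

def log_gpu_memory(tag):
    # 'current' and 'peak' are reported in bytes for the first visible GPU
    info = tf.config.experimental.get_memory_info('GPU:0')
    print(f"[{tag}] current: {info['current'] / 1024 ** 2:.1f} MiB, "
          f"peak: {info['peak'] / 1024 ** 2:.1f} MiB")

# e.g. called at the top of VAEWrapper.fit(), once per hyperparameter combination
log_gpu_memory("before fit")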
If you need the code for the models, here it is:
import numpy as np
import tensorflow as tf
from keras.initializers import he_uniform
from keras.layers import Conv2DTranspose, BatchNormalization, Reshape, Dense, Conv2D, Flatten
from keras.optimizers.legacy import Adam
from keras.src.callbacks import EarlyStopping
from skimage.metrics import structural_similarity as ssim
from sklearn.base import BaseEstimator
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import train_test_split, GridSearchCV
from tensorflow import keras
class VAEWrapper:
def __init__(self, **kwargs):
self.vae = VAE(**kwargs)
self.vae.compile(Adam())
def fit(self, x, y, **kwargs):
self.vae.fit(x, y, **kwargs)
def get_config(self):
return self.vae.get_config()
def get_params(self, deep):
return self.vae.get_params(deep)
def set_params(self, **params):
return self.vae.set_params(**params)
class VAE(keras.Model, BaseEstimator):
def __init__(self, encoder, decoder, epochs=None, l_rate=None, batch_size=None, patience=None, **kwargs):
super().__init__(**kwargs)
self.encoder = encoder
self.decoder = decoder
self.epochs = epochs # For grid search
self.l_rate = l_rate # For grid search
self.batch_size = batch_size # For grid search
self.patience = patience # For grid search
self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
self.reconstruction_loss_tracker = keras.metrics.Mean(name="reconstruction_loss")
self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")
def call(self, inputs, training=None, mask=None):
_, _, z = self.encoder(inputs)
outputs = self.decoder(z)
return outputs
@property
def metrics(self):
return [
self.total_loss_tracker,
self.reconstruction_loss_tracker,
self.kl_loss_tracker,
]
def train_step(self, data):
data, labels = data
with tf.GradientTape() as tape:
# Forward pass
z_mean, z_log_var, z = self.encoder(data)
reconstruction = self.decoder(z)
# Compute losses
reconstruction_loss = tf.reduce_mean(
tf.reduce_sum(
keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
)
)
kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
total_loss = reconstruction_loss + kl_loss
# Compute gradient
grads = tape.gradient(total_loss, self.trainable_weights)
# Update weights
self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
# Update metrics
self.total_loss_tracker.update_state(total_loss)
self.reconstruction_loss_tracker.update_state(reconstruction_loss)
self.kl_loss_tracker.update_state(kl_loss)
return {
"loss": self.total_loss_tracker.result(),
"reconstruction_loss": self.reconstruction_loss_tracker.result(),
"kl_loss": self.kl_loss_tracker.result(),
}
def test_step(self, data):
data, labels = data
# Forward pass
z_mean, z_log_var, z = self.encoder(data)
reconstruction = self.decoder(z)
# Compute losses
reconstruction_loss = tf.reduce_mean(
tf.reduce_sum(
keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
)
)
kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
total_loss = reconstruction_loss + kl_loss
# Update metrics
self.total_loss_tracker.update_state(total_loss)
self.reconstruction_loss_tracker.update_state(reconstruction_loss)
self.kl_loss_tracker.update_state(kl_loss)
return {
"loss": self.total_loss_tracker.result(),
"reconstruction_loss": self.reconstruction_loss_tracker.result(),
"kl_loss": self.kl_loss_tracker.result(),
}
@keras.saving.register_keras_serializable()
class Encoder(keras.layers.Layer):
def __init__(self, latent_dimension):
super(Encoder, self).__init__()
self.latent_dim = latent_dimension
seed = 42
self.conv1 = Conv2D(filters=64, kernel_size=3, activation="relu", strides=2, padding="same",
kernel_initializer=he_uniform(seed))
self.bn1 = BatchNormalization()
self.conv2 = Conv2D(filters=128, kernel_size=3, activation="relu", strides=2, padding="same",
kernel_initializer=he_uniform(seed))
self.bn2 = BatchNormalization()
self.conv3 = Conv2D(filters=256, kernel_size=3, activation="relu", strides=2, padding="same",
kernel_initializer=he_uniform(seed))
self.bn3 = BatchNormalization()
self.flatten = Flatten()
self.dense = Dense(units=100, activation="relu")
self.z_mean = Dense(latent_dimension, name="z_mean")
self.z_log_var = Dense(latent_dimension, name="z_log_var")
self.sampling = sample
def call(self, inputs, training=None, mask=None):
x = self.conv1(inputs)
x = self.bn1(x)
x = self.conv2(x)
x = self.bn2(x)
x = self.conv3(x)
x = self.bn3(x)
x = self.flatten(x)
x = self.dense(x)
z_mean = self.z_mean(x)
z_log_var = self.z_log_var(x)
z = self.sampling(z_mean, z_log_var)
return z_mean, z_log_var, z
@keras.saving.register_keras_serializable()
class Decoder(keras.layers.Layer):
def __init__(self):
super(Decoder, self).__init__()
self.dense1 = Dense(units=4096, activation="relu")
self.bn1 = BatchNormalization()
self.dense2 = Dense(units=1024, activation="relu")
self.bn2 = BatchNormalization()
self.dense3 = Dense(units=4096, activation="relu")
self.bn3 = BatchNormalization()
seed = 42
self.reshape = Reshape((4, 4, 256))
self.deconv1 = Conv2DTranspose(filters=256, kernel_size=3, activation="relu", strides=2, padding="same",
kernel_initializer=he_uniform(seed))
self.bn4 = BatchNormalization()
self.deconv2 = Conv2DTranspose(filters=128, kernel_size=3, activation="relu", strides=1, padding="same",
kernel_initializer=he_uniform(seed))
self.bn5 = BatchNormalization()
self.deconv3 = Conv2DTranspose(filters=128, kernel_size=3, activation="relu", strides=2, padding="valid",
kernel_initializer=he_uniform(seed))
self.bn6 = BatchNormalization()
self.deconv4 = Conv2DTranspose(filters=64, kernel_size=3, activation="relu", strides=1, padding="valid",
kernel_initializer=he_uniform(seed))
self.bn7 = BatchNormalization()
self.deconv5 = Conv2DTranspose(filters=64, kernel_size=3, activation="relu", strides=2, padding="valid",
kernel_initializer=he_uniform(seed))
self.bn8 = BatchNormalization()
self.deconv6 = Conv2DTranspose(filters=1, kernel_size=2, activation="sigmoid", padding="valid",
kernel_initializer=he_uniform(seed))
def call(self, inputs, training=None, mask=None):
x = self.dense1(inputs)
x = self.bn1(x)
x = self.dense2(x)
x = self.bn2(x)
x = self.dense3(x)
x = self.bn3(x)
x = self.reshape(x)
x = self.deconv1(x)
x = self.bn4(x)
x = self.deconv2(x)
x = self.bn5(x)
x = self.deconv3(x)
x = self.bn6(x)
x = self.deconv4(x)
x = self.bn7(x)
x = self.deconv5(x)
x = self.bn8(x)
decoder_outputs = self.deconv6(x)
return decoder_outputs
In short: the GPU ran out of memory. How can I invoke the garbage collector to clean the GPU memory at each combination of hyperparameters when using GridSearchCV?