I've an issue running a keras model on a Google Cloud Platform instance.
The model is the following one:
n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
verbose, epochs, batch_size = 1, 1, 64 # low number of epochs just for testing purpose
with tf.device('/cpu:0'):
m = Sequential()
m.add(CuDNNLSTM(20, input_shape=(n_timesteps, n_features)))
m.add(LeakyReLU(alpha=0.1))
m.add(RepeatVector(n_outputs))
m.add(CuDNNLSTM(20, return_sequences=True))
m.add(LeakyReLU(alpha=0.1))
m.add(TimeDistributed(Dense(20)))
m.add(LeakyReLU(alpha=0.1))
m.add(TimeDistributed(Dense(1)))
self.model = multi_gpu_model(m, gpus=8)
self.model.compile(loss='mse', optimizer='adam')
self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
As you can see from the code above, I run the model on machine with 8 GPUs (Nvidia Tesla K80).
Train works well, without any errors. However, the prediction fails and returns the following error:
W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
Here the code of prediction:
self.model.predict(input_x)
What i've noticed is that if I remove the code for multi-GPU data parallelism, code works well using single GPU.
To be more precise, if i comment this line, code works without error
self.model = multi_gpu_model(m, gpus=8)
What am i missing?
virtualenv information
cudatoolkit - 10.0.130
cudnn - 7.6.4
keras - 2.2.4
keras-applications - 1.0.8
keras-base - 2.2.4
keras-gpu - 2.2.4
python - 3.6
UPDATE
train_x.shape = (1441, 288, 1)
train_y.shape = (1441, 288, 1)
input_x.shape = (1, 288, 1)
After Olivier Dehaene's reply I tried his suggestion and it worked.
So, I tried to modify the input_x shape in order to obtain (8, 288, 1).
In order to do that I've also modified train_x and train_y shapes.
Here a recap:
train_x.shape = (8065, 288, 1)
train_y.shape = (8065, 288, 1)
input_x.shape = (8, 288, 1)
But now i've the same error on the training phase, on this line:
self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
from Error on prediction running keras multi_gpu_model
No comments:
Post a Comment