Wednesday, 20 November 2019

Error on prediction running keras multi_gpu_model

I have an issue running a Keras model on a Google Cloud Platform instance.
The model is the following:

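import tensorflow as tf
from keras.models import Sequential
from keras.layers import CuDNNLSTM, LeakyReLU, RepeatVector, TimeDistributed, Dense
from keras.utils import multi_gpu_model

n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]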

train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))

verbose, epochs, batch_size = 1, 1, 64  # low number of epochs just for testing purpose
with tf.device('/cpu:0'):
    m = Sequential()
    m.add(CuDNNLSTM(20, input_shape=(n_timesteps, n_features)))
    m.add(LeakyReLU(alpha=0.1))
    m.add(RepeatVector(n_outputs))
    m.add(CuDNNLSTM(20, return_sequences=True))
    m.add(LeakyReLU(alpha=0.1))
    m.add(TimeDistributed(Dense(20)))
    m.add(LeakyReLU(alpha=0.1))
    m.add(TimeDistributed(Dense(1)))

self.model = multi_gpu_model(m, gpus=8)
self.model.compile(loss='mse', optimizer='adam')

self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)

As you can see from the code above, I run the model on a machine with 8 GPUs (Nvidia Tesla K80).
Training works without any errors. However, prediction fails with the following error:

W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'

Here is the prediction code:

self.model.predict(input_x)

What I've noticed is that if I remove the multi-GPU data-parallelism code, prediction works on a single GPU.
More precisely, if I comment out this line, the code runs without error:

self.model = multi_gpu_model(m, gpus=8)

What am I missing?

virtualenv information

cudatoolkit - 10.0.130
cudnn - 7.6.4
keras - 2.2.4
keras-applications - 1.0.8
keras-base - 2.2.4
keras-gpu - 2.2.4
python - 3.6

UPDATE

train_x.shape = (1441, 288, 1)
train_y.shape = (1441, 288, 1)
input_x.shape = (1, 288, 1)
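
With these shapes, each predict call passes a batch of a single sample to a model replicated on 8 GPUs. As far as I understand it, multi_gpu_model slices every incoming batch into one sub-batch per GPU, so a batch smaller than the GPU count would leave some replicas with zero samples. The sketch below only reproduces that slicing arithmetic in plain Python. The helper name sub_batch_sizes is hypothetical, and the link between an empty sub-batch and the CUDNN_STATUS_BAD_PARAM error is my assumption, not something I have confirmed.

```python
def sub_batch_sizes(batch_size, n_gpus):
    """Return how many samples each GPU would get when one batch is
    split into n_gpus slices (hypothetical helper, assumed slicing rule:
    equal integer steps, with the last GPU taking the remainder)."""
    step = batch_size // n_gpus
    sizes = [step] * (n_gpus - 1)
    sizes.append(batch_size - step * (n_gpus - 1))
    return sizes


# A full training batch of 64 versus the single-sample prediction batch.
for batch in (64, 1):
    print(batch, sub_batch_sizes(batch, 8))
# 64 [8, 8, 8, 8, 8, 8, 8, 8]
# 1 [0, 0, 0, 0, 0, 0, 0, 1]
```

If that assumption holds, seven of the eight CuDNNLSTM replicas would be asked to process an empty tensor during prediction.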

After Olivier Dehaene's reply, I tried his suggestion and it worked.
So I modified input_x so that its shape is (8, 288, 1).
To do that, I also had to modify the shapes of train_x and train_y.
Here is a recap:

train_x.shape = (8065, 288, 1)
train_y.shape = (8065, 288, 1)
input_x.shape = (8, 288, 1)

But now I get the same error during training, on this line:

self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
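
One thing that may explain the difference between the two training sets is the size of the final batch. With batch_size = 64, 1441 samples leave a last batch of 33, while 8065 samples leave a last batch of just 1, which is fewer than the 8 GPUs. The sketch below checks this with plain arithmetic. The helper names are hypothetical, and the idea that a batch smaller than the GPU count triggers the error is an assumption on my part.

```python
def batch_sizes(n_samples, batch_size):
    """Sizes of the batches that one epoch over n_samples would produce."""
    full, rest = divmod(n_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def undersized_batches(n_samples, batch_size, n_gpus):
    """Batches holding fewer samples than there are GPUs (hypothetical check)."""
    return [b for b in batch_sizes(n_samples, batch_size) if b < n_gpus]


print(undersized_batches(1441, 64, 8))  # [] : original training set
print(undersized_batches(8065, 64, 8))  # [1] : new training set
print(undersized_batches(8064, 64, 8))  # [] : one sample dropped
```

Under that assumption, dropping one sample (8064 = 126 * 64) or picking a batch size that leaves a remainder of at least 8 would avoid the undersized final batch.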

