Monday 15 March 2021

Tensorflow / Keras : How does the embedding layer work with GPUs

Here is a fully reproducible example:

#tensorflow-gpu version 2.3.1
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import *
from tensorflow.keras.models import Sequential, load_model, Model 

embed_dim = 64; maxlen = 152; vocab_size = 8
K.clear_session()
X_tk = np.random.randint(1, vocab_size, (10, maxlen))           #indices in [1, vocab_size)
X_mask_tk = np.random.randint(1, vocab_size + 1, (10, maxlen))  #the +1 is for the mask token, so the max index equals vocab_size
l = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
# print(l(X_tk)) #this works
# print(l(X_mask_tk)) #this doesn't work

model = keras.Sequential([layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)])
# print(model(X_tk)) #this works
# print(model(X_mask_tk)) #this doesn't work

model2 = keras.Sequential([layers.Input(shape=(maxlen,)), layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)])
model2.compile(optimizer='Adam', loss='sparse_categorical_crossentropy')
# model2.fit(X_mask_tk, X_tk) #This doesn't work

strategy = tf.distribute.MirroredStrategy()
with strategy.scope(): #I'm using 2 GPUs, but I reckon this issue would occur even with 1 GPU

    model3 = keras.Sequential([layers.Input(shape=(maxlen,)), layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)])
    model3.compile(optimizer='Adam', loss='sparse_categorical_crossentropy')
model3.fit(X_mask_tk, X_tk) #This works

The comments indicate which lines work and which lines don't work.

Essentially, the problem is the following. When creating an embedding layer, one has to specify the vocabulary size. If you feed it a tensor containing a value equal to or greater than the vocabulary size, the embedding layer throws an error. This is expected, based on the function definition.
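
As a minimal illustration of that expected behaviour (a sketch added here, assuming CPU execution and TF 2.x): the layer gathers rows from an (input_dim, output_dim) weight matrix, so an index equal to or above input_dim has no row to look up.

import numpy as np
import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=8, output_dim=4)
print(emb(np.array([[0, 3, 7]])).shape)  # (1, 3, 4): every index is < input_dim

try:
    emb(np.array([[0, 3, 8]]))  # 8 == input_dim, one past the last row of the weight matrix
except tf.errors.InvalidArgumentError as err:
    print("Out-of-range lookup failed:", err.message)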

However, if I initialize the embedding layer inside strategy.scope(), i.e. in a GPU environment, the error is suddenly no longer thrown: the embedding layer happily processes tensors containing indices far greater than the specified vocabulary size.
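
To make the comparison concrete, here is a small diagnostic sketch (added for reference; not part of the original repro, and the variable types, devices, and error behaviour will vary with the setup). It builds the layer once eagerly and once under the strategy scope, prints what kind of variable holds the weights and where it lives, and shows what an out-of-range lookup does in each case:

import numpy as np
import tensorflow as tf

def probe(embedding_layer, tag):
    embedding_layer.build((None,))  # force creation of the weight matrix
    weights = embedding_layer.embeddings
    print(tag, type(weights).__name__, getattr(weights, "device", "unknown device"))
    try:
        out = embedding_layer(np.array([[7, 8]]))  # 8 is out of range for input_dim=8
        print(tag, "lookup of index 8 returned:", out.numpy()[0, 1])
    except tf.errors.InvalidArgumentError as err:
        print(tag, "lookup of index 8 raised:", err.message)

probe(tf.keras.layers.Embedding(input_dim=8, output_dim=4), "eager:")

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    probe(tf.keras.layers.Embedding(input_dim=8, output_dim=4), "mirrored:")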

Questions:

  1. What is causing this inconsistent embedding-layer behavior, and how do I fix it? (A defensive workaround sketch follows this list.)
  2. What other layers can be affected by this?
  3. What exactly happens under the distributed strategy? I read the documentation, but something else seems to be going on with how the values are treated.
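
For question 1, a defensive workaround sketch (added here, not part of the original repro): size the embedding so the mask token's index is a valid row, and fail fast if any index would still be out of range, so the behaviour no longer depends on where the layer happens to run.

import numpy as np
import tensorflow as tf

vocab_size, embed_dim, maxlen = 8, 64, 152
X_mask_tk = np.random.randint(1, vocab_size + 1, (10, maxlen))  # max index == vocab_size (the mask token)

# Widen the vocabulary so the mask token's index has a row in the weight matrix.
emb = tf.keras.layers.Embedding(input_dim=vocab_size + 1, output_dim=embed_dim)

# Fail fast, on any device, if an index is still out of range.
assert X_mask_tk.max() < emb.input_dim, "token index out of range for the embedding layer"
print(emb(X_mask_tk).shape)  # (10, 152, 64)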

I looked around GitHub issues and other Stack Overflow posts but couldn't find a similar problem; hopefully I didn't miss it.


