Thursday, 2 November 2023

Deep Learning: Choice of architecture in the IMDB Classification toy example. Embeddings underperform simple baseline

I am comparing the performance of two different network architectures on the binary classification problem of IMDB movie reviews, presented in Chapter 3 of "Deep Learning With Python".

Loading the data:

import torch
from torch import nn
from keras.datasets import imdb  # the dataset loader used in the book

# num_words means only keep the `n` most common words in the vocab
(train_data,train_labels),(test_data,test_labels) = imdb.load_data(num_words=10_000)
train_xs = torch.vstack([multi_vectorize(t) for t in train_data]).to(torch.float)  # multi_vectorize defined below
train_ys = torch.tensor(train_labels).unsqueeze(dim=1).to(torch.float) # (25k,) -> (25k,1) to match the model output shape

## train/val split
val_xs = train_xs[0:10_000]
val_ys = train_ys[0:10_000]
partial_train_xs = train_xs[10_000:]
partial_train_ys = train_ys[10_000:]

The architecture the book uses is a simple sequential dense network:

Sequential(
  (0): Linear(in_features=10000, out_features=16, bias=True)
  (1): ReLU()
  (2): Linear(in_features=16, out_features=16, bias=True)
  (3): ReLU()
  (4): Linear(in_features=16, out_features=1, bias=True)
  (5): Sigmoid()
)
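
For reference, that printout corresponds to building the model like so (I'll call it baseline_model below):

baseline_model = nn.Sequential(
    nn.Linear(VOCAB_SZ, 16),  # VOCAB_SZ == 10,000
    nn.ReLU(),
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid()
)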

The inputs to this network are "multihot" encoded text snippets. For example, given a review with 80 words and an assumed total vocabulary of 10k words (the domain), each input would be a vector of 10k elements, with a 0 at each index whose corresponding vocabulary word does not appear in the review and a 1 at each index whose word does:

## VOCAB_SZ works out to 10,000 given num_words=10_000 above
VOCAB_SZ = max(max(s) for s in train_data) + 1
def multi_vectorize(seq):
    # 1 at every index whose token appears in the review, 0 everywhere else
    t = torch.zeros(VOCAB_SZ)
    for s in seq:
        t[s] = 1
    return t

This makes sense, but it also ignores duplicate words, since the input can only represent each word as present or absent.
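
A quick illustration with the vectorizer above, showing that repetition is simply lost:

a = multi_vectorize([5, 7, 7, 7, 7])
b = multi_vectorize([5, 7])
print(torch.equal(a, b))  # True: both "reviews" look identical to the network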

I know that embeddings are typically used as the first layer in NLP tasks, so I ran a basic experiment with an embedding layer added, assuming it would improve performance.


Adding an embedding layer:

I made a handful of modifications to add the embedding layer. First, inputs no longer need to be 10K-element vectors, since we aren't multihot encoding them. Instead, each input just needs to be the same length, which I accomplish by finding the longest input in the training set and left-padding every input to that length with a dedicated "pad token". The embedding layer is then (10001, 3): the first dimension is 10001, reflecting the vocab size of 10000 plus one for the newly added pad token; the second dimension is arbitrary and represents the dimensionality of each embedded token.

## make all token lists the same size.
MAX_INPUT_LEN = len(max(train_data,key=len))

## zero is already in the vocab, which starts tokens at zero
## since the max token is VOCAB_SZ - 1 (zero based) we can 
## use VOCAB_SZ as the start point for new special tokens
PAD_TOKEN = VOCAB_SZ

def lpad(x,maxlen=MAX_INPUT_LEN,pad_token=PAD_TOKEN):
    padlen = maxlen - len(x)
    if padlen > 0:
        return [pad_token] * padlen + x
    return x

EMB_SZ = 3
NUM_EMBED = VOCAB_SZ + 1 # 10,001: the vocab plus the special pad token
emb_model = nn.Sequential(
          nn.Embedding(NUM_EMBED,EMB_SZ),
          nn.Flatten(),
          nn.Linear(EMB_SZ * MAX_INPUT_LEN,16),
          nn.ReLU(),
          nn.Linear(16,16),
          nn.ReLU(),
          nn.Linear(16,1),
          nn.Sigmoid()
        )
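
The inputs to this model are the padded token sequences themselves (integer ids, not multihot floats), built roughly like this:

## pad every review to MAX_INPUT_LEN and stack into an integer tensor
emb_train_xs = torch.tensor([lpad(list(t)) for t in train_data], dtype=torch.long)
emb_val_xs = emb_train_xs[0:10_000]
emb_partial_train_xs = emb_train_xs[10_000:]

## sanity check: (batch, MAX_INPUT_LEN) ids in, (batch, 1) probabilities out
print(emb_train_xs.shape)
print(emb_model(emb_train_xs[:2]).shape)  # torch.Size([2, 1])

The labels (train_ys and the val/partial splits) are the same as before.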

Results / Questions:

This network takes an order of magnitude longer to train, and it also performs worse. I am unsure why it is so much slower to train (10 epochs take about 3 seconds in the first version versus about 30 seconds in this one). There is an extra layer of indirection due to the embeddings, but the total parameter count of this model is actually lower than the first version's, since we aren't using a 10K-element vector for every input. So that's point of confusion #1: why is it so much slower to train?
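
A quick way to check that claim is to count the learnable parameters directly (using baseline_model for the book architecture shown earlier):

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(baseline_model)) # 160,305 = (10_000*16 + 16) + (16*16 + 16) + (16*1 + 1)
print(n_params(emb_model))      # 10_001*3 for the embedding + (3*MAX_INPUT_LEN*16 + 16) + (16*16 + 16) + (16*1 + 1)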

Second is the performance. I would have thought an embedding layer would give the model more expressive capacity for the sentiment of the reviews. At the very least, it avoids "ignoring" tokens that repeat in the input the way the multihot version does. I experimented with smaller and larger embedding sizes, but I cannot seem to get above ~82% validation accuracy, and it takes ~80 epochs to get there. The first version reaches 90% validation accuracy after only 10 epochs. How should I think about why the embedding version performs worse?



