Hemant Vishwakarma: Validation accuracy is very less than Training accuracy

Thursday, 12 August 2021

Validation accuracy is very less than Training accuracy

I am using MOSI dataset for the multimodal sentiment analysis, where for now I am training model for text dataset only. For text, I am using glove embeddings of 300 dimensions for processing text. My total vocab size is 2173 and my padded sequence length is 30. My target array is [0,0,0,0,0,0,1] where left most is highly -ve and right most highly +ve.

I am splitting the dataset like this

X_train, X_test, y_train, y_test = train_test_split(WDatasetX, y7, test_size=0.20, random_state=42)

My tokenization process is

MAX_NB_WORDS = 3000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS,oov_token = "OOV")
tokenizer.fit_on_texts(Text_X_Train)
tokenized_X_train = tokenizer.texts_to_sequences(Text_X_Train)
tokenized_X_test = tokenizer.texts_to_sequences(Text_X_Test)

My embedding matrix:

vocab_size = len(tokenizer.word_index)+1
emb_mean=0
def embedding_matrix_filteration():
    all_embs = np.stack(list(embeddings_index.values()))
    print(all_embs.shape)
    emb_mean, emb_std = np.mean(all_embs), np.std(all_embs)
    print(emb_mean)
    embedding_matrix = np.random.normal(emb_mean, emb_std, (vocab_size, embed_dim)) gives the matrix of specified
                                                                    size filled with values from gauss distribution
    print(embedding_matrix.shape)
     print("length of word2id:",len(word2id))
    embeddedCount = 0
    not_found = []
    for word, idx in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word.lower())
        if word == ' ':
            embedding_vector = np.zeros_like(emb_mean)
        if embedding_vector is not None: 
            embedding_matrix[idx] = embedding_vector
            embeddedCount += 1
        else:
            print(word)
            print("$$$")
    print('total embedded:',embeddedCount,'common words')# words common between glove vector and wordset
    print("length of word2id:",len(word2id))
    print(len(embedding_matrix))
    return embedding_matrix

emb = embedding_matrix_filteration()

Model Architecture:

Embedding Layer:

embedding_layer = Embedding(
    vocab_size,
    300,
    weights=[emb],
    trainable=False,
    input_length=sequence_length
)

My model:

from keras import regularizers,layers

model = Sequential()
model.add(embedding_layer)
model.add(Bidirectional(layers.LSTM(512,return_sequences=True)))
model.add(Bidirectional(layers.LSTM(512,return_sequences=True)))
model.add(Bidirectional(layers.LSTM(256,return_sequences=True)))
model.add(Bidirectional(layers.LSTM(256)))#kernel_regularizer=regularizers.l2(0.001)
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(7, activation='softmax'))

For some reason when my training accuracy reached 80%, val. accuracy still remains very low. I have tried different regularization techniques, optimizers, loss functions, but the result is the same. I don't know why.