A very strange thing is happening. I fetched the IMDb corpus from Kaggle, kept only the 50,000 positive and negative reviews, counted word frequencies, sorted the words by decreasing frequency, replaced each of the 10,000 most frequent words in the texts by its rank (plus 3), inserted a 1 at the beginning of every sentence, and replaced every word outside the 10,000 most frequent by the number 2. In this I followed exactly the instructions given in the documentation of the Keras imdb dataset.
I then ran an RNN with an Embedding layer, a SimpleRNN layer and a Dense layer. No matter how hard I try, the accuracy I get is always around 0.5. When I instead use imdb.load_data(num_words=10000), I get an accuracy of 0.86 as early as the third epoch. How is this possible? Why such an extreme difference? What have I done wrong?
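To make the target encoding explicit, here is a minimal sketch of the convention I am trying to reproduce (the toy vocabulary and sentence are made up; the constants 1 and 2 and the offset of 3 follow the defaults of imdb.load_data, i.e. start_char=1, oov_char=2, index_from=3):

```python
# Toy illustration of the Keras imdb encoding convention.
# Words are ranked by frequency; rank 0 = most frequent word.
ranks = {"the": 0, "movie": 1, "good": 2}  # hypothetical top of the vocabulary

def encode_sentence(tokens, ranks):
    # 1 marks the sentence start, rank+3 encodes an in-vocabulary word,
    # and 2 stands for any out-of-vocabulary word.
    out = [1]
    for tok in tokens:
        out.append(ranks[tok] + 3 if tok in ranks else 2)
    return out

print(encode_sentence(["the", "movie", "costner"], ranks))  # [1, 3, 4, 2]
```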
Here is the code I used:
import re, os, time, pickle

word = re.compile(r'^[A-Za-z]+$')
spacyline = re.compile(r'^([0-9]+) ([^ ]+) ([^ ]+) ([^ ]+) ([0-9]+) ([A-Za-z]+)')

# First pass: count lemma frequencies over the whole corpus
DICT = dict()
inp = open("imdb_master_lemma.txt")
for ligne in inp:
    if ligne[0:9] == "DEBUT DOC":
        if ligne[-4:-1] == "neg":
            classe = -1
        elif ligne[-4:-1] == "pos":
            classe = 1
    elif ligne[0:9] == "FIN DOCUM":
        pass
    else:
        res = spacyline.findall(ligne)
        if res:
            lemma = str(res[0][3])
            if word.match(lemma):
                if lemma in DICT:
                    DICT[lemma] += 1
                else:
                    DICT[lemma] = 1
inp.close()
# Sort words by decreasing frequency and map each kept word to its rank
SORTED_WORDS = sorted(DICT.keys(), key=lambda x: DICT[x], reverse=True)
THOUSAND = SORTED_WORDS[:9997]
ORDRE = dict()
c = 0
for w in THOUSAND:
    ORDRE[w] = c
    c += 1
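As an aside, the counting and ranking passes above can be written more compactly with collections.Counter; this is just an equivalent sketch on a made-up token stream, not the code I actually ran:

```python
from collections import Counter

# Hypothetical token stream standing in for the lemmas read from the file.
lemmas = ["movie", "good", "movie", "the", "the", "the"]

counts = Counter(lemmas)
# most_common() sorts by decreasing frequency, like SORTED_WORDS above.
ranks = {w: r for r, (w, _) in enumerate(counts.most_common(9997))}

print(ranks)  # {'the': 0, 'movie': 1, 'good': 2}
```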
# Second pass: encode each document as a list of ranks
CORPUS = []
CLASSES = []
inp = open("imdb_master_lemma.txt")
for ligne in inp:
    if ligne[0:9] == "DEBUT DOC":
        if ligne[-4:-1] == "neg":
            classe = 0
        elif ligne[-4:-1] == "pos":
            classe = 1
        a = []
    if ligne[0:9] == "DEBUT PHR":
        a.append(1)
    elif ligne[0:9] == "FIN DOCUM":
        CORPUS.append(a)
        CLASSES.append(classe)
    else:
        res = spacyline.findall(ligne)
        if res:
            lemma = str(res[0][3])
            if lemma in ORDRE:
                a.append(ORDRE[lemma] + 3)
            elif word.match(lemma):
                a.append(2)
inp.close()
from sklearn.utils import shuffle
CORPUS, CLASSES = shuffle(CORPUS, CLASSES)

out = open("keras1000.pickle", "wb")
pickle.dump((CORPUS, CLASSES, ORDRE), out)
out.close()
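(sklearn.utils.shuffle applies the same permutation to all of its arguments, so each encoded text stays paired with its label; a toy check, with made-up data:)

```python
from sklearn.utils import shuffle

texts = [[1, 5], [1, 7], [1, 9]]
labels = [0, 1, 0]

# Both lists are permuted identically, so the text/label pairing survives.
s_texts, s_labels = shuffle(texts, labels, random_state=0)
assert set(zip(map(tuple, s_texts), s_labels)) == \
       set(zip(map(tuple, texts), labels))
```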
The file imdb_master_lemma.txt contains the IMDb texts processed by spaCy, of which I keep only the lemma (already lowercased, so this is more or less what the Keras imdb dataset uses; if anything it should work even better, since plurals are removed and verbs are lemmatized). Once the pickle file is stored, I reload it and use it as follows:
picklefile = open("keras1000.pickle", "rb")
(CORPUS, CLASSES, ORDRE) = pickle.load(picklefile)
picklefile.close()

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = np.array(vectorize_sequences(CORPUS[:25000]), dtype=object)
x_test = np.array(vectorize_sequences(CORPUS[25000:]), dtype=object)
train_labels = np.array(CLASSES[:25000])
test_labels = np.array(CLASSES[25000:])
from keras import models
from keras import layers
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding, SimpleRNN, LSTM, Bidirectional
from keras.preprocessing import sequence

input_train = sequence.pad_sequences(x_train, maxlen=500)
input_test = sequence.pad_sequences(x_test, maxlen=500)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(input_train,
                    train_labels,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

results = model.evaluate(input_test, test_labels)
print(results)
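For reference, here is what the vectorize_sequences helper (taken from Chollet's book) does on a toy input: it produces one fixed-length multi-hot vector per review, with a 1 at every word index that occurs. The dimension of 5 is only for illustration:

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per sequence; set a 1 at every index present in the sequence.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# A "review" containing word indices 1 and 3 becomes a single multi-hot row.
print(vectorize_sequences([[1, 3, 3]], dimension=5))  # [[0. 1. 0. 1. 0.]]
```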
The result is utterly disappointing: an accuracy around 0.5. When I replace the first 14 lines by
from keras.datasets import imdb
(x_train, train_labels), (x_test, test_labels) = imdb.load_data(num_words=10000)
then everything works as described in Chollet's book and I immediately get a very high accuracy.
Can anyone tell me what I am doing wrong?
PS. Here is a small sample of the data, to illustrate the preparation process. The first two sentences of the first IMDb document, as processed by spaCy, are:
DEBUT DOCUMENT neg
DEBUT PHRASE
0 Once RB once 1 advmod
1 again RB again 5 advmod
2 Mr. NNP mr. 3 compound
3 Costner NNP costner 5 nsubj
4 has VBZ have 5 aux
5 dragged VBN drag 5 ROOT
6 out RP out 5 prt
7 a DT a 8 det
8 movie NN movie 5 dobj
9 for IN for 5 prep
10 far RB far 11 advmod
11 longer JJR long 9 pcomp
12 than IN than 11 prep
13 necessary JJ necessary 12 amod
14 . . . 5 punct
FIN PHRASE
DEBUT PHRASE
15 Aside RB aside 16 advmod
16 from IN from 33 prep
17 the DT the 21 det
18 terrific JJ terrific 19 amod
19 sea NN sea 21 compound
20 rescue NN rescue 21 compound
21 sequences NNS sequence 16 pobj
22 , , , 21 punct
23 of IN of 26 prep
24 which WDT which 23 pobj
25 there EX there 26 expl
26 are VBP be 21 relcl
27 very RB very 28 advmod
28 few JJ few 26 acomp
29 I PRP -PRON- 33 nsubj
30 just RB just 33 advmod
31 did VBD do 33 aux
32 not RB not 33 neg
33 care VB care 33 ROOT
34 about IN about 33 prep
35 any DT any 34 pobj
36 of IN of 35 prep
37 the DT the 38 det
38 characters NNS character 36 pobj
39 . . . 33 punct
FIN PHRASE
This becomes:
[1, 258, 155, 5920, 13, 979, 38, 6, 14, 17, 207, 165, 68, 1526, 1, 1044, 33, 3, 1212, 1380, 1396, 382, 7, 58, 34, 4, 51, 150, 37, 19, 12, 338, 39, 91, 7, 3, 46,
etc. As you can see, the 1 marks the sentence beginning, 258 is once, 155 is again; I have missed mr. because it contains a period (but that can hardly be the reason my system is failing); 5920 is costner (apparently Kevin Costner's name appears so often that it made it into the 10,000 most frequent words), 13 is have, 979 is drag, 38 is out, 6 is the article a, 14 is the word movie, and so on. These ranks all look very reasonable to me, so I don't see what may have gone wrong.
From: When processing IMDB data prepared by myself with a Keras RNN, accuracy never exceeds 0.5