Monday, 28 December 2020

Word2Vec for Naive Bayes classifier

I would like to include word2vec features in my Naive Bayes model.

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
import numpy as np

naiveb_pipeline = Pipeline([
        ('NBCV', FS.countV),
        ('nb_clf', MultinomialNB())])

naiveb_pipeline.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_nb = naiveb_pipeline.predict(DataPrep.test['Text'])
np.mean(predicted_nb == DataPrep.test['Label'])
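For reference, the accuracy above is just a raw mean over the predictions; the same output can be summarised more fully with sklearn.metrics (a sketch, the evaluation snippet is my own addition):

from sklearn.metrics import accuracy_score, classification_report

# fuller breakdown of the same predictions (per-class precision/recall/F1)
print(accuracy_score(DataPrep.test['Label'], predicted_nb))
print(classification_report(DataPrep.test['Label'], predicted_nb))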

where FS is

#creating feature vectors
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

countV = CountVectorizer()
train_count = countV.fit_transform(DataPrep.train['Text'].values)

tfidfV = TfidfTransformer()
train_tfidf = tfidfV.fit_transform(train_count)

tfidf_ngram = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
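For the tf-idf comparison, tfidf_ngram is meant to be dropped into an analogous pipeline, roughly like this (a sketch following the pattern of the linked repository; the pipeline name is my own):

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# hypothetical n-gram tf-idf pipeline, mirroring naiveb_pipeline above
nb_pipeline_ngram = Pipeline([
        ('nb_tfidf', FS.tfidf_ngram),
        ('nb_clf', MultinomialNB())])

nb_pipeline_ngram.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_nb_ngram = nb_pipeline_ngram.predict(DataPrep.test['Text'])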

and DataPrep is

import pandas as pd
from sklearn.model_selection import train_test_split

X = train['Text']
y = train['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

dataset = pd.concat([X_train, y_train], axis=1)

plus POS tagging and stemming.

Now I am trying to use word2vec in the model in order to compare it with the tf-idf approach mentioned above. I wrote the following to be included in FS:

#Using Word2Vec (GloVe vectors as a plain token -> vector dict)
with open("path/glove.6B/glove.6B.50d.txt", encoding="utf8") as lines:
    w2v = {line.split()[0]: np.array(line.split()[1:], dtype=float)
           for line in lines}
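A quick sanity check on the loaded dictionary (a sketch; 'the' is just an example token that exists in glove.6B):

# every value should be a 50-dimensional vector
print(len(w2v))            # vocabulary size
print(w2v['the'].shape)    # (50,)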

class MeanEmbeddingVectorizer(object):
    """Represents a document as the mean of its in-vocabulary word vectors."""

    def __init__(self, word2vec):
        self.word2vec = word2vec
        # embedding dimensionality, read off any one vector
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X is an iterable of token lists; documents with no known
        # words fall back to the zero vector
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

but I do not know whether this is the right way to proceed within the FS module. A useful reference is the following: https://github.com/nishitpatel01/Fake_News_Detection
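My tentative idea for wiring it in is the sketch below. Two assumptions of mine: the texts are tokenised first, because transform iterates over tokens (passing raw strings would make it iterate over characters), and GaussianNB replaces MultinomialNB, because averaged embeddings contain negative values that MultinomialNB rejects:

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB

# tokenise: MeanEmbeddingVectorizer expects lists of tokens, not raw strings
train_tokens = [text.split() for text in DataPrep.train['Text']]
test_tokens = [text.split() for text in DataPrep.test['Text']]

w2v_pipeline = Pipeline([
        ('w2v_vec', MeanEmbeddingVectorizer(w2v)),
        ('nb_clf', GaussianNB())])  # dense, possibly negative features

w2v_pipeline.fit(train_tokens, DataPrep.train['Label'])
predicted_w2v = w2v_pipeline.predict(test_tokens)
np.mean(predicted_w2v == DataPrep.test['Label'])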

I have more code, but I think the above is the most relevant part. I will be happy to provide more code if requested.



from Word2Vec for Naive Bayes classifier
