I would like to include word2vec features in my Naive Bayes model.
naiveb_pipeline = Pipeline([
        ('NBCV', FS.countV),
        ('nb_clf', MultinomialNB())])

naiveb_pipeline.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_nb = naiveb_pipeline.predict(DataPrep.test['Text'])
np.mean(predicted_nb == DataPrep.test['Label'])
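For the tf-idf comparison I build the analogous pipeline around the n-gram vectorizer (a sketch from memory; the exact step and variable names in my code may differ):

# tf-idf counterpart of the Naive Bayes pipeline above,
# using the unigram+bigram vectorizer from FS
nb_pipeline_ngram = Pipeline([
        ('nb_tfidf', FS.tfidf_ngram),
        ('nb_clf', MultinomialNB())])

nb_pipeline_ngram.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_nb_ngram = nb_pipeline_ngram.predict(DataPrep.test['Text'])
np.mean(predicted_nb_ngram == DataPrep.test['Label'])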
where FS is
#creating feature vectors
countV = CountVectorizer()       # bag-of-words counts
train_count = countV.fit_transform(DataPrep.train['Text'].values)

tfidfV = TfidfTransformer()      # tf-idf weights on top of the counts
train_tfidf = tfidfV.fit_transform(train_count)

# unigram + bigram tf-idf vectorizer, used for the n-gram pipeline above
tfidf_ngram = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
and DataPrep is
X = train['Text']
y = train['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
dataset = pd.concat([X_train, y_train], axis=1)
plus POS tagging and stemming.
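Since the pipelines refer to DataPrep.train and DataPrep.test, I assume the test frame is rebuilt from the split in the same way as the training one (my guess; this part is not shown above):

# assumed reconstruction of the frames the pipelines use
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)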
Now I am trying to use word2vec in the model in order to compare it with the tf-idf features mentioned above. I wrote the following to be included in FS:
#Using Word2Vec (here: pre-trained GloVe vectors read into a dict of word -> 50-d array)
with open("path/glove.6B/glove.6B.50d.txt", "r", encoding="utf-8") as lines:
    w2v = {line.split()[0]: np.array(line.split()[1:], dtype=float)
           for line in lines}
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # embedding dimensionality (50 for glove.6B.50d)
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # average the vectors of a document's in-vocabulary words;
        # fall back to a zero vector when none of the words are known
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
but I do not know whether this is the right way to proceed within the FS program. A useful link is the following: https://github.com/nishitpatel01/Fake_News_Detection
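To make the question concrete, this is how I would expect to plug it in (only a sketch: the tokenizer step is my guess, since the transformer above iterates over words and therefore needs tokenized documents rather than raw strings, and I swap in GaussianNB because MultinomialNB rejects the negative values that averaged embeddings contain):

from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# hypothetical wiring: tokenize, average the embeddings, then Gaussian NB
tokenize = FunctionTransformer(
    lambda docs: [doc.split() for doc in docs], validate=False)

w2v_pipeline = Pipeline([
        ('tokens', tokenize),
        ('w2v', MeanEmbeddingVectorizer(w2v)),
        ('nb_clf', GaussianNB())])

w2v_pipeline.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_w2v = w2v_pipeline.predict(DataPrep.test['Text'])
np.mean(predicted_w2v == DataPrep.test['Label'])

The split() tokenizer is just a placeholder; presumably the tokens should come from the same POS-tagging/stemming preprocessing mentioned above.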
I have more code, but I think the above is the most relevant part. I will be happy to provide more code if requested.