Wednesday, 23 February 2022

Naive Bayes Classifier for a bag of vectorized sentences

Summary: How do you train a Naive Bayes classifier on a bag of vectorized sentences?

Example here:

X_train[0] = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
y_train[0] = 1

X_train[1] = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]
y_train[1] = 0



1) Context of the project: perform sentiment analysis on batches of tweets for market prediction

I am working on sentiment analysis for stock market prediction. As I am new to these techniques, I tried to replicate the approach from this article: http://cs229.stanford.edu/proj2015/029_report.pdf

But I am facing a big issue with it. Let me explain the main steps of the article that I carried out:

  1. I collected a huge amount of tweets over 4 months (7 million)
  2. I cleaned them (removing stop words, hashtags, mentions, punctuation, etc.)
  3. I grouped them into 1-hour intervals
  4. I created a target that tells whether the price of Bitcoin went down or up over the following hour (0 = down; 1 = up)
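
For reference, here is a minimal sketch of how steps 3 and 4 could look with pandas. The column names ("datetime", "text"), the toy tweets and the toy prices are only illustrative and are not taken from the article:

    import pandas as pd

    # Toy cleaned tweets with an (assumed) timestamp column
    tweets = pd.DataFrame({
        "datetime": ["2022-02-23 10:05", "2022-02-23 10:40", "2022-02-23 11:10"],
        "text": ["i like bitcoin", "nice portfolio", "bitcoin going down"],
    })
    tweets["hour"] = pd.to_datetime(tweets["datetime"]).dt.floor("h")

    # Step 3: one row per 1-hour interval, holding the list of tweets of that hour
    train_df = tweets.groupby("hour")["text"].apply(list).to_frame("Tweet")

    # Step 4: hourly Bitcoin price; target = 1 if the price goes up over the next hour
    btc_price = pd.Series([38000.0, 37800.0],
                          index=pd.to_datetime(["2022-02-23 10:00", "2022-02-23 11:00"]))
    train_df["target"] = (btc_price.shift(-1) > btc_price).astype(int).reindex(train_df.index)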

What I need to do next is to train my Bernoulli Naive Bayes model on this data. To do so, the article says to vectorize the tweets as follows.

[images from the article illustrating the tweet vectorization]

This is what I did with the CountVectorizer class from sklearn.
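
Concretely, the vectorization step looks roughly like this (binary=True and max_features=10_000 are my reading of the article's 0/1 features and 10,000-word vocabulary, and vectorizer is the object reused in the training code at the end):

    from sklearn.feature_extraction.text import CountVectorizer

    # Build the vocabulary over all cleaned tweets, with 0/1 (binary) features
    corpus = [tweet for tweet_list in train_df["Tweet"] for tweet in tweet_list]
    vectorizer = CountVectorizer(binary=True, max_features=10_000)
    vectorizer.fit(corpus)

    # One 1-hour interval then becomes a (nb_tweets_in_interval, 10_000) binary matrix
    one_interval = vectorizer.transform(train_df["Tweet"].iloc[0]).toarray()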

2) The issue: the dimensions of the inputs don't match what Naive Bayes expects

But then I encounter an issue when I try to fit the Bernoulli Naive Bayes model following the article's method.


So, one observation is shaped this way:

  • input shape (one observation): (nb_tweets_on_this_1hour_interval, vocabulary_size = 10,000)

    one_observation_input = [
        [0, 1, 0, ....., 0, 0], # Tweet 1 vectorized
        ....,
        [1, 0, ....., 1, 0]     # Tweet N vectorized
    ] # All of the values are 0 or 1

  • output shape (one observation): (1,)

    one_observation_output = [0] # Can only be 0 or 1

When I try to fit my scikit-learn Bernoulli Naive Bayes model with this kind of input, I get this error:

>>> ValueError: Found array with dim 3. Estimator expected <= 2.

Indeed, the model expects a binary input shaped this way:

  • input: (nb_features,)

    ex: [0, 0, 1, 0, ...., 1, 0, 1]

while I am giving it a whole matrix of binary vectors (one row per tweet)!
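
To make the mismatch concrete, here is a small sketch (with made-up shapes) of what the estimator accepts versus what I am passing:

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    rng = np.random.default_rng(0)

    # What BernoulliNB accepts: a 2-D matrix (n_samples, n_features), one row per sample
    X_ok = rng.integers(0, 2, size=(5, 10_000))  # 5 samples, 10,000 binary features
    y_ok = [1, 0, 1, 1, 0]
    BernoulliNB().fit(X_ok, y_ok)  # works

    # What I am passing: one "sample" that is itself a matrix of tweets, so wrapping it
    # in a list produces a 3-D array and triggers the ValueError above
    one_observation = rng.integers(0, 2, size=(37, 10_000))  # e.g. 37 tweets in this hour
    print(np.array([one_observation]).shape)  # (1, 37, 10000)
    # BernoulliNB().partial_fit([one_observation], [1], classes=[0, 1])
    # -> ValueError: Found array with dim 3. Estimator expected <= 2.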

3) What I have tried

So far, I have tried several things to resolve this:

  • Associating the label with every single tweet, but the results are not good since the tweets are really noisy
  • Flattening the inputs so that the shape of one input is (nb_tweets_on_this_1hour_interval * vocabulary_size,). But the model cannot train this way, because the number of tweets per hour is not constant (see the sketch below).
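
Sketched with made-up tweet counts, the problem with flattening is that each hour produces a different number of features:

    import numpy as np

    vocabulary_size = 10_000
    # Concatenating the tweet vectors of one interval gives a feature vector whose
    # length depends on how many tweets that hour contains
    hour_a = np.zeros((37, vocabulary_size), dtype=int).ravel()  # 37 tweets -> 370,000 features
    hour_b = np.zeros((52, vocabulary_size), dtype=int).ravel()  # 52 tweets -> 520,000 features
    print(hour_a.shape, hour_b.shape)  # (370000,) (520000,): inconsistent input sizes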

4) Conclusion

I don't know whether the error comes from my misunderstanding of the article or of Naive Bayes models.

How can I efficiently train a Naive Bayes classifier on a bag of tweets?

Here is my training code:


    from sklearn.naive_bayes import BernoulliNB

    bnb = BernoulliNB()

    # The 2 classes I want to classify the tweets with; partial_fit needs them up front
    uniqueY = [0, 1]

    # I use a for loop with partial_fit to avoid out-of-memory issues
    for _index, row in train_df.iterrows():

        # row["Tweet"] contains all the (cleaned) tweets of one 1-hour interval, e.g.:
        # ["I like Bitcoin", "Nice portfolio", ...., "I am the last tweet of the interval"]
        X_train = vectorizer.transform(row["Tweet"]).toarray()
        # X_train contains all the row["Tweet"] tweets vectorized with a bag-of-words model,
        # i.e. data like [[0, 1, 0, ....., 0, 0], ...., [1, 0, ....., 1, 0]]

        y_train = row["target"]
        # target is 0 if the market goes down after the tweets and 1 if it goes up

        # This is the line that raises the ValueError described above
        bnb.partial_fit([X_train], [y_train], uniqueY)



from Naive Bayes Classifier for a bag of vectorized sentences
