Wednesday, 23 February 2022

Naive Bayes Classifier for a bag of vectorized sentences

Summary: How do you train a Naive Bayes classifier on a bag of vectorized sentences?

Example here:

X_train[0] = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
y_train[0] = 1

X_train[1] = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]
y_train[1] = 0



1) Context of the project: perform sentiment analysis on batches of tweets for market prediction

I am working on sentiment analysis for stock market prediction. As I am new to these techniques, I tried to replicate the approach from this article: http://cs229.stanford.edu/proj2015/029_report.pdf

But I am facing a big issue with it. Let me explain the main steps of the article that I carried out:

  1. I collected a huge amount of tweets over 4 months (7 million)
  2. I cleaned them (removing stop words, hashtags, mentions, punctuation, etc.)
  3. I grouped them into 1-hour intervals
  4. I created a target that tells whether the price of Bitcoin went down or up over the following hour (0 = down; 1 = up)
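
For reference, here is a minimal sketch of how steps 3 and 4 could look with pandas. The column names ("datetime", "text"), the toy tweets and the toy prices are only illustrative and are not taken from the article:

    import pandas as pd

    # Toy cleaned tweets with an (assumed) timestamp column
    tweets = pd.DataFrame({
        "datetime": ["2022-02-23 10:05", "2022-02-23 10:40", "2022-02-23 11:10"],
        "text": ["i like bitcoin", "nice portfolio", "bitcoin going down"],
    })
    tweets["hour"] = pd.to_datetime(tweets["datetime"]).dt.floor("h")

    # Step 3: one row per 1-hour interval, holding the list of tweets of that hour
    train_df = tweets.groupby("hour")["text"].apply(list).to_frame("Tweet")

    # Step 4: hourly Bitcoin price; target = 1 if the price goes up over the next hour
    btc_price = pd.Series([38000.0, 37800.0],
                          index=pd.to_datetime(["2022-02-23 10:00", "2022-02-23 11:00"]))
    train_df["target"] = (btc_price.shift(-1) > btc_price).astype(int).reindex(train_df.index)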

What I need to do next is to train my Bernoulli Naive Bayes model on this data. To do so, the article says to vectorize the tweets as follows.

[images from the article illustrating the tweet vectorization]

This is what I did with the CountVectorizer class from sklearn.
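
Concretely, the vectorization step looks roughly like this (binary=True and max_features=10_000 are my reading of the article's 0/1 features and 10,000-word vocabulary, and vectorizer is the object reused in the training code at the end):

    from sklearn.feature_extraction.text import CountVectorizer

    # Build the vocabulary over all cleaned tweets, with 0/1 (binary) features
    corpus = [tweet for tweet_list in train_df["Tweet"] for tweet in tweet_list]
    vectorizer = CountVectorizer(binary=True, max_features=10_000)
    vectorizer.fit(corpus)

    # One 1-hour interval then becomes a (nb_tweets_in_interval, 10_000) binary matrix
    one_interval = vectorizer.transform(train_df["Tweet"].iloc[0]).toarray()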

2) The issue: the dimensions of the inputs don't match what Naive Bayes expects

But then I encounter an issue when I try to fit the Bernoulli Naive Bayes model following the article's method.


So, one observation is shaped this way:

  • input shape (one observation): (nb_tweets_on_this_1hour_interval, vocabulary_size = 10,000)

    one_observation_input = [
        [0, 1, 0, ....., 0, 0], # Tweet 1 vectorized
        ....,
        [1, 0, ....., 1, 0]     # Tweet N vectorized
    ] # All of the values are 0 or 1

  • output shape (one observation): (1,)

    one_observation_output = [0] # Can only be 0 or 1

When I try to fit my scikit-learn Bernoulli Naive Bayes model with this kind of input, I get this error:

>>> ValueError: Found array with dim 3. Estimator expected <= 2.

Indeed, the model expects a binary input shaped this way:

  • input: (nb_features,)

    ex: [0, 0, 1, 0, ...., 1, 0, 1]

while I am giving it a whole matrix of binary vectors (one row per tweet)!
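
To make the mismatch concrete, here is a small sketch (with made-up shapes) of what the estimator accepts versus what I am passing:

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    rng = np.random.default_rng(0)

    # What BernoulliNB accepts: a 2-D matrix (n_samples, n_features), one row per sample
    X_ok = rng.integers(0, 2, size=(5, 10_000))  # 5 samples, 10,000 binary features
    y_ok = [1, 0, 1, 1, 0]
    BernoulliNB().fit(X_ok, y_ok)  # works

    # What I am passing: one "sample" that is itself a matrix of tweets, so wrapping it
    # in a list produces a 3-D array and triggers the ValueError above
    one_observation = rng.integers(0, 2, size=(37, 10_000))  # e.g. 37 tweets in this hour
    print(np.array([one_observation]).shape)  # (1, 37, 10000)
    # BernoulliNB().partial_fit([one_observation], [1], classes=[0, 1])
    # -> ValueError: Found array with dim 3. Estimator expected <= 2.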

3) What I have tried

So far, I have tried several things to resolve this:

  • Associating the label with every single tweet, but the results are not good since the tweets are really noisy
  • Flattening the inputs so that the shape of one input is (nb_tweets_on_this_1hour_interval * vocabulary_size,). But the model cannot train this way, because the number of tweets per hour is not constant (see the sketch below).
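
Sketched with made-up tweet counts, the problem with flattening is that each hour produces a different number of features:

    import numpy as np

    vocabulary_size = 10_000
    # Concatenating the tweet vectors of one interval gives a feature vector whose
    # length depends on how many tweets that hour contains
    hour_a = np.zeros((37, vocabulary_size), dtype=int).ravel()  # 37 tweets -> 370,000 features
    hour_b = np.zeros((52, vocabulary_size), dtype=int).ravel()  # 52 tweets -> 520,000 features
    print(hour_a.shape, hour_b.shape)  # (370000,) (520000,): inconsistent input sizes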

4) Conclusion

I don't know whether the error comes from my misunderstanding of the article or of Naive Bayes models.

How can I efficiently train a Naive Bayes classifier on a bag of tweets?

Here is my training code:


    from sklearn.naive_bayes import BernoulliNB

    bnb = BernoulliNB()

    # The 2 classes I want to classify the tweets with; partial_fit needs them up front
    uniqueY = [0, 1]

    # I use a for loop with partial_fit to avoid out-of-memory issues
    for _index, row in train_df.iterrows():

        # row["Tweet"] contains all the (cleaned) tweets of one 1-hour interval, e.g.:
        # ["I like Bitcoin", "Nice portfolio", ...., "I am the last tweet of the interval"]
        X_train = vectorizer.transform(row["Tweet"]).toarray()
        # X_train contains all the row["Tweet"] tweets vectorized with a bag-of-words model,
        # i.e. data like [[0, 1, 0, ....., 0, 0], ...., [1, 0, ....., 1, 0]]

        y_train = row["target"]
        # target is 0 if the market goes down after the tweets and 1 if it goes up

        # This is the line that raises the ValueError described above
        bnb.partial_fit([X_train], [y_train], uniqueY)



from Naive Bayes Classifier for a bag of vectorized sentences
