Friday 13 November 2020

Difficulties to get the correct posterior value in a Naive Bayes Implementation

For study purposes, I've tried to implement this "lesson" in Python, but without scikit-learn or anything similar.
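For reference, the lesson's classifier is a multinomial Naive Bayes with add-one (Laplace) smoothing; in the standard formulation (assumed here, since the lesson itself isn't reproduced), each label is scored as:

```latex
P(\ell \mid \text{text}) \;\propto\; P(\ell) \prod_{w \in \text{text}}
    \frac{\mathrm{count}(w, \ell) + 1}{N_\ell + |V|}
```

where $N_\ell$ is the total number of words in label $\ell$'s training sentences and $|V|$ is the number of unique words across all training sentences.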

My attempt is the following code:

import pandas, math

training_data = [
        ['A great game','Sports'],
        ['The election was over','Not sports'],
        ['Very clean match','Sports'],
        ['A clean but forgettable game','Sports'],
        ['It was a close election','Not sports']
]

text_to_predict = 'A very close game'
data_frame = pandas.DataFrame(training_data, columns=['data','label'])
data_frame = data_frame.applymap(lambda s:s.lower() if type(s) == str else s)
text_to_predict = text_to_predict.lower()
labels = data_frame.label.unique()
word_frequency = data_frame.data.str.split(expand=True).stack().value_counts()
unique_words_set = set()
unique_words = data_frame.data.str.split().apply(unique_words_set.update)
total_unique_words = len(unique_words_set)

# count word frequencies separately for each label
word_frequency_per_labels = []
for l in labels:
    word_frequency_per_label = data_frame[data_frame.label == l].data.str.split(expand=True).stack().value_counts()
    for w, f in word_frequency_per_label.items():  # .iteritems() was removed in pandas 2.x
        word_frequency_per_labels.append([w,f,l])

word_frequency_per_labels_df = pandas.DataFrame(word_frequency_per_labels, columns=['word','frequency','label'])
laplace_smoothing = 1
results = []
for l in labels:
    p = []
    total_words_in_label = word_frequency_per_labels_df[word_frequency_per_labels_df.label == l].frequency.sum()
    for w in text_to_predict.split():
        # frequency of w under label l, or 0 if the word never appears with that label
        x = (word_frequency_per_labels_df.query('word == @w and label == @l').frequency.to_list()[:1] or [0])[0]
        p.append((x + laplace_smoothing) / (total_words_in_label + total_unique_words))
    results.append([l,math.prod(p)])

print(results)
result = pandas.DataFrame(results, columns=['labels','posterior']).sort_values('posterior',ascending = False).labels.iloc[0]
print(result)

In the blog lesson, the results are:

[image: the blog lesson's computed results]

But my results were:

[['sports', 4.607999999999999e-05], ['not sports', 1.4293831139825827e-05]]
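To check the arithmetic, the `sports` value above can be reproduced by hand. The counts below are tallied manually from the five training sentences (11 words total in the three Sports sentences, 14 unique words across all five); this is a standalone recomputation of the smoothed likelihood product, not the original script:

```python
from math import prod

# word counts over the three Sports sentences (tallied by hand)
sports_counts = {'a': 2, 'great': 1, 'game': 2, 'very': 1,
                 'clean': 2, 'match': 1, 'but': 1, 'forgettable': 1}
total_sports_words = sum(sports_counts.values())   # 11
vocabulary_size = 14                               # unique words in all 5 sentences

# add-one smoothed likelihood of each query word given the Sports label
likelihoods = [(sports_counts.get(w, 0) + 1) / (total_sports_words + vocabulary_size)
               for w in 'a very close game'.split()]
print(prod(likelihoods))  # ~4.608e-05, matching the 'sports' entry above
```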

So, what did I do wrong in my Python implementation, and how can I get the same results?

Thanks in advance



from Difficulties to get the correct posterior value in a Naive Bayes Implementation
