I trained a GaussianNB model from scikit-learn. When I call classifier.predict_proba on new data it only returns 1 or 0. It is supposed to return the model's confidence that each prediction is correct, and I doubt it can be 100% confident on data it has never seen before. I have tested it on several different inputs. I use CountVectorizer and TfidfTransformer for the text encoding.
The encoding:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# Fit vocabulary and idf weights on the training text only
X_train_counts = count_vect.fit_transform(X_train_word)
X_train = tfidf_transformer.fit_transform(X_train_counts).toarray()
print(X_train)

# Reuse the fitted vocabulary/idf on the test text
X_test_counts = count_vect.transform(X_test_word)
X_test = tfidf_transformer.transform(X_test_counts).toarray()
print(X_test)
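For reference, the two-step CountVectorizer + TfidfTransformer encoding above should be equivalent to scikit-learn's TfidfVectorizer with default settings. A minimal sketch on a made-up toy corpus (not my real data):

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)
import numpy as np

# Hypothetical toy corpus for illustration only
docs_train = ["good movie", "bad film", "great good movie"]
docs_test = ["bad movie"]

# Two-step encoding, as in the question
cv = CountVectorizer()
tt = TfidfTransformer()
A_train = tt.fit_transform(cv.fit_transform(docs_train)).toarray()
A_test = tt.transform(cv.transform(docs_test)).toarray()

# One-step equivalent
tv = TfidfVectorizer()
B_train = tv.fit_transform(docs_train).toarray()
B_test = tv.transform(docs_test).toarray()

print(np.allclose(A_train, B_train))  # True
print(np.allclose(A_test, B_test))    # True
```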
The model (I am getting an accuracy of 91%):
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predict Class
y_pred = classifier.predict(X_test)
# Accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
And finally, when I use the predict_proba method:
y_pred = classifier.predict_proba(X_test)
print(y_pred)
I am getting an output like:
[[0. 1.]
[1. 0.]
[0. 1.]
...
[1. 0.]
[1. 0.]
[1. 0.]]
It doesn't make much sense to get 100% confidence on new data. Besides y_test, I have tested it on other inputs and it still returns the same. Any help would be appreciated!
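For comparison, here is a toy sketch (hypothetical corpus, not my real data) fitting both GaussianNB and MultinomialNB on the same TF-IDF features. GaussianNB fits one Gaussian per feature, so with many near-zero TF-IDF dimensions its class log-likelihoods can differ enormously and the normalized probabilities tend to saturate at 0/1; MultinomialNB, which is designed for count/TF-IDF style features, typically gives softer values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Hypothetical toy data for illustration only
train = ["great fun movie", "loved this film", "wonderful acting",
         "terrible boring movie", "awful waste of time", "bad acting"]
labels = [1, 1, 1, 0, 0, 0]
test = ["boring but wonderful film"]

vec = TfidfVectorizer().fit(train)
X_train = vec.transform(train).toarray()
X_test = vec.transform(test).toarray()

gnb = GaussianNB().fit(X_train, labels)
mnb = MultinomialNB().fit(X_train, labels)

print(gnb.predict_proba(X_test))  # often saturated toward 0/1
print(mnb.predict_proba(X_test))  # strictly inside (0, 1) with smoothing
```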
Edit for the comments: the output of .predict_log_proba() is even stranger:
[[ 0.00000000e+00 -6.95947375e+09]
[-4.83948755e+09 0.00000000e+00]
[ 0.00000000e+00 -1.26497690e+10]
...
[ 0.00000000e+00 -6.97191054e+09]
[ 0.00000000e+00 -2.25589894e+09]
[ 0.00000000e+00 -2.93089863e+09]]
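These log-probabilities are actually consistent with the hard 0/1 output: predict_proba is just the exponential of predict_log_proba, and float64 underflows to exactly 0.0 once the exponent drops below roughly -745, so log-probabilities on the order of -1e9 (as above) become hard zeros. A minimal check:

```python
import math

# float64 can still represent exp() of moderately large negative exponents
print(math.exp(-700))            # a tiny but nonzero value

# ...but log-probabilities of magnitude ~1e9, as in the output above,
# underflow to exactly 0.0 when exponentiated
print(math.exp(-6.95947375e9))   # 0.0
```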