I am working on classifying texts based on word occurrences. One of the steps is to estimate the probability of a particular text for each possible class. To do this, I am given NSAMPLES of texts from a vocabulary of NFEATURES words, each labelled with one of NLABELS class labels. From this, I construct a binary occurrence matrix where entry(sample,feature) is 1 iff text "sample" contains the word encoded by "feature".
From the occurrence matrix, we can construct a matrix of conditional probabilities and then smooth this so the probabilities are neither 0.0 or 1.0, using the following code (copied from Coursera notebook):
def laplace_smoothing(labels, binary_data, n_classes):
# Compute the parameter estimates (adjusted fraction of documents in class that contain word)
n_words = binary_data.shape[1]
alpha = 1 # parameters for Laplace smoothing
theta = np.zeros([n_classes, n_words]) # stores parameter values - prob. word given class
for c_k in range(n_classes): # 0, 1, ..., 19
class_mask = (labels == c_k)
N = class_mask.sum() # number of articles in class
theta[c_k, :] = (binary_data[class_mask, :].sum(axis=0) + alpha)/(N + alpha*2)
return theta
To see the problem, here is code to mock up inputs and call for the result:
import tensorflow_probability as tfp
tfd = tfp.distributions
NSAMPLES = 2000 # Size of corpus
NFEATURES = 10000 # Number of words in corpus
NLABELS = 10 # Number of classes
ONE_PROB = 0.02 # Probability that binary_datum will be 1
def mock_binary_data( nsamples, nfeatures, one_prob ):
binary_data = ( np.random.uniform( 0, 1, ( nsamples, nfeatures ) ) < one_prob ).astype( 'int32' )
return binary_data
def mock_labels( nsamples, nlabels ):
labels = np.random.randint( 0, nlabels, nsamples )
return labels
binary_data = mock_binary_data( NSAMPLES, NFEATURES, ONE_PROB )
labels = mock_labels( NSAMPLES, NLABELS )
smoothed_data = laplace_smoothing( labels, binary_data, NLABELS )
bernoulli = tfd.Independent( tfd.Bernoulli( probs = smoothed_data ), reinterpreted_batch_ndims = 1 )
test_random_data = mock_binary_data( 1, NFEATURES, ONE_PROB )[ 0 ]
bernoulli.prob( test_random_data )
When I execute this, I get:
<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>
that is, all the probabilities are zero. Some step here is incorrect, can you please help me find it?
from Why does Tensorflow Bernoulli distribution always return 0?
No comments:
Post a Comment