Tuesday, 29 June 2021

Why does Tensorflow Bernoulli distribution always return 0?

I am working on classifying texts based on word occurrences. One of the steps is to estimate the probability of a particular text for each possible class. To do this, I am given NSAMPLES of texts from a vocabulary of NFEATURES words, each labelled with one of NLABELS class labels. From this, I construct a binary occurrence matrix where entry(sample,feature) is 1 iff text "sample" contains the word encoded by "feature".

From the occurrence matrix, we can construct a matrix of conditional probabilities and then smooth this so the probabilities are neither 0.0 or 1.0, using the following code (copied from Coursera notebook):

def laplace_smoothing(labels, binary_data, n_classes):
    # Compute the parameter estimates (adjusted fraction of documents in class that contain word)
    n_words = binary_data.shape[1]
    alpha = 1 # parameters for Laplace smoothing
    theta = np.zeros([n_classes, n_words]) # stores parameter values - prob. word given class
    for c_k in range(n_classes): # 0, 1, ..., 19
        class_mask = (labels == c_k)
        N = class_mask.sum() # number of articles in class
        theta[c_k, :] = (binary_data[class_mask, :].sum(axis=0) + alpha)/(N + alpha*2)
    return theta

To see the problem, here is code to mock up inputs and call for the result:

import tensorflow_probability as tfp
tfd = tfp.distributions

NSAMPLES = 2000   # Size of corpus
NFEATURES = 10000 # Number of words in corpus
NLABELS = 10      # Number of classes
ONE_PROB = 0.02   # Probability that binary_datum will be 1

def mock_binary_data( nsamples, nfeatures, one_prob ):
    binary_data = ( np.random.uniform( 0, 1, ( nsamples, nfeatures ) ) < one_prob ).astype( 'int32' )
    return binary_data

def mock_labels( nsamples, nlabels ):
    labels = np.random.randint( 0, nlabels, nsamples )
    return labels

binary_data = mock_binary_data( NSAMPLES, NFEATURES, ONE_PROB )
labels = mock_labels( NSAMPLES, NLABELS )
smoothed_data = laplace_smoothing( labels, binary_data, NLABELS )

bernoulli = tfd.Independent( tfd.Bernoulli( probs = smoothed_data ), reinterpreted_batch_ndims = 1 )

test_random_data = mock_binary_data( 1, NFEATURES, ONE_PROB )[ 0 ]
bernoulli.prob( test_random_data )

When I execute this, I get:

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>

that is, all the probabilities are zero. Some step here is incorrect, can you please help me find it?



from Why does Tensorflow Bernoulli distribution always return 0?

No comments:

Post a Comment