Friday, 31 August 2018

LabelPropagation - How to avoid division by zero?

When using LabelPropagation, I often run into this warning (imho it should be an error because it completely fails the propagation):

/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:279: RuntimeWarning: invalid value encountered in true_divide self.label_distributions_ /= normalizer

So after few tries with the RBF kernel, I discovered the paramater gamma has an influence.

EDIT:

The problem comes from these lines:

        if self._variant == 'propagation':
            normalizer = np.sum(
                self.label_distributions_, axis=1)[:, np.newaxis]
            self.label_distributions_ /= normalizer

I don't get how label_distributions_ can be all zeros, especially when its definition is:

self.label_distributions_ = safe_sparse_dot(
graph_matrix, self.label_distributions_)

Gamma has an influence on the graph_matrix (because graph_matrix is the result of _build_graph() that call the kernel function). OK. But still. Something's wrong

OLD POST (before edit)

I remind you how graph weights are computed for the propagation: W = exp(-gamma * D), D the pairwise distance matrix between all points of the dataset.

The problem is: np.exp(x) returns 0.0 if x very small.
Let's imagine we have two points i and j such that dist(i, j) = 10.

>>> np.exp(np.asarray(-10*40, dtype=float)) # gamma = 40 => OKAY
1.9151695967140057e-174
>>> np.exp(np.asarray(-10*120, dtype=float)) # gamma = 120 => NOT OKAY
0.0

In practice, I'm not setting gamma manually but I'm using the method described in this paper (section 2.4).

So, how would one avoid this division by zero to get a proper propagation ?

The only way I can think of is to normalize the dataset in every dimension, but we lose some geometric/topologic property of the dataset (a 2x10 rectangle becoming a 1x1 square for example)

Reproductible example:

In this example, it's worst: even with gamma = 20 it fails.

In [11]: from sklearn.semi_supervised.label_propagation import LabelPropagation

In [12]: import numpy as np

In [13]: X = np.array([[0, 0], [0, 10]])

In [14]: Y = [0, -1]

In [15]: LabelPropagation(kernel='rbf', tol=0.01, gamma=20).fit(X, Y)
/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:279: RuntimeWarning: invalid value encountered in true_divide
  self.label_distributions_ /= normalizer
/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:290: ConvergenceWarning: max_iter=1000 was reached without convergence.
  category=ConvergenceWarning
Out[15]: 
LabelPropagation(alpha=None, gamma=20, kernel='rbf', max_iter=1000, n_jobs=1,
         n_neighbors=7, tol=0.01)

In [16]: LabelPropagation(kernel='rbf', tol=0.01, gamma=2).fit(X, Y)
Out[16]: 
LabelPropagation(alpha=None, gamma=2, kernel='rbf', max_iter=1000, n_jobs=1,
         n_neighbors=7, tol=0.01)

In [17]:

from LabelPropagation - How to avoid division by zero?

Hemant Vishwakarma