So, as the title says, I'm trying to calculate the probability of a value given a list of samples, preferably normalized so the probability is 0<p<1. I found this answer on the topic from about 6 years ago, which seemed promising. To test it, I implemented the example used in the first reply (edited for brevity):
import numpy as np
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
data = np.concatenate((5 + np.random.randn(10, 1),
10 + np.random.randn(30, 1)))
x = np.linspace(0, 16, 1000)[:, np.newaxis]
# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)
# Get probability for range of values
start = 5 # Start of the range
end = 6 # End of the range
probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
However, this approach throws the following error:
Traceback (most recent call last):
File "prob test.py", line 44, in <module>
probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
File "/usr/lib/python3/dist-packages/scipy/integrate/quadpack.py", line 340, in quad
retval = _quad(func, a, b, args, full_output, epsabs, epsrel, limit,
File "/usr/lib/python3/dist-packages/scipy/integrate/quadpack.py", line 448, in _quad
return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
File "prob test.py", line 44, in <lambda>
probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
File "/usr/lib/python3/dist-packages/sklearn/neighbors/_kde.py", line 190, in score_samples
X = check_array(X, order='C', dtype=DTYPE)
File "/usr/lib/python3/dist-packages/sklearn/utils/validation.py", line 545, in check_array
raise ValueError(
ValueError: Expected 2D array, got scalar array instead:
array=5.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I'm not sure how to reshape the distribution when its already inside the lambda function, and, in any case, I'm guessing this is happening because Scikit-Learn has been updated in the 6 years since this answer was written. What's the best way to work around this issue to get the probability value?
Thanks!
from Python--calculate normalized probability of a value given a list of samples
No comments:
Post a Comment