Saturday 21 August 2021

How to check the convergence when fitting a distribution in SciPy

Is there a way to check the convergence when fitting a distribution in SciPy?

My goal is to fit a SciPy distribution (namely the Johnson S_U distribution) to dozens of datasets as part of an automated data-monitoring system. Mostly it works fine, but a few datasets are anomalous and clearly do not follow the Johnson S_U distribution. Fits on these datasets diverge silently, i.e. without any warning or error. By contrast, if I switch to R and fit the same data there, the fit never converges, which is the correct outcome: regardless of the fit settings, the R algorithm refuses to declare convergence.

data: Two datasets are available in Dropbox:

  • data-converging-fit.csv ... a standard dataset where the fit converges nicely

  • data-diverging-fit.csv ... an anomalous dataset where the fit diverges

code to fit the distribution:

import pandas as pd
from scipy import stats

distribution_name = 'johnsonsu'
dist = getattr(stats, distribution_name)

convdata = pd.read_csv('data-converging-fit.csv', index_col='timestamp')
divdata = pd.read_csv('data-diverging-fit.csv', index_col='timestamp')

On the good data, the fitted parameters are all of a reasonable order of magnitude:

a, b, loc, scale = dist.fit(convdata['target'])
a, b, loc, scale

[out]: (0.3154946859186918, 
 2.9938226613743932,
 0.002176043693009398,
 0.045430055488776266)

On the anomalous data, the fitted parameters are unreasonable:

a, b, loc, scale = dist.fit(divdata['target'])
a, b, loc, scale

[out]: (-3424954.6481554992, 
7272004.43156841, 
-71078.33596490842, 
145478.1300979394)

Still, I do not get a single line of warning that the fit failed to converge.
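A possible reason for the silence, offered as a sketch rather than a confirmed diagnosis: the documentation of the fit method says it calls its optimizer (scipy.optimize.fmin by default) with disp=0, which suppresses the optimizer's messages. The fit method also accepts an optimizer keyword, so a wrapper can keep fmin's warning flag. The names fit_with_status and reporting_fmin below are hypothetical helpers, and the sample is synthetic, standing in for the CSV data.

```python
import numpy as np
from scipy import optimize, stats


def fit_with_status(dist, data):
    """Fit `dist` by maximum likelihood and report the optimizer's exit status."""
    status = {}

    def reporting_fmin(func, x0, args=(), disp=0):
        # full_output=True makes fmin return its warning flag as well:
        # 0 = converged, 1 = max function evaluations reached,
        # 2 = max iterations reached.
        xopt, fopt, n_iter, n_calls, warnflag = optimize.fmin(
            func, x0, args=args, disp=disp, full_output=True
        )
        status.update(nnlf=fopt, iterations=n_iter, calls=n_calls,
                      warnflag=warnflag)
        return xopt

    params = dist.fit(data, optimizer=reporting_fmin)
    return params, status


# Synthetic stand-in for the CSV data (assumption, not the original datasets).
rng = np.random.default_rng(0)
sample = stats.johnsonsu.rvs(0.3, 3.0, loc=0.0, scale=0.05, size=300,
                             random_state=rng)

params, status = fit_with_status(stats.johnsonsu, sample)
print(params)
print(status)  # a non-zero warnflag means fmin stopped at a limit
```

A non-zero warnflag would be a signal to treat the returned parameters with suspicion. A zero flag only means that fmin's own tolerances were met, not that the distribution describes the data well.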

From researching similar questions on StackOverflow, I know of the suggestion to bin my data and then use curve_fit. Despite its practicality, that solution is not right in my opinion, since that is not how distributions are fitted: the binning is arbitrary (the number of bins) and it affects the final fit. A more realistic option might be scipy.optimize.minimize with callbacks to track the progress of convergence; still, I am not sure that it will eventually tell me whether the algorithm converged.
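The scipy.optimize.minimize idea could be sketched as follows, minimizing the distribution's negative log-likelihood (the public nnlf method) directly on the unbinned data. The helper name fit_mle_with_minimize, the starting point x0 and the choice of the Nelder-Mead method are assumptions made for illustration, and the sample is again synthetic.

```python
import numpy as np
from scipy import optimize, stats


def fit_mle_with_minimize(dist, data, x0):
    """Minimize dist.nnlf over (shapes..., loc, scale), starting from x0."""
    data = np.asarray(data, dtype=float)
    history = []  # negative log-likelihood recorded at each iteration

    def record(xk):
        history.append(dist.nnlf(xk, data))

    result = optimize.minimize(
        dist.nnlf, x0, args=(data,), method="Nelder-Mead", callback=record
    )
    return result, history


# Synthetic stand-in for the CSV data (assumption, not the original datasets).
rng = np.random.default_rng(1)
sample = stats.johnsonsu.rvs(0.3, 3.0, loc=0.0, scale=0.05, size=300,
                             random_state=rng)

# Hand-picked starting point: a=0, b=1, loc=median, scale=standard deviation.
x0 = [0.0, 1.0, np.median(sample), np.std(sample)]
result, history = fit_mle_with_minimize(stats.johnsonsu, sample, x0)

print(result.success, result.message)
print(result.x)
```

Here result.success and result.message state whether the optimizer met its tolerances before reaching its iteration limits, and history shows how the objective evolved. Neither is a proof that the model is appropriate for the data.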



