Tuesday, 24 August 2021

Empirical CDF function in Python with reasonable NaN behavior

I'm looking to compute the ECDF and am using this statsmodels function:

import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

Looks good at first:

ECDF(np.array([0,1,2,3, 3, 3]))(np.array([0,1,2,3, 3,3]))
array([0.16666667, 0.33333333, 0.5       , 1.        , 1.        ,
       1.        ])

However, nan seems to be treated as infinity:


>>> x = np.array([0,1,2,3, np.nan, np.nan])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5       , 0.66666667, 1.        ,
       1.        ])

Same as:

>>> x = np.array([0,1,2,3, np.inf, np.inf])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5       , 0.66666667, 1.        ,
       1.        ])
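The finite entries in both outputs are k/6 rather than k/4, which suggests the NaNs are being counted in the sample size. The sketch below reproduces those numbers with plain NumPy, assuming the ECDF amounts to "sort, then count values less than or equal to each point". That is an assumption about the approach, not something checked against the statsmodels source. The relevant fact is that NumPy's sort order places NaN after every other value, including inf.

```python
import numpy as np

x = np.array([0, 1, 2, 3, np.nan, np.nan])

# np.sort puts NaN at the end, so NaN behaves like the largest value.
sorted_x = np.sort(x)

# Count of sorted entries <= each query point, divided by the full length.
# The NaNs are included in x.size, so the denominator is 6, not 4.
naive_ecdf = np.searchsorted(sorted_x, x, side="right") / x.size
print(naive_ecdf)
```

Under that assumption the two NaNs inflate the denominator and land at the top of the ordering, which is the same pattern as in the inf example.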

Comparing with R:

> x <- c(0,1,2,3,NA,NA)
> x
[1]  0  1  2  3 NA NA
> ecdf(x)(x)
[1] 0.25 0.50 0.75 1.00   NA   NA

Is there a standard Python ECDF function that is NaN-aware, i.e. one that leaves NaN out of the distribution and returns NaN for NaN inputs, as R does?

Hot-wiring like so does not seem to work:

def ecdf(x):
  return np.where(~np.isfinite(x),
                  np.full_like(x, np.nan),
                  ECDF(x[np.isfinite(x)])(x[np.isfinite(x)]))

ecdf(x)
    ECDF(x[np.isfinite(x)])(x[np.isfinite(x)]))
  File "<__array_function__ internals>", line 6, in where
ValueError: operands could not be broadcast together with shapes (7,) (7,) (4,) 
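The error arises because np.where needs all three arguments to broadcast to a common shape, while the third argument here contains only the finite entries. One possible workaround is to fill an output array through a boolean mask instead. The following is a minimal sketch using only NumPy; `nan_aware_ecdf` is a hypothetical helper name, not a library function, and it masks NaN only (to mirror the R output above) rather than all non-finite values.

```python
import numpy as np

def nan_aware_ecdf(x):
    """Evaluate the ECDF of the non-NaN entries of x at each entry of x.

    Hypothetical helper: NaN inputs are excluded from the distribution
    and map to NaN in the output.
    """
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)  # swap in np.isfinite(x) to also mask +/-inf
    out = np.full(x.shape, np.nan)
    sorted_valid = np.sort(x[valid])
    # side="right" counts the observations <= each value.
    counts = np.searchsorted(sorted_valid, x[valid], side="right")
    out[valid] = counts / sorted_valid.size
    return out

x = np.array([0, 1, 2, 3, np.nan, np.nan])
result = nan_aware_ecdf(x)
print(result)
```

Assigning through `out[valid]` sidesteps the shape mismatch, since both sides of the assignment have one element per valid entry.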

                  

