Sunday, 16 December 2018

Is there an efficient method of checking whether a column has mixed dtypes?

Consider

import numpy as np
import pandas as pd

np.random.seed(0)
s1 = pd.Series([1, 2, 'a', 'b', [1, 2, 3]])
s2 = np.random.randn(len(s1))
s3 = np.random.choice(list('abcd'), len(s1))


df = pd.DataFrame({'A': s1, 'B': s2, 'C': s3})
df
           A         B  C
0          1  1.764052  a
1          2  0.400157  d
2          a  0.978738  c
3          b  2.240893  a
4  [1, 2, 3]  1.867558  a

Column "A" has mixed data types. I would like a really quick way of detecting this. Simply checking whether dtype == object is not enough, because that would flag "C" as a false positive.
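
To make the false positive concrete, here is a minimal sketch of the naive dtype check. The frame is a hand-written stand-in for the one above (the values in "B" are arbitrary), and "C" is forced to object dtype on the assumption that strings are stored as Python objects, which may not be the default in every pandas version.

```python
import pandas as pd

# Stand-in frame: "A" mixes ints, strings and a list; "C" holds only strings.
df = pd.DataFrame({
    "A": [1, 2, "a", "b", [1, 2, 3]],
    "B": [0.1, 0.2, 0.3, 0.4, 0.5],
    # Assumption: force object dtype so the example behaves the same
    # regardless of how the installed pandas stores strings by default.
    "C": pd.Series(list("adcaa"), dtype=object),
})

# The naive check only looks at the column dtype, not at the values inside.
is_object = df.dtypes == object
print(is_object)
```

Both "A" and "C" come back as True here, even though only "A" actually mixes types.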

I can think of doing this with

df.applymap(type).nunique() > 1

A     True
B    False
C    False
dtype: bool

But calling type through applymap is pretty slow, especially for larger frames.

%timeit df.applymap(type).nunique() > 1
3.95 ms ± 88 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Can we do better (perhaps with NumPy)? I can accept "No" if your argument is convincing enough. :-)
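
One candidate that may be worth benchmarking is pandas.api.types.infer_dtype, which inspects the values of a column and returns a descriptive label; labels such as "mixed" and "mixed-integer" indicate heterogeneous contents. The sketch below is only an illustration under stated assumptions: mixed_columns is a hypothetical helper name, the frame is a hand-written stand-in for the one in the question, and whether this is actually faster on a given frame has not been measured here.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def mixed_columns(df):
    """Hypothetical helper: flag columns whose inferred label starts with 'mixed'."""
    # infer_dtype returns a string label per column, e.g. 'floating',
    # 'string', 'mixed' or 'mixed-integer'.
    labels = df.apply(lambda col: infer_dtype(col, skipna=True))
    return labels.str.startswith("mixed")


# Stand-in frame with the same shape of problem as in the question.
df = pd.DataFrame({
    "A": [1, 2, "a", "b", [1, 2, 3]],
    "B": [0.1, 0.2, 0.3, 0.4, 0.5],
    "C": list("adcaa"),
})

flags = mixed_columns(df)
print(flags)
```

Note that this is not an exact replacement for counting distinct Python types: infer_dtype groups values into its own categories, and with skipna=True missing values are ignored, so the two checks can disagree on edge cases.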


