Sunday 14 March 2021

Numpy/Pandas correlate multiple arrays of different length

I can correlate two arrays of different lengths using this method:

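import numpy as np
import pandas as pd
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
df = pd.DataFrame(dict(x=a))

# The shorter array, compared against each same-length window of the longer one
CORR_VALS = np.array(b)

def get_correlation(vals):
    # Pearson r between the current rolling window and CORR_VALS
    return pearsonr(vals, CORR_VALS)[0]

# Roll over column x only, so re-running does not also roll over 'correlation'
df['correlation'] = df['x'].rolling(window=len(CORR_VALS)).apply(get_correlation)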

I get a result like this:

In [1]: df
Out[1]: 

    x  correlation
0  0.0          NaN
1  0.4          NaN
2  0.2          NaN
3  0.4          NaN
4  0.2          NaN
5  0.4     0.527932
6  0.2    -0.159167
7  0.5     0.189482

First of all, the Pearson coefficient should just be the highest number in the correlation column above...
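
One way to read that requirement, assuming "highest number" means the maximum over all rolling-window correlations, is sketched below. The helper name best_window_correlation is hypothetical (it is not part of pandas or SciPy), and the sketch assumes the first argument is the longer array.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def best_window_correlation(long_vals, short_vals):
    """Highest Pearson r between short_vals and any same-length window of long_vals.

    Hypothetical helper: assumes len(long_vals) >= len(short_vals).
    """
    short_arr = np.asarray(short_vals, dtype=float)
    rolling = pd.Series(long_vals, dtype=float).rolling(window=len(short_arr))
    # raw=True hands each complete window to the function as a NumPy array
    corr = rolling.apply(lambda window: pearsonr(window, short_arr)[0], raw=True)
    # Series.max() skips the leading NaN rows produced by incomplete windows
    return corr.max()


a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
best = best_window_correlation(a, b)
print(round(best, 6))
```

With the a and b above, this picks out the 0.527932 row from the table.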

Secondly, how could I do this for multiple sets of data? I would like output like what df.corr() gives, with the index and columns labeled appropriately.

For example, say I have the following datasets:

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
c = [ 0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51]
d = [ 0.4, 0.2, 0.5]

I want a correlation matrix of the Pearson coefficients for these four datasets...
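
A minimal sketch of one possible approach, not a definitive answer: for every pair of datasets, slide the shorter one across the longer one, keep the highest Pearson r, and collect the results in a labeled DataFrame shaped like df.corr() output. The function names best_pearson and sliding_corr_matrix are hypothetical, and the sketch assumes that "best window" is the intended pairing rule, that every array has at least two values, and that the diagonal can be set to 1.0.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def best_pearson(x, y):
    """Highest Pearson r over all alignments of the shorter array within the longer."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    short, long_ = (x, y) if len(x) <= len(y) else (y, x)
    n = len(short)
    scores = [pearsonr(long_[i:i + n], short)[0]
              for i in range(len(long_) - n + 1)]
    # nanmax ignores windows where r is undefined (e.g. a constant window)
    return float(np.nanmax(scores))


def sliding_corr_matrix(datasets):
    """Symmetric, labeled matrix of best_pearson values for a dict of arrays."""
    names = list(datasets)
    # Assumption: each dataset correlates perfectly with itself, so diagonal = 1.0
    matrix = pd.DataFrame(np.eye(len(names)), index=names, columns=names)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = best_pearson(datasets[first], datasets[second])
            matrix.loc[first, second] = r
            matrix.loc[second, first] = r
    return matrix


datasets = {
    "a": [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5],
    "b": [25, 40, 62, 58, 53, 54],
    "c": [0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51],
    "d": [0.4, 0.2, 0.5],
}
matrix = sliding_corr_matrix(datasets)
print(matrix)
```

Taking the maximum over windows biases every entry upward compared with an ordinary correlation matrix, so the values are best read as "best alignment" scores rather than plain Pearson coefficients.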



