I have a large (106x106) correlation matrix in pandas with the following structure:
+---+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+---+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+
| 0 | 1.0 | 0.465539925807 | 0.736955649673 | 0.733077703346 | -0.177380436347 | -0.268022641963 | 0.0642473239514 | -0.0136866435594 | -0.025596700815 | -0.00385065532308 |
| 1 | 0.465539925807 | 1.0 | -0.173472213691 | -0.16898620433 | -0.0460674481563 | 0.0994673318696 | 0.137137216943 | 0.061999118034 | 0.0944808695878 | 0.0229095105328 |
| 2 | 0.736955649673 | -0.173472213691 | 1.0 | 0.996627003263 | -0.172683935315 | -0.33319698831 | -0.0562591684255 | -0.0306820050477 | -0.0657065745626 | -0.0457836647012 |
| 3 | 0.733077703346 | -0.16898620433 | 0.996627003263 | 1.0 | -0.153606414649 | -0.321562257834 | -0.0465540370732 | -0.0224318843281 | -0.0586629098513 | -0.0417237678539 |
| 4 | -0.177380436347 | -0.0460674481563 | -0.172683935315 | -0.153606414649 | 1.0 | 0.0148395123941 | 0.191615549534 | 0.289211355855 | 0.28799868259 | 0.291523969899 |
| 5 | -0.268022641963 | 0.0994673318696 | -0.33319698831 | -0.321562257834 | 0.0148395123941 | 1.0 | 0.205432455075 | 0.445668299971 | 0.454982398693 | 0.427323555674 |
| 6 | 0.0642473239514 | 0.137137216943 | -0.0562591684255 | -0.0465540370732 | 0.191615549534 | 0.205432455075 | 1.0 | 0.674329392219 | 0.727261969241 | 0.67891326835 |
| 7 | -0.0136866435594 | 0.061999118034 | -0.0306820050477 | -0.0224318843281 | 0.289211355855 | 0.445668299971 | 0.674329392219 | 1.0 | 0.980543049288 | 0.939548790275 |
| 8 | -0.025596700815 | 0.0944808695878 | -0.0657065745626 | -0.0586629098513 | 0.28799868259 | 0.454982398693 | 0.727261969241 | 0.980543049288 | 1.0 | 0.930281915882 |
| 9 | -0.00385065532308 | 0.0229095105328 | -0.0457836647012 | -0.0417237678539 | 0.291523969899 | 0.427323555674 | 0.67891326835 | 0.939548790275 | 0.930281915882 | 1.0 |
+---+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+
Truncated here for simplicity.
I calculate the linkage and then plot the dendrogram using the following code:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Average-linkage clustering computed from the correlation matrix
Z = linkage(result_df.corr(), 'average')

fig, axes = plt.subplots(1, 1, figsize=(20, 20))
axes.tick_params(axis='both', which='major', labelsize=15)
dendrogram(Z=Z, labels=result_df_null_cols.columns, leaf_rotation=90., ax=axes, color_threshold=2.)
My question concerns the y-axis. In all the examples I have seen, the y-axis is bounded between 0 and 2, which I have read should be interpreted as (1 - corr): 0 for items that are highly correlated (1 - 1 = 0), and 2 as the cutoff for strongly negatively correlated items (1 - (-1) = 2). In my result, the upper bound is much higher.
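One possible reading of the discrepancy, offered as a sketch and not a confirmed answer: when linkage receives a square 2-D array, SciPy documents that it treats the rows as observation vectors and computes Euclidean distances between them, so the heights are not (1 - corr) values. The example below compares that call with one that passes (1 - corr) in condensed form via squareform. The frame demo_df is a hypothetical random stand-in for result_df, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical stand-in for result_df: 200 rows, 10 numeric columns.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(200, 10)))

corr = demo_df.corr().to_numpy()

# Case 1: pass the square correlation matrix directly, as in the question.
# linkage() treats each row as an observation vector and uses Euclidean
# distances between rows, so the merge heights are not 1 - corr.
Z_rows = linkage(corr, 'average')

# Case 2: convert to the distance 1 - corr and pass it in condensed form.
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)  # remove floating-point noise on the diagonal
Z_corr = linkage(squareform(dist, checks=False), 'average')

# Column 2 of a linkage matrix holds the merge heights (the dendrogram y-axis).
max_height_rows = Z_rows[:, 2].max()
max_height_corr = Z_corr[:, 2].max()
print(max_height_rows, max_height_corr)
```

With average linkage on (1 - corr), every merge height is an average of values in [0, 2], so the y-axis stays within that range. In the first case the heights are Euclidean distances between rows of the matrix, which are not bounded by 2 and can grow with the number of columns.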
I found the following answer but it does not agree with this answer and the referenced lecture notes here.
Anyway - hoping someone can clarify which source is the correct one, and help spread some knowledge on the topic.
from Dendrogram y-axis labeling confusion