I'm drawing dendrograms from scratch using the Z and P outputs of code like the following (see below for a fuller example):
Z = scipy.cluster.hierarchy.linkage(...)
P = scipy.cluster.hierarchy.dendrogram(Z, ..., no_plot=True)
and in order to do what I want, I need to match up a given index in P["icoord"]/P["dcoord"] (which contain the coordinates for drawing each cluster linkage in a plot) with the corresponding index in Z (which contains the information about which data elements are in which cluster), or vice versa. Unfortunately, the positions of clusters in P["icoord"]/P["dcoord"] do not, in general, simply match up with the corresponding positions in Z (see the output of the code below for proof).
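To make the two structures concrete, here is a minimal sketch on four made-up 1-D points (the data and the variable name `points` are illustrative only, not part of my use-case). Each row of Z is [child_a, child_b, distance, size], where ids below n are original observations and id n + i is the cluster formed by row i; each entry of P["icoord"] and P["dcoord"] holds the four x and four y coordinates of one link.

```python
import numpy as np
from scipy.cluster import hierarchy

# Four illustrative 1-D observations (made-up data).
points = np.array([[0.0], [1.0], [5.0], [7.0]])
n = len(points)

Z = hierarchy.linkage(points)  # default method is 'single'
P = hierarchy.dendrogram(Z, no_plot=True)

# Z has one row per merge: n - 1 rows of [child_a, child_b, distance, size].
print(Z.shape)

# P["icoord"] and P["dcoord"] also have one entry per merge, 4 values each.
print(len(P["icoord"]), len(P["dcoord"]))

# Row i of Z creates the cluster with id n + i.
for i, (a, b, dist, size) in enumerate(Z):
    print(f"row {i}: merges ids {int(a)} and {int(b)} at height {dist} "
          f"-> new cluster id {n + i} with {int(size)} points")
```

So both structures have one entry per merge; the difficulty is only that the entries are listed in different orders.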
The Question: what is a way that I could match them up? I need either a function Z_i = f(P_coords_i) or its inverse P_coords_i = g(Z_i) so that I can iterate over one list and easily access the corresponding elements in the other.
The code below generates 26 random points and labels them with the letters of the alphabet. It then prints the letters corresponding to the clusters represented by the rows of Z, followed by the points in P where dcoord is zero (i.e. the leaf nodes), to show that in general they don't match up: for example, the first element of Z corresponds to cluster iu, but the first set of points in P["icoord"]/P["dcoord"] corresponds to drawing the cluster for jy, and that of iu doesn't come until a few elements later.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance
import string
# let's make some random data
np.random.seed(1)
data = np.random.multivariate_normal([0,0],[[5, 0], [0, 1]], 26)
letters = list(string.ascii_lowercase)
X = distance.pdist(data)
# here's the code I need to run for my use-case
Z = hierarchy.linkage(X)
P = hierarchy.dendrogram(Z, labels=letters, no_plot=True)
# let's look at the order of Z
print("Z:")
clusters = letters.copy()
for c1, c2, _, _ in Z:
    clusters.append(clusters[int(c1)]+clusters[int(c2)])
    print(clusters[-1])
# now let's look at the order of P["icoord"] and P["dcoord"]
print("\nP:")
def lookup(y, x):
    # a link end at height 0 is a leaf; leaves sit at x = 5, 15, 25, ...
    return "?" if y else P["ivl"][int((x-5)/10)]
for ((x1,x2,x3,x4),(y1,y2,y3,y4)) in zip(P["icoord"], P["dcoord"]):
    print(lookup(y1, x1)+lookup(y4, x4))
Output:
------Z:
iu
ez
niu
jy
ad
pr
bq
prbq
wniu
gwniu
ezgwniu
hm
ojy
prbqezgwniu
ks
ojyprbqezgwniu
vks
ojyprbqezgwniuvks
lhm
adlhm
fadlhm
cfadlhm
tcfadlhm
ojyprbqezgwniuvkstcfadlhm
xojyprbqezgwniuvkstcfadlhm
------P:
jy
o?
pr
bq
??
ez
iu
n?
w?
g?
??
??
??
ks
v?
??
ad
hm
l?
??
f?
c?
t?
??
x?
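One possible direction, offered as a hedged sketch rather than a confirmed answer: the order of P in the output above is consistent with a post-order walk of the tree encoded by Z (first child, then second child, then the link itself). The helper name `z_rows_in_dendrogram_order` is hypothetical, and the sketch assumes dendrogram() is called with its default count_sort and distance_sort and without truncation; it relies on observed ordering that I have not found documented, so it includes a sanity check against the link heights.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance


def z_rows_in_dendrogram_order(Z):
    """Return Z row indices in left-then-right post-order.

    Assumption: this matches the order of P["icoord"]/P["dcoord"] only for
    dendrogram() with default count_sort/distance_sort and no truncation.
    """
    n = Z.shape[0] + 1            # number of original observations
    order = []
    stack = [(2 * n - 2, False)]  # start at the root cluster id
    while stack:
        node, children_done = stack.pop()
        if node < n:
            continue              # leaves have no link of their own
        row = node - n
        if children_done:
            order.append(row)     # emit the link after both subtrees
        else:
            stack.append((node, True))
            # push the second child first so the first child is walked first
            stack.append((int(Z[row, 1]), False))
            stack.append((int(Z[row, 0]), False))
    return order


# Same kind of random data as in the example above.
np.random.seed(1)
data = np.random.multivariate_normal([0, 0], [[5, 0], [0, 1]], 26)
Z = hierarchy.linkage(distance.pdist(data))
P = hierarchy.dendrogram(Z, no_plot=True)

order = z_rows_in_dendrogram_order(Z)                   # P index -> Z row
inverse = {z_i: p_i for p_i, z_i in enumerate(order)}   # Z row -> P index

# Sanity check: the top of each link should sit at that merge's distance.
heights_match = all(
    np.isclose(P["dcoord"][p_i][1], Z[z_i, 2]) for p_i, z_i in enumerate(order)
)
print(heights_match)
```

Under those assumptions, `order[p_i]` would play the role of f and `inverse[z_i]` the role of g. If the sanity check fails (for instance with a non-default sort option), the mapping should not be trusted.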
from Matching up the output of scipy linkage() and dendrogram()