Wednesday, 25 January 2023

After performing t-SNE dimensionality reduction, use k-means and check what features contribute the most in each individual cluster

The following image shows the t-SNE plot. I can share the plot here, but unfortunately I can't show you the labels. There are 4 different labels:

[t-SNE plot: ~1,100 points coloured by four labels]

The plot was created from a data frame called scores, which contains approximately 1,100 patient samples as rows and 25 features as columns. The labels for the plot came from a separate data frame called metadata. The following code generated the plot from these two data frames.

library(Rtsne)    # Rtsne() for the embedding
library(ggplot2)  # ggplot() for plotting
tsneres <- Rtsne(scores, dims = 2, perplexity = 6)
tsneres$Y <- as.data.frame(tsneres$Y)  # Y holds the 2-D coordinates
ggplot(tsneres$Y, aes(x = V1, y = V2, color = metadata$labels)) +
  geom_point()

My mission:

I want to analyze the t-SNE plot and identify which features, i.e. which columns of the scores matrix, are most prevalent in each cluster. Specifically, I want to understand which features are most helpful in distinguishing between the different clusters in the plot. Would it be possible to use an alternative algorithm, such as PCA, that preserves the distances between data points, to accomplish this task? Perhaps it would even be a better choice than t-SNE? A sketch of the PCA idea follows below.
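To make the question concrete, here is a minimal sketch of what I imagine the PCA route would look like, in Python since I ended up there anyway (see the edit below). It assumes scores is a pandas DataFrame, and the variable names are mine, not from the real analysis. Unlike t-SNE, PCA keeps a linear map back to the original features, so its loadings show how much each feature contributes to each axis of the plot:

from sklearn.decomposition import PCA
import pandas as pd

# fit a 2-component PCA; components_ holds the loadings
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scores)

# rows are components, columns are the original features; large absolute
# loadings mark the features that drive that axis of the plot
loadings = pd.DataFrame(pca.components_,
                        columns=scores.columns, index=['PC1', 'PC2'])
print(loadings.abs().idxmax(axis=1))  # strongest feature per component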

This is an example of scores; it's not the real data, but it's similar:

structure(list(Feature1 = c(0.1, 0.3, -0.2, -0.12, 0.17, -0.4, 
-0.21, -0.19, -0.69, 0.69), Feature2 = c(0.22, 0.42, 0.1, -0.83, 
0.75, -0.34, -0.25, -0.78, -0.68, 0.55), Feature3 = c(0.73, -0.2, 
0.8, -0.48, 0.56, -0.21, -0.26, -0.78, -0.67, 0.4), Feature4 = c(0.34, 
0.5, 0.9, -0.27, 0.64, -0.11, -0.41, -0.82, -0.4, -0.23), Feature5 = c(0.45, 
0.33, 0.9, 0.73, 0.65, -0.1, -0.28, -0.78, -0.633, 0.32)), class = "data.frame", row.names = c("Patient_A", 
"Patient_B", "Patient_C", "Patient_D", "Patient_E", "Patient_F", 
"Patient_G", "Patient_H", "Patient_I", "Patient_J"))

EDIT - PYTHON

I got to the same point in Python. I tried PCA at first, but it produced very bad plots, so I reduced the dimensions with t-SNE instead, which gave much better results, and then clustered the data using k-means. My question remains the same; I just no longer mind whether the answer is in R or Python.

This is the new plot:

[t-SNE plot with k-means cluster centers annotated]

And this is the code:

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200)
tsne_result = tsne.fit_transform(scores)

# cluster the embedding with k-means (4 clusters, one per label)
kmeans = KMeans(n_clusters=4)
kmeans.fit(tsne_result)
cluster_centers = kmeans.cluster_centers_

# create a dict to map the labels to colors
label_color_dict = {'label1': 'blue', 'label2': 'red', 'label3': 'yellow', 'label4': 'green'}

# create a list of colors based on the 'labels' column in metadata
colors = [label_color_dict[label] for label in metadata['labels']]

plt.scatter(tsne_result[:, 0], tsne_result[:, 1], c=colors, s=50)
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='o')

# add labels to the cluster centers
for i, center in enumerate(cluster_centers, 1):
    plt.annotate(f"Cluster {i}", (center[0], center[1]),
                 textcoords="offset points",
                 xytext=(0, 10), ha='center', fontsize=20)
plt.show()
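For reference, this is a rough sketch of the kind of per-cluster feature summary I'm after, assuming scores is a pandas DataFrame and kmeans is the fitted model from the code above. Ranking features by how far each cluster's mean deviates from the overall mean is just one possibility I have in mind, not something I've validated:

import pandas as pd

# attach the k-means cluster assignment to the original features
clustered = scores.copy()
clustered['cluster'] = kmeans.labels_

# rank features by how far each cluster's mean deviates from the
# overall mean; large deviations suggest features that characterise it
overall_mean = scores.mean()
for c, group in clustered.groupby('cluster'):
    deviation = (group.drop(columns='cluster').mean() - overall_mean).abs()
    top = deviation.sort_values(ascending=False).head(3)
    print(f"Cluster {c}:", list(top.index))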


