Monday 20 November 2023

Can the results of UMAP for HDBSCAN clustering be made more consistent?

I have a set of ~40K phrases that I'm clustering with HDBSCAN after using UMAP for dimensionality reduction. The steps are:

  1. Generate embeddings using a fine-tuned BERT model
  2. Reduce dimensions with UMAP
  3. Cluster with HDBSCAN
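
Steps 2 and 3 can be sketched roughly as follows. This assumes the `umap-learn` and `hdbscan` Python packages and embeddings already computed as a 2-D array; the function names and default values are illustrative (the defaults are picked from the values listed further down), not the exact code used here.

```python
def count_clusters(labels):
    """Number of distinct clusters, ignoring HDBSCAN's noise label (-1)."""
    return len(set(int(label) for label in labels) - {-1})


def reduce_and_cluster(embeddings, seed=None, n_components=10, n_neighbors=15,
                       min_dist=0.0, min_cluster_size=100):
    """Steps 2 and 3: UMAP reduction followed by HDBSCAN clustering."""
    # Imported here so count_clusters stays usable without these packages.
    import umap
    import hdbscan

    reducer = umap.UMAP(
        n_components=n_components,
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        random_state=seed,  # None leaves UMAP unseeded; an int pins the run
    )
    reduced = reducer.fit_transform(embeddings)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(reduced)  # one label per phrase, -1 = noise
```

Calling `count_clusters(reduce_and_cluster(embeddings, seed=42))` gives the cluster count for one seeded run. According to the UMAP documentation, fixing `random_state` makes a run repeatable but restricts parallelism; it does not by itself explain why unseeded runs land in two different regimes.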

Sometimes HDBSCAN finds 100-200 clusters, which is the desired result, but other times it finds only 2-4. This happens with the same dataset and no change to the parameters of either UMAP or HDBSCAN.

From the UMAP documentation I see that UMAP is a stochastic algorithm, so complete reproducibility should not be expected. But it also says "the variance between runs should ideally be relatively small", which is not the case here. Also, the variance seems to be bimodal -- I either end up with 2-4 clusters or 100+, nothing in between.
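
One way to put numbers on this run-to-run variance is to repeat the pipeline over several seeds and tally the cluster counts. A minimal sketch, where `run_once` is a hypothetical callable that wraps the UMAP and HDBSCAN steps and returns the number of clusters for a given seed, and `collapse_below=10` is an arbitrary cutoff separating the 2-4 outcome from the 100+ outcome:

```python
from collections import Counter


def tally_runs(run_once, seeds, collapse_below=10):
    """Call run_once(seed) for each seed and bucket the cluster counts.

    run_once is a hypothetical callable returning the number of clusters
    found by one UMAP + HDBSCAN run with the given seed.
    """
    counts = {seed: run_once(seed) for seed in seeds}
    buckets = Counter(
        "collapsed" if n < collapse_below else "ok" for n in counts.values()
    )
    return counts, buckets
```

The per-seed counts also show which seeds reproduce the collapsed result, which makes the bad case easier to inspect.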

I've tried different parameter values for both UMAP (n_components: 3, 4, 6, 10; min_dist: 0.0, 0.1, 0.3, 0.5; n_neighbors: 15, 30) and HDBSCAN (min_cluster_size: 50, 100, 200), but with every combination tried so far I still occasionally get the undesired 2-4 clusters.
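
For reference, the values listed above span the following grid if every combination is enumerated. This is only a sketch of the search space; the names `umap_grid`, `hdbscan_grid`, and `expand` are illustrative.

```python
from itertools import product

umap_grid = {
    "n_components": [3, 4, 6, 10],
    "min_dist": [0.0, 0.1, 0.3, 0.5],
    "n_neighbors": [15, 30],
}
hdbscan_grid = {"min_cluster_size": [50, 100, 200]}


def expand(grid):
    """Yield one dict per combination of the values in grid."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


# Every (UMAP kwargs, HDBSCAN kwargs) pair: 4 * 4 * 2 * 3 = 96 configurations.
configs = [(u, h) for u in expand(umap_grid) for h in expand(hdbscan_grid)]
```

Each pair in `configs` can be passed as keyword arguments to `umap.UMAP(**u)` and `hdbscan.HDBSCAN(**h)`.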

Why is UMAP behaving this way, and how can I ensure it yields the desired 100+ clusters rather than 2-4?


