How to set cluster_selection_epsilon when using cosine distances? #627

ma9o · 2024-02-22T14:36:24Z

Hi, I am using HDBSCAN to cluster text embeddings.

As the data is unbalanced in favor of one category of embeddings, I am obtaining too many sub-clusters of that category, which I would like to squash together. I have found that datapoints with a cosine distance <0.7 should belong in the same cluster, and if I understand correctly I should set cluster_selection_epsilon=0.7 to achieve this outcome.
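For reference, a minimal sketch of what "cosine distance" means here, assuming the usual definition of 1 minus cosine similarity (so values range from 0 to 2, which matches the matrix printed further down). The helper name `cosine_distance` is hypothetical and only for illustration.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance, assumed to be 1 - cosine similarity (range 0 to 2)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Three reference cases: same direction, orthogonal, opposite direction.
same = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Under this definition, a threshold of 0.7 sits between "same direction" (0) and "orthogonal" (1), so it is a fairly permissive cutoff.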

This doesn't seem to be working, as all the datapoints end up in the same cluster (is the value too high?).

My current code:

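from cuml.metrics import pairwise_distances
from hdbscan import HDBSCAN
import numpy as np
import cupy as cp
import cuml

# `embeddings` holds the text embeddings computed earlier (not shown).
# Move them to the GPU.
embeddings_gpu = cp.asarray(embeddings)

# Reduce to 100 dimensions with UMAP, using the cosine metric on the input.
umap_model = cuml.UMAP(n_neighbors=15,
                       n_components=100,
                       metric='cosine')
reduced_data_gpu = umap_model.fit_transform(embeddings_gpu)

# Pairwise cosine distances in the reduced (UMAP) space.
cosine_dist = pairwise_distances(reduced_data_gpu, metric='cosine')

clusterer = HDBSCAN(min_cluster_size=5,
                    gen_min_span_tree=True,
                    metric="precomputed",
                    cluster_selection_epsilon=0.7)
# Convert to float64 and copy back to host memory as a NumPy array.
cluster_labels = clusterer.fit_predict(cosine_dist.astype(np.float64).get())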

cluster_labels:

Shape: 9533
array([0, 0, 0, ..., 0, 0, 0])

cosine_dist:

Shape: (9533, 9533)
array([[5.9604645e-07, 1.6956329e-02, 5.4422319e-02, ..., 1.0555809e+00,
        1.1026136e+00, 1.3615031e+00],
       ...,
       [1.3615031e+00, 1.4514638e+00, 1.3940278e+00, ..., 3.1383842e-01,
        7.0653200e-02, 5.9604645e-07]], dtype=float32)
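One way to sanity-check the threshold is to see where 0.7 falls in the distribution of the distances actually passed to HDBSCAN. The sketch below is only an illustration: `offdiag_summary`, its `sample` and `seed` parameters, and the toy matrix are all hypothetical, and it looks at raw pairwise distances only (it does not reproduce anything HDBSCAN computes internally).

```python
import numpy as np

def offdiag_summary(dist, eps, sample=2000, seed=0):
    """Summarize off-diagonal entries of a square distance matrix.

    Hypothetical helper: subsamples up to `sample` points so the check
    stays cheap on large matrices, then reports a few quantiles and the
    fraction of point pairs whose distance is below `eps`.
    """
    dist = np.asarray(dist)
    n = dist.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=min(n, sample), replace=False)
    sub = dist[np.ix_(idx, idx)].astype(np.float64)
    # Upper triangle without the diagonal: each pair counted once.
    vals = sub[np.triu_indices_from(sub, k=1)]
    qs = (0.05, 0.25, 0.5, 0.75, 0.95)
    return {
        "quantiles": dict(zip(qs, np.quantile(vals, qs).tolist())),
        "frac_below_eps": float(np.mean(vals < eps)),
    }

# Toy 3x3 matrix standing in for the real one (values are made up).
toy = np.array([[0.0, 0.1, 0.9],
                [0.1, 0.0, 0.8],
                [0.9, 0.8, 0.0]])
summary = offdiag_summary(toy, eps=0.7)
```

With the matrix above this could be called as `offdiag_summary(cosine_dist.get(), eps=0.7)`, since `cosine_dist` lives on the GPU. Note that `cosine_dist` is computed on the UMAP output rather than on the original embeddings, so a threshold chosen in the original embedding space may not correspond to the same value here.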

Is this the correct use of cluster_selection_epsilon? Thanks
