UMAP and HDBSCAN on a 7.5 Million Row Dataset #1133
HDBSCAN should not really be taking that long, especially the GPU implementation, if you are reducing to a reasonable number of dimensions. So something odd is going on there.
Thank you for your reply! I am performing dimensionality reduction to 8 dimensions using UMAP. I am using an ml.g5.2xlarge instance with the PyTorch 2.0.0 Python 3.10 GPU Optimized image. I first apply a PowerTransformer to preprocess my numerical columns, then apply UMAP for dimensionality reduction, and finally use HDBSCAN for clustering. Below is the code I am using:

```python
import pandas as pd

from cuml.manifold import UMAP as cumlUMAP
from cuml.cluster import HDBSCAN as cumlHDBSCAN
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import silhouette_score


# Apply PowerTransformer to the numerical columns
def transform_data(df):
    # Separate numerical and categorical columns
    numerical_cols = df.select_dtypes(include=['int', 'float']).columns
    categorical_cols = df.select_dtypes(include=['category']).columns

    # Apply PowerTransformer to numerical data
    transformer = PowerTransformer()
    transformed_num = transformer.fit_transform(df[numerical_cols])
    transformed_df = pd.DataFrame(transformed_num, columns=numerical_cols)
    transformed_df[categorical_cols] = df[categorical_cols]
    return transformed_df


# sample_data is the customer purchase-behavior dataframe
transformed_df = transform_data(sample_data)

# Apply UMAP
umap_model = cumlUMAP(n_neighbors=11, min_dist=0.14, n_components=8,
                      metric='euclidean', random_state=42, verbose=6)
embedding = umap_model.fit_transform(transformed_df)

# Apply HDBSCAN
clusterer = cumlHDBSCAN(gen_min_span_tree=True)
labels = clusterer.fit_predict(embedding)

# Calculate silhouette score on the sampled embeddings and labels
silhouette = silhouette_score(embedding, labels, sample_size=300000)
```
Hmm, not sure why HDBSCAN is so slow then. I would not imagine it would take that long. You may want to try using fast_hdbscan instead -- on a machine with a reasonable number of cores it should be faster than that despite being CPU-based. I also wouldn't use silhouette score; it's not the best for these purposes. DBCV would be best, but Davies-Bouldin might also work okay.
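Roughly, a minimal sketch of that combination, assuming `embedding` is the GPU UMAP output from your snippet and that it is (or can be converted to) a NumPy array; the fast_hdbscan parameters here are placeholders, not recommendations:

```python
import numpy as np
from fast_hdbscan import HDBSCAN
from sklearn.metrics import davies_bouldin_score

# Assumes `embedding` (the GPU UMAP output) can be converted to a NumPy array
emb = np.asarray(embedding, dtype=np.float64)

# Cluster the embedding on the CPU with fast_hdbscan (placeholder parameters)
clusterer = HDBSCAN(min_cluster_size=100, min_samples=10)
labels = clusterer.fit_predict(emb)

# Davies-Bouldin on the non-noise points; lower is better
mask = labels != -1
print("Davies-Bouldin:", davies_bouldin_score(emb[mask], labels[mask]))

# DBCV is available via the CPU `hdbscan` package, if installed:
# from hdbscan.validity import validity_index
# print("DBCV:", validity_index(emb, labels))
```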
Thank you for your response! A single iteration of UMAP using the CPU implementation took about 4.5 hours. Would you recommend performing UMAP with the GPU implementation followed by fast_hdbscan, and then using the relative_validity_ (DBCV) metric? Also, do you think Bayesian optimization would be effective for tuning the parameters of UMAP and HDBSCAN together?
I think GPU UMAP and fast_hdbscan should work well together. I'm not sure Bayesian optimization will be any better than a well-planned grid-search strategy, but it depends on how many parameters you are going to try to optimize over.
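For illustration, a planned grid search over a couple of the parameters could be as simple as the sketch below. The grids, `n_components=8`, and the Davies-Bouldin scoring are placeholders drawn from this thread rather than recommendations, and `transformed_df` is assumed to be the preprocessed data from your earlier snippet:

```python
from itertools import product  # not needed with nested loops, shown for clarity

import numpy as np
from cuml.manifold import UMAP as cumlUMAP
from fast_hdbscan import HDBSCAN
from sklearn.metrics import davies_bouldin_score

# Placeholder grids -- adjust to the ranges you actually care about
n_neighbors_grid = [10, 15, 30]
min_cluster_size_grid = [100, 500, 1000]

results = []
for n_neighbors in n_neighbors_grid:
    # Compute the embedding once per UMAP setting, then reuse it for every HDBSCAN setting
    emb = cumlUMAP(n_neighbors=n_neighbors, n_components=8,
                   random_state=42).fit_transform(transformed_df)
    emb = np.asarray(emb, dtype=np.float64)
    for min_cluster_size in min_cluster_size_grid:
        labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(emb)
        mask = labels != -1
        if mask.sum() > 0 and len(set(labels[mask])) > 1:
            score = davies_bouldin_score(emb[mask], labels[mask])
        else:
            score = np.inf  # degenerate clustering (all noise or a single cluster)
        results.append((n_neighbors, min_cluster_size, score))

# Lower Davies-Bouldin is better
for n_neighbors, min_cluster_size, score in sorted(results, key=lambda r: r[2])[:5]:
    print(n_neighbors, min_cluster_size, score)
```

Reusing each embedding across all HDBSCAN settings keeps the number of (expensive) UMAP runs down to the size of the UMAP grid alone.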
Thank you so much again! I'll try GPU UMAP and fast_hdbscan! After checking, I found that fast_hdbscan does not support relative_validity_. Moreover, calculating metrics like DBCV and Davies-Bouldin for a dataset with 7.5 million records could be quite time-consuming. I'm interested in your thoughts on which evaluation metric to use and the best approach. Would it be reasonable to sample 5% or 10% of the data after performing HDBSCAN clustering and calculate the DBCV score on the sampled data?
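Concretely, the sampling approach I have in mind looks roughly like this (a sketch only; `emb` and `labels` stand for the full embedding and cluster labels, and 5% is just the fraction mentioned above):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(42)

# Sample 5% of the points, keeping their cluster labels
n = emb.shape[0]
idx = rng.choice(n, size=int(0.05 * n), replace=False)
emb_s, labels_s = emb[idx], labels[idx]

# Score only the non-noise points in the sample
mask = labels_s != -1
print("Davies-Bouldin (5% sample):", davies_bouldin_score(emb_s[mask], labels_s[mask]))

# DBCV on the same sample via the CPU `hdbscan` package, if installed:
# from hdbscan.validity import validity_index
# print("DBCV (5% sample):", validity_index(np.asarray(emb_s, dtype=np.float64), labels_s))
```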
Hello, I worked on a smaller dataset using GPU UMAP and Fast HDBSCAN, evaluating the results with the Davies-Bouldin metric. After performing Bayesian optimization, I found hyperparameters that resulted in a score of less than 2. However, the issue is that nearly 90% of the data points end up in a single cluster. Despite trying various parameter combinations, this issue persists. Is there a way to address this problem, or should I re-run HDBSCAN on the dominant cluster using the same UMAP embedding? For reference, here’s how the data looks in the 2D UMAP embedding, with the center being very dense. |
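For concreteness, the re-run I'm considering would look roughly like the sketch below, reusing the same UMAP embedding (`emb` and `labels` stand for the current embedding and labels; the min_cluster_size is a placeholder):

```python
import numpy as np
from fast_hdbscan import HDBSCAN

# Find the dominant cluster (ignoring the noise label -1)
unique, counts = np.unique(labels[labels != -1], return_counts=True)
dominant = unique[np.argmax(counts)]
dominant_mask = labels == dominant

# Re-run HDBSCAN on just the dominant cluster's points, in the same UMAP space
sub_labels = HDBSCAN(min_cluster_size=100).fit_predict(emb[dominant_mask])

# Splice the sub-cluster labels back in, offset past the existing label range
new_labels = labels.copy()
offset = labels.max() + 1
new_labels[dominant_mask] = np.where(sub_labels == -1, -1, sub_labels + offset)
```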
If you really need smaller clusters then |
Thank you for your reply! I'm working with around 1 million data points. My objective function constraints include limiting the cluster count to between 5 and 20, with outliers kept under 20%. The best hyperparameters yield a Davies-Bouldin score of 1.4, as shown below. Additionally, I've attached my objective function for reference:

```python
from sklearn.metrics import davies_bouldin_score


# `umap_hdbscan` is a helper defined elsewhere that runs UMAP followed by HDBSCAN
# and returns (embedding, labels); `transformed_df` is the preprocessed data.
def umap_hdbscan_objective(n_neighbors, min_dist, n_components, min_samples,
                           min_cluster_size, outlier_percentage=0.2):
    """
    Objective function for optimizing UMAP and HDBSCAN parameters using the
    Davies-Bouldin Index.

    Returns:
        float: The negative Davies-Bouldin score if clustering is valid (lower
        Davies-Bouldin values indicate better clustering), otherwise a large
        penalty score indicating poor clustering.

    Notes:
        - Clustering is considered failed if:
            - No valid embeddings or labels are produced (returns -1000.0)
            - All points are classified as outliers (returns -1000.1)
            - Fewer than 5 clusters are detected (returns -1000.2)
            - More than 20 clusters are detected (returns -1000.3)
            - The number of outliers exceeds the specified percentage of the
              total data points (returns -1000.4)
    """
    umap_embedding, cluster_labels = umap_hdbscan(
        df=transformed_df,
        n_neighbors=round(n_neighbors),
        min_dist=min_dist,
        n_components=round(n_components),
        # metric=metric,
        min_samples=round(min_samples),
        min_cluster_size=round(min_cluster_size),
        verbose=True
    )

    if umap_embedding is None or cluster_labels is None:
        print('Clustering failed or no valid embeddings/labels')
        return -1000.0  # Large penalty indicating poor clustering

    if max(cluster_labels) == -1:
        print('No clusters found - all points are outliers')
        return -1000.1  # Large penalty indicating poor clustering

    # Check the number of clusters
    num_clusters = len(set(cluster_labels))  # note: includes the noise label (-1) when noise is present
    if num_clusters < 5:
        print(f'Too few clusters detected: {num_clusters}')
        return -1000.2  # Large penalty indicating poor clustering
    if num_clusters > 20:
        print(f'Too many clusters detected: {num_clusters}')
        return -1000.3  # Large penalty indicating poor clustering

    # Check the outlier count
    total_points = len(transformed_df)
    outlier_count = (cluster_labels == -1).sum()
    # Calculate the threshold based on the specified percentage of total data points
    outlier_threshold = outlier_percentage * total_points
    if outlier_count > outlier_threshold:
        print(f'Outliers exceed {outlier_percentage*100}% of the total number of data points: {outlier_count}')
        return -1000.4  # Large penalty indicating poor clustering

    # cluster_size_counts = pd.Series(cluster_labels).value_counts()
    # min_cluster_size = cluster_size_counts.min()
    # max_cluster_size = cluster_size_counts.max()

    db_score = davies_bouldin_score(umap_embedding, cluster_labels)
    print(f"Davies-Bouldin score: {db_score}")
    return -db_score  # The optimizer maximizes, so return the negative score; closer to 0 is better
```
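For context, a minimal sketch of how an objective like this could be wired into a Bayesian optimizer, assuming the `bayesian-optimization` package (the bounds are placeholders, not the ones I actually used):

```python
from bayes_opt import BayesianOptimization

# Placeholder parameter bounds -- not the actual search space used
pbounds = {
    'n_neighbors': (5, 50),
    'min_dist': (0.0, 0.5),
    'n_components': (2, 10),
    'min_samples': (5, 100),
    'min_cluster_size': (100, 5000),
}

optimizer = BayesianOptimization(
    f=umap_hdbscan_objective,  # the optimizer maximizes, hence the negative scores and penalties above
    pbounds=pbounds,
    random_state=42,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)
```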
Hi, I used
I'm looking for suggestions on how to improve the clustering results, or alternative approaches to achieve the desired clustering constraints.
I'm not sure how to achieve all your desiderata here. You can just run K-Means with k=20 and get something that meets your requirements; that may not exactly be a desired result though. It may be the case that the data does not cluster in the way you want it to.
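(For concreteness, that trivial baseline is just something like the following, run on the UMAP embedding; `embedding` is assumed to be the reduced data:)

```python
from sklearn.cluster import KMeans

# Trivially satisfies "between 5 and 20 clusters, no outliers" -- every point gets a cluster
labels = KMeans(n_clusters=20, random_state=42, n_init=10).fit_predict(embedding)
```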
Hello,
I'm working with a very large dataset consisting of 7.5 million rows and 18 columns, which represents customer purchase behavior. I initially used UMAP for dimensionality reduction and attempted to tune the UMAP hyperparameters on a 5% subset of the dataset. However, I found that not all UMAP hyperparameters scale well, so I aim to tune both UMAP and HDBSCAN hyperparameters on the full dataset.
I'm using AWS SageMaker, so memory is not a constraint. Initially, I used the CPU implementation, but a single iteration was very time-consuming. Therefore, I switched to using RAPIDS AI for the GPU implementation of both UMAP and HDBSCAN. This approach ran significantly faster on G5 instances, with UMAP taking 16 minutes and HDBSCAN taking 1 hour and 16 minutes.
Afterwards, I used the Silhouette Score for evaluation (RAPIDS AI doesn't support HDBSCAN's relative validity). However, when attempting to run Bayesian Optimization, I encountered memory errors because the G5 instance has only one GPU with 24 GB of total GPU memory.
Question:
How can I effectively tune hyperparameters for such a large dataset? Is it reasonable to tune the hyperparameters on a 5% subset and then apply those hyperparameters to the entire dataset?
Any ideas or references would be greatly appreciated. Thank you!
@lmcinnes