cluster-optimizer

Installation

You can install this package with pip using the following command:

pip install git+https://github.com/ndgigliotti/cluster-optimizer.git@main

Purpose

This project provides a simple, Scikit-Learn-compatible hyperparameter optimization tool for clustering. It's intended for situations where predicting clusters for new data points is a low priority. Many clustering algorithms in Scikit-Learn are transductive, meaning that they are not designed to be applied to new observations. Even with an inductive clustering algorithm like K-Means, you might have no need to predict clusters for new observations, or prediction might matter less than finding the best clusters in the data you already have.

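Scikit-Learn's GridSearchCV relies on cross-validation, which scores each model on held-out data and therefore assumes an inductive model that can make predictions for new observations. Transductive clustering doesn't fit that assumption, so an alternative tool is necessary.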

`ClusterOptimizer`

The ClusterOptimizer class is a hyperparameter search tool for optimizing clustering algorithms. It simply fits one model per hyperparameter combination and selects the best. It's a spin-off of GridSearchCV, and the implementation is derived from Scikit-Learn. The main differences are that it doesn't use cross-validation and that it's designed to work with special clustering scorers. It's not always necessary to provide a target variable, since clustering metrics such as silhouette, Calinski-Harabasz, and Davies-Bouldin evaluate a clustering without ground-truth labels.

The interface is largely the same as GridSearchCV. One minor difference is that the search results are stored in the results_ attribute, rather than cv_results_.
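
To make the idea concrete, here is a minimal, dependency-free sketch of a search that fits one model per hyperparameter combination and scores it on the same data, with no cross-validation. It is not ClusterOptimizer's actual implementation. The names expand_grid, search_without_cv, GapClusterer, and mean_silhouette_1d are hypothetical, and the toy 1-D clusterer and metric only stand in for a real Scikit-Learn estimator and clustering metric.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per hyperparameter combination in the grid."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def search_without_cv(make_estimator, param_grid, X, score_func):
    """Fit one model per combination on all of X; return (best, all results)."""
    results = []
    for params in expand_grid(param_grid):
        # No train/test split: the model is fit and scored on the same data.
        labels = make_estimator(**params).fit_predict(X)
        results.append({"params": params, "score": score_func(X, labels)})
    best = max(results, key=lambda row: row["score"])
    return best, results


class GapClusterer:
    """Toy 1-D clusterer (hypothetical): start a new cluster wherever two
    neighboring sorted points are more than max_gap apart."""

    def __init__(self, max_gap=1.0):
        self.max_gap = max_gap

    def fit_predict(self, X):
        order = sorted(range(len(X)), key=lambda i: X[i])
        labels = [0] * len(X)
        current = 0
        for prev, idx in zip(order, order[1:]):
            if X[idx] - X[prev] > self.max_gap:
                current += 1
            labels[idx] = current
        self.labels_ = labels
        return labels


def mean_silhouette_1d(X, labels):
    """Mean silhouette coefficient for 1-D data (toy stand-in for a real metric)."""
    clusters = {}
    for x, label in zip(X, labels):
        clusters.setdefault(label, []).append(x)
    if len(clusters) < 2 or len(clusters) == len(X):
        return -1.0  # degenerate clustering; sentinel chosen for this sketch
    total = 0.0
    for x, label in zip(X, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(abs(x - other) for other in own) / (len(own) - 1)
        b = min(
            sum(abs(x - other) for other in members) / len(members)
            for key, members in clusters.items()
            if key != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(X)


X = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4]
best, results = search_without_cv(
    GapClusterer, {"max_gap": [0.1, 1.0, 6.0]}, X, mean_silhouette_1d
)
print(best["params"])
```

In this toy run, the smallest gap puts every point in its own cluster and the largest merges everything into one, so the middle value, which recovers the three visible groups, gets the best score.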

Transductive Clustering Scorers

You can use ClusterOptimizer by passing the string name of a Scikit-Learn clustering metric, e.g. 'silhouette', 'calinski_harabasz', or 'rand_score' (the '_score' suffix is optional). You can also create a special scorer for transductive clustering using scorer.make_scorer on any score function with the signature score_func(labels_true, labels_fit) or score_func(X, labels_fit).
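
As an illustration of the two score function signatures, the sketch below wraps either kind into a callable that follows Scikit-Learn's scorer(estimator, X, y) convention and reads the labels from the fitted estimator's labels_ attribute. It is a guess at the general idea, not the package's scorer.make_scorer. The helper make_transductive_scorer, its ground_truth and greater_is_better parameters, and the toy metrics rand_index and mean_spread are all hypothetical.

```python
from itertools import combinations


def make_transductive_scorer(score_func, ground_truth=False, greater_is_better=True):
    """Wrap a score function into a scorer(estimator, X, y=None) callable.

    Hypothetical helper for illustration, not the package's scorer.make_scorer.
    """
    sign = 1.0 if greater_is_better else -1.0

    def scorer(estimator, X, y=None):
        labels_fit = estimator.labels_  # labels assigned during fitting
        if ground_truth:
            if y is None:
                raise ValueError("This scorer needs ground-truth labels in y.")
            return sign * score_func(y, labels_fit)
        return sign * score_func(X, labels_fit)

    return scorer


def rand_index(labels_true, labels_fit):
    """Fraction of point pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_fit[i] == labels_fit[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def mean_spread(X, labels_fit):
    """Mean absolute deviation of 1-D points from their cluster mean.

    Lower is better, so the scorer below flips its sign.
    """
    groups = {}
    for x, label in zip(X, labels_fit):
        groups.setdefault(label, []).append(x)
    means = {label: sum(xs) / len(xs) for label, xs in groups.items()}
    return sum(abs(x - means[label]) for x, label in zip(X, labels_fit)) / len(X)


class FittedStub:
    """Stands in for an already-fitted clusterer."""

    labels_ = [0, 0, 1, 1]


X = [0.0, 1.0, 10.0, 11.0]
supervised = make_transductive_scorer(rand_index, ground_truth=True)
unsupervised = make_transductive_scorer(mean_spread, greater_is_better=False)

print(supervised(FittedStub(), X, y=[1, 1, 0, 0]))
print(unsupervised(FittedStub(), X))
```

Neither scorer calls predict, which is what makes this approach usable with transductive algorithms. The sign flip lets a search always maximize, even for metrics where lower values are better, such as Davies-Bouldin.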

Recognized Scorer Names

Note that the '_score' suffix is always optional.

'silhouette_score'
'silhouette_score_euclidean'
'silhouette_score_cosine'
'davies_bouldin_score'
'calinski_harabasz_score'
'mutual_info_score'
'normalized_mutual_info_score'
'adjusted_mutual_info_score'
'rand_score'
'adjusted_rand_score'
'completeness_score'
'fowlkes_mallows_score'
'homogeneity_score'
'v_measure_score'
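
One way the optional '_score' suffix could be handled is with an alias table built from the full names. The sketch below is only an assumption about the behavior: resolve_scorer_name is a hypothetical helper, and it assumes that '_score' can be dropped wherever it occurs in a name, so that 'silhouette_cosine' would resolve to 'silhouette_score_cosine'. The package's actual lookup may differ.

```python
RECOGNIZED = (
    "silhouette_score",
    "silhouette_score_euclidean",
    "silhouette_score_cosine",
    "davies_bouldin_score",
    "calinski_harabasz_score",
    "mutual_info_score",
    "normalized_mutual_info_score",
    "adjusted_mutual_info_score",
    "rand_score",
    "adjusted_rand_score",
    "completeness_score",
    "fowlkes_mallows_score",
    "homogeneity_score",
    "v_measure_score",
)

# Map each short form (with '_score' removed) and each full name to the full name.
ALIASES = {name.replace("_score", ""): name for name in RECOGNIZED}
ALIASES.update({name: name for name in RECOGNIZED})


def resolve_scorer_name(name):
    """Return the full scorer name for a full or shortened name (hypothetical)."""
    try:
        return ALIASES[name]
    except KeyError:
        raise ValueError(f"Unrecognized scorer name: {name!r}") from None


print(resolve_scorer_name("silhouette"))
print(resolve_scorer_name("adjusted_rand_score"))
```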

Caveats

Comparing Clustering Algorithms

It's important to consider your dataset and goals before comparing clustering algorithms in a grid search. Just because one algorithm gets a higher score than another does not necessarily make it a better choice. Different clustering algorithms have different benefits, drawbacks, and use cases.

Future Work

Write automated tests.
Develop alternative to BaseSearchCV.
Add multi-metric compatibility.
Remove noise "cluster" and impose noise limit.
Update docstrings taken from Scikit-Learn.
Add more search types (e.g. randomized).

Credits

Most of the credit goes to the developers of Scikit-Learn for the engineering behind the search estimators. It's not very hard to spam a bunch of models with different hyperparameters, but it's hard to do it in a robust way with a friendly interface and wide compatibility.
