This repository implements the clustering algorithm Clustering via Uncoupled REgression (CURE) from Wang et al.'s paper Efficient Clustering for Stretched Mixtures: Landscape and Optimality (NeurIPS 2020). It also includes several experiments benchmarking CURE's performance.
Many traditional clustering algorithms struggle to cluster elliptically distributed data. K-Means, for example, assumes the data is spherically distributed and performs poorly when it is not. CURE addresses this problem: it is a clustering algorithm designed specifically to handle elliptically distributed data well.
CURE seeks to find the weights $\theta$ that minimize the loss function

$$\hat{\theta} = \underset{\theta}{\arg\min} \; \frac{1}{n}\sum_{i=1}^{n} f\!\left(x_i^\top \theta\right) + \left(\bar{x}^\top \theta\right)^2$$

where $\theta$ are the weights, $\hat{\theta}$ are the optimal weights that minimize the above equation, $x_i$ is a sample data point with $d$ features, $X$ is a matrix of $n$ datapoints each with $d$ features, and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the average data point. To get $X$ we prepend a column of ones to the data (so each row is really $(1, x_{i,1}, \dots, x_{i,d})$) to give us an intercept term.
The discriminative function is defined as

$$f(y) = \begin{cases} (y^2 - 1)^2 & |y| \le a \\ \text{cubic spline} & a < |y| < b \\ \text{linear} & |y| \ge b \end{cases}$$

We embed the data into 1D space via the dot product $y_i = x_i^\top \theta$. Then we plug $y_i$ into the discriminative function to separate, i.e. discriminate, the embedded data into two clusters. How does this work? $f(y)$ and $(y^2 - 1)^2$ both have minimums at $y = \pm 1$, so minimizing these equations will map many of our datapoints to $+1$ and many of them to $-1$, resulting in two different clusters. Because $(y^2 - 1)^2$ becomes huge for large values (it is quartic), we construct $f$ to be just like $(y^2 - 1)^2$ except that when $|y|$ is too big we clip its growth with linear functions. More explicitly, $f$ has three parts. When $y$ is small, $|y| \le a$, we minimize $(y^2 - 1)^2$, which has two valleys at $\pm 1$. When $y$ is too big, $|y| \ge b$, we minimize a linear function so our values don't blow up. When $y$ is somewhere in between, $a < |y| < b$, we use a cubic spline to connect the valleys to the linear function.
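To make the three pieces concrete, here is a minimal NumPy/SciPy sketch of such a piecewise function. The values of `a`, `b`, the linear slope, and the exact spline construction are illustrative choices for this sketch, not the repository's actual definition.

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

def make_discriminative(a=1.5, b=3.0, slope=4.0):
    """Piecewise discriminative function: quartic near the origin,
    linear far away, cubic spline in between (illustrative parameters)."""
    q  = lambda y: (y**2 - 1.0)**2          # quartic with valleys at +/-1
    dq = lambda y: 4.0 * y * (y**2 - 1.0)   # its derivative

    # value the linear branch takes at |y| = b (an arbitrary smooth choice)
    v_b = q(a) + slope * (b - a)
    # spline matching the quartic's value/slope at a and the line's at b
    spline = CubicHermiteSpline([a, b], [q(a), v_b], [dq(a), slope])

    def f(y):
        z = np.abs(np.asarray(y, dtype=float))   # f is symmetric in y
        return np.where(z <= a, q(z),
               np.where(z >= b, v_b + slope * (z - b), spline(z)))

    return f
```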
So $\frac{1}{n}\sum_{i=1}^{n} f(x_i^\top \theta)$ embeds the data in 1D space and computes on average how well the data is separated into two clusters located at $y = \pm 1$. The lower this value, the better. However, we can minimize this function by clustering all of the datapoints into a single cluster. To avoid this trivial solution, we use the penalty term $(\bar{x}^\top \theta)^2$. This term encourages the data to be evenly split between the two clusters at $y = \pm 1$, as it has the lowest value when $\bar{x}^\top \theta = 0$, which only occurs when the data is evenly split between both clusters.
Putting it all together, the loss function embeds the data in 1D space and computes on average how well the data is separated into two evenly sized clusters located at $y = \pm 1$. We minimize this score and record the weights, $\hat{\theta}$, that result in this minimization.
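As a sketch, the loss takes only a few lines of NumPy (`cure_loss` is a hypothetical helper name, and `f` is the piecewise function from the previous snippet):

```python
import numpy as np

def cure_loss(theta, X, f):
    """Average discriminative value of the 1D embedding,
    plus a penalty that is zero only when the embedded mean is zero."""
    z = X @ theta                        # embed every datapoint into 1D
    return np.mean(f(z)) + np.mean(z) ** 2
```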
Once we have computed $\hat{\theta}$, the clustering comes into play: $\hat{y}_i = \operatorname{sign}(x_i^\top \hat{\theta})$, where $\hat{y}_i$ is the predicted label of the $i$th sample, for all datapoints $i = 1, \dots, n$. The $\operatorname{sign}$ function puts all positive datapoints into one cluster and all negative datapoints into a different cluster. And these are our clusters! That's it.
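Putting the pieces together, a minimal end-to-end sketch might look like the following, with `scipy.optimize.minimize` standing in for whatever optimizer the repository actually uses, and `X_raw` being any (n, d) data matrix:

```python
import numpy as np
from scipy.optimize import minimize

# prepend a column of ones to the raw data for the intercept term
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

f = make_discriminative()                                 # piecewise f from above
theta0 = np.random.default_rng(0).normal(size=X.shape[1]) # random starting weights
theta_hat = minimize(cure_loss, theta0, args=(X, f)).x    # weights minimizing the loss

labels = np.sign(X @ theta_hat).astype(int)               # +/-1 cluster assignments
```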
I designed CURE so it resembles many of the classifiers in scikit-learn. I did this by creating a `cure` class with an `__init__` and three methods. When you initialize the class, you specify the random seed and the parameters `a` and `b` that define the discriminative function.
When you call `fit`, it finds the weights that minimize the loss function. It relies upon `scipy`'s minimization function.
When you call `predict`, it uses the weights computed in `fit` to predict a label for each datapoint via $\hat{y}_i = \operatorname{sign}(x_i^\top \hat{\theta})$.
When you call `fit_predict`, it calls both `fit` and `predict` on the training data.
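Assuming a constructor along the lines sketched above (the argument names below are illustrative, not necessarily the repository's), usage mirrors scikit-learn:

```python
from cure import cure   # hypothetical import; adjust to the repository's layout

model = cure(random_state=0, a=1.5, b=3.0)   # argument names are illustrative
labels = model.fit_predict(X_raw)            # fit on the data, then label it
# or, as two separate steps:
model.fit(X_raw)
labels = model.predict(X_raw)
```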
For our first experiment, let's use CURE to cluster some elliptically distributed data.
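One way to generate such data is to stretch isotropic Gaussian blobs along one axis; the parameters below are illustrative and may differ from the experiment's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
stretch = np.array([[6.0, 0.0],
                    [0.0, 0.6]])                          # elongate along the x-axis
cluster_a = rng.normal(size=(n, 2)) @ stretch + [0.0,  3.0]
cluster_b = rng.normal(size=(n, 2)) @ stretch + [0.0, -3.0]
X_raw  = np.vstack([cluster_a, cluster_b])                # (2n, 2) elliptical data
y_true = np.array([0] * n + [1] * n)                      # ground-truth cluster labels
```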
| True Clustering | CURE Clustering |
| --- | --- |
From these two plots, CURE appears to predict the correct cluster for nearly every datapoint. This is a great visual check that everything is working.
Next, we look at the adjusted Rand index (ARI), a measure of the similarity between two different data clusterings (here, the true labels $y_i$ and the predicted labels $\hat{y}_i$ for all datapoints $i$) that is adjusted for the chance grouping of elements. This is the clustering analogue of accuracy, with a lower bound of -1 and an upper bound of 1; an ARI of 0 corresponds to the average clustering, i.e. a random guess. Our clustering achieved an adjusted Rand index of 0.98, which is pretty great!
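For reference, scikit-learn computes the ARI directly (using `y_true` and `labels` from the earlier snippets):

```python
from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(y_true, labels)   # invariant to how the clusters are named
print(f"ARI: {ari:.2f}")
```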
And finally we have the misclassification rate, which is a tiny 0.4% (0.004). It seems like we made a few incorrect predictions near the border between the two clusters. Nonetheless, CURE clearly does a great job of clustering this elliptically distributed data.
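One subtlety when computing the misclassification rate is that the ±1 cluster labels carry no preferred orientation, so the rate is taken over the better of the two alignments; a small sketch:

```python
import numpy as np

pred01 = (labels == 1).astype(int)           # map the +/-1 labels onto 0/1
err = np.mean(pred01 != y_true)
misclassification_rate = min(err, 1 - err)   # the clusters have no preferred sign
```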
Experiment 2 compares the performance of CURE and many other clustering algorithms on various datasets.
The clustering algorithms tested are:
- CURE
- KMeans
- Meanshift
- Spectral clustering
- Ward
- Agglomerative Clustering
- DBSCAN
- OPTICS
- BIRCH
- Gaussian Mixture
Then for every dataset and algorithm, I recorded the time it took for the algorithm to run and the Adjusted Rand Index (ARI).
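The measurement loop follows the usual pattern; a hedged sketch, where `datasets` and `algorithms` are hypothetical dictionaries of (data, labels) pairs and unfitted estimators:

```python
import time
from sklearn.metrics import adjusted_rand_score

results = []
for data_name, (X_data, y_data) in datasets.items():      # hypothetical dict of datasets
    for algo_name, algo in algorithms.items():             # hypothetical dict of estimators
        start = time.perf_counter()
        pred = algo.fit_predict(X_data)                     # all algorithms above expose fit_predict
        elapsed = time.perf_counter() - start
        results.append((data_name, algo_name, elapsed,
                        adjusted_rand_score(y_data, pred)))
```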
- CURE had near-constant runtime, regardless of the distribution of the data.
- CURE achieved perfect classification on the two elliptically distributed datasets. In other words, it did exactly what it was supposed to do.
- CURE only performed well on clusters that were linearly separable, i.e. it performed quite poorly on the first two datasets. This suggests that CURE might only work well on data with a linear decision boundary.
- Generally speaking, CURE was on par with the rest of the well-known clustering algorithms. However, Spectral Clustering performed the best overall.