add clustering_quality()
jaksle authored and alyst committed Jan 16, 2024
1 parent 1cade0d commit 92c7e7f
Showing 8 changed files with 649 additions and 8 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,2 +1,5 @@
doc/build
Manifest.toml
*.swp
.vscode
docs/build/
150 changes: 145 additions & 5 deletions docs/source/validate.md
@@ -16,7 +16,6 @@ It shows how similar are the two clusterings on a cluster level.
counts(a::ClusteringResult, b::ClusteringResult)
```


## Rand index

[Rand index](http://en.wikipedia.org/wiki/Rand_index) is a measure of
@@ -28,7 +27,6 @@ even when the original class labels are not used.
randindex
```


## Silhouettes

[Silhouettes](http://en.wikipedia.org/wiki/Silhouette_(clustering)) is
@@ -46,14 +44,156 @@ s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \ \text{where}
from the ``i``-th point to the points in the ``k``-th cluster.

Note that ``s_i \le 1``, and that ``s_i`` is close to ``1`` when the ``i``-th
point lies well within its own cluster. This property allows using the average silhouette value,
`mean(silhouettes(assignments, counts, X))`, as a measure of clustering quality; it is also available via `clustering_quality(...; quality_index = :silhouettes)`.
Higher values indicate better separation of clusters w.r.t. point distances.

```@docs
silhouettes
```

## Clustering quality indices

The indices below are intrinsic clustering evaluation metrics, i.e. they depend only on the data and the clustering itself, not on ground-truth labels. They can be used to compare different clustering algorithms or to choose the number of clusters.



| **index name** | **quality_index** | **type** | **direction** | **cluster centers** |
|:-----------------:|:--------------------:|:----------:|:-------------:|:-------------------:|
| Calinski-Harabasz | `:calinski_harabasz` | hard/fuzzy | up | required |
| Xie-Beni | `:xie_beni` | hard/fuzzy | down | required |
| Davies-Bouldin | `:davies_bouldin` | hard | down | required |
| Dunn | `:dunn` | hard | up | not required |
| silhouettes | `:silhouettes` | hard | up | not required |


```@docs
Clustering.clustering_quality
```

Notation for the index definitions below:
- ``x_1, x_2, \ldots, x_n``: data points,
- ``C_1, C_2, \ldots, C_k``: clusters,
- ``c_j`` and ``c``: cluster centers and global dataset center,
- ``d``: a dissimilarity (distance) function,
- ``w_{ij}``: weights measuring membership of a point ``x_i`` to a cluster ``C_j``,
- ``\alpha``: a fuzziness parameter.

### Calinski-Harabasz index

Option `:calinski_harabasz`. Higher values indicate better quality. It measures the corrected ratio between the global inertia of the cluster centers and the summed internal inertias of the clusters. For hard and fuzzy (soft) clustering it is defined as

```math
\frac{n-k}{k-1}\frac{\sum_{C_j}|C_j|d(c_j,c)}{\sum\limits_{C_j}\sum\limits_{x_i\in C_j} d(x_i,c_j)} \quad \text{and}\quad
\frac{n-k}{k-1} \frac{\sum\limits_{C_j}\left(\sum\limits_{x_i}w_{ij}^\alpha\right) d(c_j,c)}{\sum_{C_j} \sum_{x_i} w_{ij}^\alpha d(x_i,c_j)}
```
respectively.
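
As a rough illustration of the hard variant of the formula above, here is a minimal plain-Julia sketch. It is not the package implementation: the function name `calinski_harabasz_sketch`, the helper `sqeuclidean`, and the choice of squared Euclidean distance for ``d`` are assumptions made for this example, and every cluster is assumed to be non-empty.

```julia
# Squared Euclidean distance; an assumed choice for `d` in this sketch.
sqeuclidean(a, b) = sum(abs2, a .- b)

# Hard Calinski-Harabasz index for data `X` (one point per column) and
# integer cluster labels `assignments` in 1:k (hypothetical helper).
function calinski_harabasz_sketch(X, assignments; d = sqeuclidean)
    n = size(X, 2)
    k = maximum(assignments)
    c = vec(sum(X, dims = 2)) ./ n              # global dataset center
    between = 0.0                               # numerator: inertia of the centers
    within = 0.0                                # denominator: summed cluster inertias
    for j in 1:k
        members = findall(==(j), assignments)
        cj = vec(sum(X[:, members], dims = 2)) ./ length(members)  # cluster center
        between += length(members) * d(cj, c)
        within += sum(d(X[:, i], cj) for i in members)
    end
    return (n - k) / (k - 1) * between / within
end

# Four points forming two tight pairs that are far apart.
X = [0.0 0.0 10.0 10.0;
     0.0 2.0  0.0  2.0]

calinski_harabasz_sketch(X, [1, 1, 2, 2])   # natural grouping: 50.0
calinski_harabasz_sketch(X, [1, 2, 1, 2])   # poor grouping: ≈ 0.08
```

The natural grouping scores far higher than the poor one, in line with the "higher is better" direction of this index.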


### Xie-Beni index
Option `:xie_beni`. Lower values indicate better quality. It measures the ratio between the summed inertia of the clusters and the minimum distance between cluster centers. For hard and fuzzy (soft) clustering it is defined as
```math
\frac{\sum_{C_j}\sum_{x_i\in C_j}d(x_i,c_j)}{n\min\limits_{c_{j_1}\neq c_{j_2}} d(c_{j_1},c_{j_2}) }
\quad \text{and}\quad
\frac{\sum_{C_j}\sum_{x_i} w_{ij}^\alpha d(x_i,c_j)}{n\min\limits_{c_{j_1}\neq c_{j_2}} d(c_{j_1},c_{j_2}) }
```
respectively.
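
The fuzzy formula above can be sketched in a few lines of plain Julia. This is an illustration under stated assumptions, not the package code: `fuzzy_xie_beni_sketch` and `sqeuclidean` are hypothetical names, ``d`` is taken to be the squared Euclidean distance, and the weights are stored as an ``n \times k`` matrix `W` with one row per point.

```julia
# Squared Euclidean distance; an assumed choice for `d` in this sketch.
sqeuclidean(a, b) = sum(abs2, a .- b)

# Fuzzy Xie-Beni index (hypothetical helper). `X` and `centers` hold one
# point/center per column; `W[i, j]` is the membership of point i in cluster j.
function fuzzy_xie_beni_sketch(X, centers, W, alpha; d = sqeuclidean)
    n, k = size(W)
    # numerator: membership-weighted inertia summed over all clusters
    inertia = sum(W[i, j]^alpha * d(X[:, i], centers[:, j]) for i in 1:n, j in 1:k)
    # denominator: n times the smallest distance between two distinct centers
    min_sep = minimum(d(centers[:, j1], centers[:, j2]) for j1 in 1:k for j2 in (j1 + 1):k)
    return inertia / (n * min_sep)
end

X = [0.0 0.0 10.0 10.0;
     0.0 2.0  0.0  2.0]
centers = [0.0 10.0;
           1.0  1.0]

# With 0/1 weights, w^alpha = w, so the fuzzy formula reduces to the hard one.
W_hard = [1.0 0.0; 1.0 0.0; 0.0 1.0; 0.0 1.0]
fuzzy_xie_beni_sketch(X, centers, W_hard, 2)   # 0.01

# Softer memberships spread weight onto the distant center and raise the index.
W_soft = [0.9 0.1; 0.9 0.1; 0.1 0.9; 0.1 0.9]
fuzzy_xie_beni_sketch(X, centers, W_soft, 2)   # ≈ 0.0182
```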

### [Davies-Bouldin index](https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index)
Option `:davies_bouldin`. Lower values indicate better quality. It measures the average cohesion of the clusters, based on the within-cluster scatter ``S(C_j)`` and the distances between cluster centers. It is defined as

```math
\frac{1}{k}\sum_{C_{j_1}}\max_{c_{j_2}\neq c_{j_1}}\frac{S(C_{j_1})+S(C_{j_2})}{d(c_{j_1},c_{j_2})}
```
where
```math
S(C_j) = \frac{1}{|C_j|}\sum_{x_i\in C_j}d(x_i,c_j).
```
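
A minimal sketch of the two formulas above in plain Julia follows. It is only an illustration: the names `davies_bouldin_sketch` and `euclidean` are made up for this example, ``d`` is assumed to be the Euclidean distance, and at least two non-empty clusters are assumed.

```julia
# Euclidean distance; an assumed choice for `d` in this sketch.
euclidean(a, b) = sqrt(sum(abs2, a .- b))

# Davies-Bouldin index (hypothetical helper). `X` and `centers` hold one
# point/center per column; `assignments` holds labels in 1:k.
function davies_bouldin_sketch(X, centers, assignments; d = euclidean)
    k = size(centers, 2)
    # S[j]: mean distance from the members of cluster j to its center
    S = map(1:k) do j
        members = findall(==(j), assignments)
        sum(d(X[:, i], centers[:, j]) for i in members) / length(members)
    end
    # for each cluster, the largest scatter-to-separation ratio against any other cluster
    worst = map(1:k) do j1
        maximum((S[j1] + S[j2]) / d(centers[:, j1], centers[:, j2]) for j2 in 1:k if j2 != j1)
    end
    return sum(worst) / k
end

X = [0.0 0.0 10.0 10.0;
     0.0 2.0  0.0  2.0]

# Natural grouping: compact clusters, distant centers.
davies_bouldin_sketch(X, [0.0 10.0; 1.0 1.0], [1, 1, 2, 2])   # 0.2

# Poor grouping: wide clusters whose centers are close together.
davies_bouldin_sketch(X, [5.0 5.0; 0.0 2.0], [1, 2, 1, 2])    # 5.0
```
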
### [Dunn index](https://en.wikipedia.org/wiki/Dunn_index)
Option `:dunn`. Higher values indicate better quality. A more computationally demanding index, which can be used when the cluster centers are not known. It measures the ratio between the smallest distance between points of different clusters and the maximum cluster diameter. It is defined as
```math
\frac{\min\limits_{ C_{j_1}\neq C_{j_2}} \mathrm{dist}(C_{j_1},C_{j_2})}{\max\limits_{C_j}\mathrm{diam}(C_j)}
```
where
```math
\mathrm{dist}(C_{j_1},C_{j_2}) = \min\limits_{x_{i_1}\in C_{j_1},x_{i_2}\in C_{j_2}} d(x_{i_1},x_{i_2}),\quad \mathrm{diam}(C_j) = \max\limits_{x_{i_1},x_{i_2}\in C_j} d(x_{i_1},x_{i_2}).
```
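
The definition above translates almost directly into a loop over all pairs of points, which also shows why the index is comparatively expensive (quadratic in the number of points). The sketch below is illustrative only: `dunn_sketch` and `euclidean` are hypothetical names, and the Euclidean distance is an assumed choice for ``d``.

```julia
# Euclidean distance; an assumed choice for `d` in this sketch.
euclidean(a, b) = sqrt(sum(abs2, a .- b))

# Dunn index (hypothetical helper) for data `X` (one point per column)
# and cluster labels `assignments`. No cluster centers are needed.
function dunn_sketch(X, assignments; d = euclidean)
    n = size(X, 2)
    min_between = Inf   # smallest distance between points of different clusters
    max_within = 0.0    # largest distance between points of the same cluster
    for i1 in 1:n, i2 in (i1 + 1):n
        dist = d(X[:, i1], X[:, i2])
        if assignments[i1] == assignments[i2]
            max_within = max(max_within, dist)
        else
            min_between = min(min_between, dist)
        end
    end
    return min_between / max_within
end

X = [0.0 0.0 10.0 10.0;
     0.0 2.0  0.0  2.0]

dunn_sketch(X, [1, 1, 2, 2])   # natural grouping: 5.0
dunn_sketch(X, [1, 2, 1, 2])   # poor grouping: 0.2
```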

### Average silhouette index

Option `:silhouettes`. Higher values indicate better quality. It returns the average over silhouette values in the whole data set. See section [Silhouettes](#silhouettes) for a more detailed description of the method.
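
For intuition, the average silhouette can be sketched in plain Julia from the standard definitions: ``a_i`` is the mean distance from point ``i`` to the other points of its own cluster, and ``b_i`` is the smallest mean distance from point ``i`` to the points of any other cluster. The code below is an illustration, not the package implementation; `mean_silhouette_sketch` and `euclidean` are hypothetical names, the Euclidean distance is an assumption, and assigning a silhouette of zero to points in singleton clusters is a convention chosen for this sketch.

```julia
# Euclidean distance; an assumed choice for `d` in this sketch.
euclidean(a, b) = sqrt(sum(abs2, a .- b))

# Average silhouette (hypothetical helper) for data `X` (one point per
# column) and labels `assignments` in 1:k, with k >= 2.
function mean_silhouette_sketch(X, assignments; d = euclidean)
    n = size(X, 2)
    k = maximum(assignments)
    s = zeros(n)
    for i in 1:n
        # mean distance from point i to each cluster, leaving out i itself
        meandist = map(1:k) do j
            others = [m for m in 1:n if assignments[m] == j && m != i]
            isempty(others) ? Inf : sum(d(X[:, i], X[:, m]) for m in others) / length(others)
        end
        a = meandist[assignments[i]]
        b = minimum(meandist[j] for j in 1:k if j != assignments[i])
        # singleton clusters (a == Inf) get a silhouette of zero in this sketch
        s[i] = isfinite(a) ? (b - a) / max(a, b) : 0.0
    end
    return sum(s) / n
end

X = [0.0 0.0 10.0 10.0;
     0.0 2.0  0.0  2.0]

mean_silhouette_sketch(X, [1, 1, 2, 2])   # natural grouping: ≈ 0.80
mean_silhouette_sketch(X, [1, 2, 1, 2])   # poor grouping: negative
```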


### References
> Olatz Arbelaitz *et al.* (2013). *An extensive comparative study of cluster validity indices*. Pattern Recognition. 46(1): 243-256. [doi:10.1016/j.patcog.2012.07.021](https://doi.org/10.1016/j.patcog.2012.07.021)

> Aybüke Öztürk, Stéphane Lallich, Jérôme Darmont. (2018). *A Visual Quality Index for Fuzzy C-Means*. 14th International Conference on Artificial Intelligence Applications and Innovations (AIAI 2018). 546-555. [doi:10.1007/978-3-319-92007-8_46](https://doi.org/10.1007/978-3-319-92007-8_46)

### Examples

Example data with 3 true clusters:
```@example
using Plots, Clustering
X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
         [9., -5.] .+ 0.4 * randn(2, 5),
         [-4., -9.] .+ 1 * randn(2, 5))
scatter(view(X, 1, :), view(X, 2, :),
    label = "data points",
    xlabel = "x",
    ylabel = "y",
    legend = :right,
)
```

Hard clustering quality for K-means method with 2 to 5 clusters:

```@example
using Plots, Clustering
X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
         [9., -5.] .+ 0.4 * randn(2, 5),
         [-4., -9.] .+ 1 * randn(2, 5))
nclusters = 2:5
clusterings = kmeans.(Ref(X), nclusters)
plot((
    plot(nclusters,
         clustering_quality.(Ref(X), clusterings, quality_index = qidx),
         marker = :circle,
         title = ":$qidx", label = nothing,
    ) for qidx in [:silhouettes, :calinski_harabasz, :xie_beni, :davies_bouldin, :dunn])...,
    layout = (3, 2),
    xaxis = "N clusters",
    plot_title = "\"Hard\" clustering quality indices"
)
```

Fuzzy clustering quality for fuzzy C-means method with 2 to 5 clusters:
```@example
using Plots, Clustering
X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
         [9., -5.] .+ 0.4 * randn(2, 5),
         [-4., -9.] .+ 1 * randn(2, 5))
fuzziness = 2
fuzzy_nclusters = 2:5
fuzzy_clusterings = fuzzy_cmeans.(Ref(X), fuzzy_nclusters, fuzziness)
plot((
    plot(fuzzy_nclusters,
         clustering_quality.(Ref(X), fuzzy_clusterings,
                             fuzziness = fuzziness, quality_index = qidx),
         marker = :circle,
         title = ":$qidx", label = nothing,
    ) for qidx in [:calinski_harabasz, :xie_beni])...,
    layout = (2, 1),
    xaxis = "N clusters",
    plot_title = "\"Soft\" clustering quality indices"
)
```

## Variation of Information

@@ -64,7 +204,7 @@ information*, but it is a true metric, *i.e.* it is symmetric and satisfies
the triangle inequality.

```@docs
Clustering.varinfo
```


55 changes: 55 additions & 0 deletions examples/clustering_quality.jl
@@ -0,0 +1,55 @@
using Plots, Clustering

## test data with 3 clusters
X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
         [9., -5.] .+ 0.4 * randn(2, 5),
         [-4., -9.] .+ 1 * randn(2, 5))

## visualisation of the exemplary data
scatter(X[1,:], X[2,:],
    label = "data points",
    xlabel = "x",
    ylabel = "y",
    legend = :right,
)

nclusters = 2:5

## hard clustering quality
clusterings = kmeans.(Ref(X), nclusters)
hard_indices = [:silhouettes, :calinski_harabasz, :xie_beni, :davies_bouldin, :dunn]

kmeans_quality = Dict(
    qidx => clustering_quality.(Ref(X), clusterings, quality_index = qidx)
    for qidx in hard_indices)

plot((
    plot(nclusters, kmeans_quality[qidx],
         marker = :circle,
         title = qidx,
         label = nothing,
    ) for qidx in hard_indices)...,
    layout = (3, 2),
    xaxis = "N clusters",
    plot_title = "\"Hard\" clustering quality indices"
)

## soft clustering quality
fuzziness = 2
fuzzy_clusterings = fuzzy_cmeans.(Ref(X), nclusters, fuzziness)
soft_indices = [:calinski_harabasz, :xie_beni]

fuzzy_cmeans_quality = Dict(
    qidx => clustering_quality.(Ref(X), fuzzy_clusterings, fuzziness = fuzziness, quality_index = qidx)
    for qidx in soft_indices)

plot((
    plot(nclusters, fuzzy_cmeans_quality[qidx],
         marker = :circle,
         title = qidx,
         label = nothing,
    ) for qidx in soft_indices)...,
    layout = (2, 1),
    xaxis = "N clusters",
    plot_title = "\"Soft\" clustering quality indices"
)
8 changes: 7 additions & 1 deletion src/Clustering.jl
@@ -49,6 +49,9 @@ module Clustering
# silhouette
silhouettes,

# quality indices
clustering_quality,

# varinfo
varinfo,

@@ -70,6 +73,7 @@ module Clustering
# pair confusion matrix
confusion


## source files

include("utils.jl")
@@ -84,13 +88,15 @@ module Clustering

include("counts.jl")
include("cluster_distances.jl")

include("silhouette.jl")
include("clustering_quality.jl")

include("randindex.jl")
include("varinfo.jl")
include("vmeasure.jl")
include("mutualinfo.jl")
include("confusion.jl")

include("hclust.jl")

include("deprecate.jl")