add clustering_quality()

JuliaStats · Jan 16, 2024 · 92c7e7f
1 parent 1cade0d
commit 92c7e7f

Showing 8 changed files with 649 additions and 8 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,5 @@
 doc/build
 Manifest.toml
+*.swp
+.vscode
+docs/build/
diff --git a/docs/source/validate.md b/docs/source/validate.md
@@ -16,7 +16,6 @@ It shows how similar are the two clusterings on a cluster level.
 counts(a::ClusteringResult, b::ClusteringResult)
 ```
 
-
 ## Rand index
 
 [Rand index](http://en.wikipedia.org/wiki/Rand_index) is a measure of
@@ -28,7 +27,6 @@ even when the original class labels are not used.
 randindex
 ```
 
-
 ## Silhouettes
 
 [Silhouettes](http://en.wikipedia.org/wiki/Silhouette_(clustering)) is
@@ -46,14 +44,156 @@ s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \ \text{where}
    from the ``i``-th point to the points in the ``k``-th cluster.
 
 Note that ``s_i \le 1``, and that ``s_i`` is close to ``1`` when the ``i``-th
-point lies well within its own cluster. This property allows using
-`mean(silhouettes(assignments, counts, X))` as a measure of clustering quality.
+point lies well within its own cluster. This property allows using the average silhouette value
+`mean(silhouettes(assignments, counts, X))` as a measure of clustering quality; it is also available via `clustering_quality(...; quality_index = :silhouettes)`.
 Higher values indicate better separation of clusters w.r.t. point distances.
 
 ```@docs
 silhouettes
 ```
 
+## Clustering quality indices
+
+A group of intrinsic clustering evaluation metrics, i.e. metrics that depend only on the data and the clustering itself, not on external reference labels. They can be used to compare different clustering algorithms or to choose the optimal number of clusters.
+
+
+
+|   **index name**  |   **quality_index**  |  **type**  | **direction** | **cluster centers** |
+|:-----------------:|:--------------------:|:----------:|:-------------:|:-------------------:|
+| Calinski-Harabasz | `:calinski_harabasz` | hard/fuzzy |       up      |       required      |
+|      Xie-Beni     |      `:xie_beni`     | hard/fuzzy |      down     |       required      |
+|  Davies-Bouldin   |  `:davies_bouldin`   |    hard    |      down     |       required      |
+|        Dunn       |        `:dunn`       |    hard    |       up      |     not required    |
+|    silhouettes    |    `:silhouettes`    |    hard    |       up      |     not required    |
+
+
+```@docs
+Clustering.clustering_quality
+```
+
+Notation for the index definitions below:
+- ``x_1, x_2, \ldots, x_n``: data points,
+- ``C_1, C_2, \ldots, C_k``: clusters,
+- ``c_j`` and ``c``: cluster centers and global dataset center,
+- ``d``: a distance (dissimilarity) function,
+- ``w_{ij}``: weights measuring membership of a point ``x_i`` to a cluster ``C_j``,
+- ``\alpha``:  a fuzziness parameter.
+
+### Calinski-Harabasz index
+
+Option `:calinski_harabasz`. Higher values indicate better quality. Measures the corrected ratio between the global inertia of the cluster centers and the summed internal inertias of the clusters. For hard and fuzzy (soft) clustering it is defined as
+
+```math
+
+\frac{n-k}{k-1}\frac{\sum_{C_j}|C_j|d(c_j,c)}{\sum\limits_{C_j}\sum\limits_{x_i\in C_j} d(x_i,c_j)} \quad \text{and}\quad
+\frac{n-k}{k-1} \frac{\sum\limits_{C_j}\left(\sum\limits_{x_i}w_{ij}^\alpha\right) d(c_j,c)}{\sum_{C_j} \sum_{x_i} w_{ij}^\alpha d(x_i,c_j)}
+```
+respectively.
+
+
+### Xie-Beni index
+Option `:xie_beni`. Lower values indicate better quality. Measures the ratio between the summed inertia of the clusters and the minimum distance between cluster centres. For hard and fuzzy (soft) clustering it is defined as
+```math
+\frac{\sum_{C_j}\sum_{x_i\in C_j}d(x_i,c_j)}{n\min\limits_{c_{j_1}\neq c_{j_2}} d(c_{j_1},c_{j_2}) }
+\quad \text{and}\quad
+\frac{\sum_{C_j}\sum_{x_i} w_{ij}^\alpha d(x_i,c_j)}{n\min\limits_{c_{j_1}\neq c_{j_2}} d(c_{j_1},c_{j_2}) }
+```
+respectively.
+
+### [Davies-Bouldin index](https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index)
+Option `:davies_bouldin`. Lower values indicate better quality. It measures average cohesion based on the cluster diameters and distances between cluster centers. It is defined as
+
+```math
+\frac{1}{k}\sum_{C_{j_1}}\max_{c_{j_2}\neq c_{j_1}}\frac{S(C_{j_1})+S(C_{j_2})}{d(c_{j_1},c_{j_2})}
+```
+where
+```math
+S(C_j) = \frac{1}{|C_j|}\sum_{x_i\in C_j}d(x_i,c_j).
+```
+### [Dunn index](https://en.wikipedia.org/wiki/Dunn_index)
+Option `:dunn`. Higher values indicate better quality. A more computationally demanding index which can be used when the centres are not known. It measures the ratio of the smallest distance between points of different clusters to the maximum cluster diameter. It is defined as
+```math
+\frac{\min\limits_{ C_{j_1}\neq C_{j_2}} \mathrm{dist}(C_{j_1},C_{j_2})}{\max\limits_{C_j}\mathrm{diam}(C_j)}
+```
+where
+```math
+\mathrm{dist}(C_{j_1},C_{j_2}) = \min\limits_{x_{i_1}\in C_{j_1},x_{i_2}\in C_{j_2}} d(x_{i_1},x_{i_2}),\quad \mathrm{diam}(C_j) = \max\limits_{x_{i_1},x_{i_2}\in C_j} d(x_{i_1},x_{i_2}).
+```
+
+### Average silhouette index
+
+Option `:silhouettes`. Higher values indicate better quality. It returns the average over silhouette values in the whole data set. See section [Silhouettes](#silhouettes) for a more detailed description of the method.
+
+
+### References
+> Olatz Arbelaitz *et al.* (2013). *An extensive comparative study of cluster validity indices*. Pattern Recognition. 46(1): 243-256. [doi:10.1016/j.patcog.2012.07.021](https://doi.org/10.1016/j.patcog.2012.07.021).
+
+> Aybükë Oztürk, Stéphane Lallich, Jérôme Darmont. (2018). *A Visual Quality Index for Fuzzy C-Means*.  14th International Conference on Artificial Intelligence Applications and Innovations (AIAI 2018). 546-555. [doi:10.1007/978-3-319-92007-8_46](https://doi.org/10.1007/978-3-319-92007-8_46).
+
+### Examples
+
+Example data with 3 true clusters:
+```@example
+using Plots, Clustering
+X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
+         [9., -5.] .+ 0.4 * randn(2, 5),
+         [-4., -9.] .+ 1 * randn(2, 5))
+
+
+scatter(view(X, 1, :), view(X, 2, :),
+    label = "data points",
+    xlabel = "x",
+    ylabel = "y",
+    legend = :right,
+)
+```
+
+Hard clustering quality for K-means method with 2 to 5 clusters:
+
+```@example
+using Plots, Clustering
+X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
+         [9., -5.] .+ 0.4 * randn(2, 5),
+         [-4., -9.] .+ 1 * randn(2, 5))
+
+nclusters = 2:5
+clusterings = kmeans.(Ref(X), nclusters)
+
+plot((
+    plot(nclusters,
+         clustering_quality.(Ref(X), clusterings, quality_index = qidx),
+         marker = :circle,
+         title = ":$qidx", label = nothing,
+    ) for qidx in [:silhouettes, :calinski_harabasz, :xie_beni, :davies_bouldin, :dunn])...,
+    layout = (3, 2),
+    xaxis = "N clusters",
+    plot_title = "\"Hard\" clustering quality indices"
+)
+```
+
+Fuzzy clustering quality for fuzzy C-means method with 2 to 5 clusters:
+```@example
+using Plots, Clustering
+X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
+         [9., -5.] .+ 0.4 * randn(2, 5),
+         [-4., -9.] .+ 1 * randn(2, 5))
+
+fuzziness = 2
+fuzzy_nclusters = 2:5
+fuzzy_clusterings = fuzzy_cmeans.(Ref(X), fuzzy_nclusters, fuzziness)
+
+plot((
+    plot(fuzzy_nclusters,
+         clustering_quality.(Ref(X), fuzzy_clusterings,
+                             fuzziness = fuzziness, quality_index = qidx),
+         marker = :circle,
+         title = ":$qidx", label = nothing,
+    ) for qidx in [:calinski_harabasz, :xie_beni])...,
+    layout = (2, 1),
+    xaxis = "N clusters",
+    plot_title = "\"Soft\" clustering quality indices"
+)
+```
 
 ## Variation of Information
 
@@ -64,7 +204,7 @@ information*, but it is a true metric, *i.e.* it is symmetric and satisfies
 the triangle inequality.
 
 ```@docs
-varinfo
+Clustering.varinfo
 ```
 
 

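Note on the index definitions documented in `docs/source/validate.md` above: the hard-clustering formulas can be checked by hand on a tiny data set. The following is a minimal sketch, not the implementation added by this commit. It assumes squared Euclidean distance as ``d`` for Calinski-Harabasz and Xie-Beni, plain Euclidean distance for Davies-Bouldin and Dunn, and cluster centers computed as column means; all function names are hypothetical and do not belong to Clustering.jl.

```julia
using LinearAlgebra: norm
using Statistics: mean

# Assumed choices of d; the documented formulas leave d generic.
sqeuclid(a, b) = sum(abs2, a .- b)
euclid(a, b) = norm(a .- b)

# Centers as column means; assignments[i] is the cluster of column i of X.
cluster_centers(X, assignments, k) =
    hcat((vec(mean(X[:, assignments .== j], dims = 2)) for j in 1:k)...)

function calinski_harabasz(X, assignments, k; d = sqeuclid)
    n = size(X, 2)
    C = cluster_centers(X, assignments, k)
    c = vec(mean(X, dims = 2))   # global dataset center
    outer = sum(count(==(j), assignments) * d(C[:, j], c) for j in 1:k)
    inner = sum(d(X[:, i], C[:, assignments[i]]) for i in 1:n)
    return (n - k) / (k - 1) * outer / inner
end

function xie_beni(X, assignments, k; d = sqeuclid)
    n = size(X, 2)
    C = cluster_centers(X, assignments, k)
    inner = sum(d(X[:, i], C[:, assignments[i]]) for i in 1:n)
    minsep = minimum(d(C[:, a], C[:, b]) for a in 1:k for b in 1:k if a != b)
    return inner / (n * minsep)
end

function davies_bouldin(X, assignments, k; d = euclid)
    C = cluster_centers(X, assignments, k)
    # S[j]: mean distance of the points of cluster j to its center
    S = [mean(d(X[:, i], C[:, j]) for i in findall(==(j), assignments)) for j in 1:k]
    return mean(maximum((S[a] + S[b]) / d(C[:, a], C[:, b]) for b in 1:k if b != a)
                for a in 1:k)
end

function dunn(X, assignments; d = euclid)
    n = size(X, 2)
    # smallest distance between points of different clusters
    mindist = minimum(d(X[:, p], X[:, q]) for p in 1:n for q in 1:n
                      if assignments[p] != assignments[q])
    # largest distance between points of the same cluster
    maxdiam = maximum(d(X[:, p], X[:, q]) for p in 1:n for q in 1:n
                      if assignments[p] == assignments[q])
    return mindist / maxdiam
end

# Toy input: two tight clusters of two points each, 10 units apart.
X = [0.0 0.0 10.0 10.0;
     0.0 2.0  0.0  2.0]
assignments = [1, 1, 2, 2]

calinski_harabasz(X, assignments, 2)  # ≈ 50.0
xie_beni(X, assignments, 2)           # ≈ 0.01
davies_bouldin(X, assignments, 2)     # ≈ 0.2
dunn(X, assignments)                  # ≈ 5.0
```

On this toy input the well-separated clusters give large Calinski-Harabasz and Dunn values and small Xie-Beni and Davies-Bouldin values, matching the "up"/"down" directions in the table.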
diff --git a/examples/clustering_quality.jl b/examples/clustering_quality.jl
@@ -0,0 +1,55 @@
+using Plots, Clustering
+
+## test data with 3 clusters
+X = hcat([4., 5.] .+ 0.4 * randn(2, 10),
+         [9., -5.] .+ 0.4 * randn(2, 5),
+         [-4., -9.] .+ 1 * randn(2, 5))
+
+## visualisation of the example data
+scatter(X[1,:], X[2,:],
+    label = "data points",
+    xlabel = "x",
+    ylabel = "y",
+    legend = :right,
+)
+
+nclusters = 2:5
+
+## hard clustering quality
+clusterings = kmeans.(Ref(X), nclusters)
+hard_indices = [:silhouettes, :calinski_harabasz, :xie_beni, :davies_bouldin, :dunn]
+
+kmeans_quality = Dict(
+    qidx => clustering_quality.(Ref(X), clusterings, quality_index = qidx)
+    for qidx in hard_indices)
+
+plot((
+    plot(nclusters, kmeans_quality[qidx],
+         marker = :circle,
+         title = string(qidx),
+         label = nothing,
+    ) for qidx in hard_indices)...,
+    layout = (3, 2),
+    xaxis = "N clusters",
+    plot_title = "\"Hard\" clustering quality indices"
+)
+
+## soft clustering quality
+fuzziness = 2
+fuzzy_clusterings = fuzzy_cmeans.(Ref(X), nclusters, fuzziness)
+soft_indices = [:calinski_harabasz, :xie_beni]
+
+fuzzy_cmeans_quality = Dict(
+    qidx => clustering_quality.(Ref(X), fuzzy_clusterings, fuzziness = fuzziness, quality_index = qidx)
+    for qidx in soft_indices)
+
+plot((
+    plot(nclusters, fuzzy_cmeans_quality[qidx],
+        marker = :circle,
+        title = string(qidx),
+        label = nothing,
+    ) for qidx in soft_indices)...,
+    layout = (2, 1),
+    xaxis = "N clusters",
+    plot_title = "\"Soft\" clustering quality indices"
+)
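Note on `examples/clustering_quality.jl`: the script plots each index against the number of clusters but leaves reading off the best value to the viewer. A minimal sketch of that step follows, assuming the index directions listed in the documentation table (higher is better for silhouettes, Calinski-Harabasz and Dunn; lower is better for Xie-Beni and Davies-Bouldin). The helper `best_nclusters` and the scores are hypothetical; the numbers are made up for illustration and are not output of the script.

```julia
# Direction of each index, taken from the table in docs/source/validate.md.
const HIGHER_IS_BETTER = Dict(
    :silhouettes => true,
    :calinski_harabasz => true,
    :dunn => true,
    :xie_beni => false,
    :davies_bouldin => false,
)

# Hypothetical helper (not part of Clustering.jl):
# return the cluster count whose score is best for the given index.
function best_nclusters(nclusters, scores, qidx::Symbol)
    pick = HIGHER_IS_BETTER[qidx] ? argmax : argmin
    return nclusters[pick(scores)]
end

# Made-up scores for 2, 3, 4 and 5 clusters, for illustration only.
toy_scores = [0.61, 0.83, 0.70, 0.52]

best_nclusters(2:5, toy_scores, :silhouettes)  # 3 (largest score)
best_nclusters(2:5, toy_scores, :xie_beni)     # 5 (smallest score)
```

With the variables from the script, a call such as `best_nclusters(nclusters, kmeans_quality[qidx], qidx)` would presumably work the same way, since each dictionary entry holds one score per cluster count.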
diff --git a/src/Clustering.jl b/src/Clustering.jl
@@ -49,6 +49,9 @@ module Clustering
     # silhouette
     silhouettes,
 
+    # quality indices
+    clustering_quality,
+
     # varinfo
     varinfo,
 
@@ -70,6 +73,7 @@ module Clustering
     # pair confusion matrix
     confusion
 
+
     ## source files
 
     include("utils.jl")
@@ -84,13 +88,15 @@ module Clustering
 
     include("counts.jl")
     include("cluster_distances.jl")
+
     include("silhouette.jl")
+    include("clustering_quality.jl")
+
     include("randindex.jl")
     include("varinfo.jl")
     include("vmeasure.jl")
     include("mutualinfo.jl")
     include("confusion.jl")
-
     include("hclust.jl")
 
     include("deprecate.jl")