From c39ebb8ab35d758dd67ac3a276d9985401018b67 Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Tue, 19 Dec 2023 05:37:27 +0000 Subject: [PATCH] build based on 0bade2b --- dev/.documenter-siteinfo.json | 2 +- dev/affprop.html | 2 +- dev/algorithms.html | 4 +- dev/dbscan.html | 2 +- dev/fuzzycmeans.html | 42 +- dev/hclust.html | 4 +- dev/index.html | 2 +- dev/init.html | 10 +- ...means-ccfdcd40.svg => kmeans-7d3de746.svg} | 362 +++++++++--------- dev/kmeans.html | 14 +- dev/kmedoids.html | 4 +- dev/mcl.html | 2 +- dev/validate.html | 6 +- 13 files changed, 228 insertions(+), 228 deletions(-) rename dev/{kmeans-ccfdcd40.svg => kmeans-7d3de746.svg} (63%) diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 745da955..0fe7b2ba 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.9.4","generation_timestamp":"2023-12-19T04:33:06","documenter_version":"1.2.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.9.4","generation_timestamp":"2023-12-19T05:37:24","documenter_version":"1.2.1"}} \ No newline at end of file diff --git a/dev/affprop.html b/dev/affprop.html index 7ddeaade..4b48b0ce 100644 --- a/dev/affprop.html +++ b/dev/affprop.html @@ -1,3 +1,3 @@ Affinity Propagation · Clustering.jl

Affinity Propagation

Affinity propagation is a clustering algorithm based on message passing between data points. Similar to K-medoids, it looks at the (dis)similarities in the data, picks one exemplar data point for each cluster, and assigns every point in the data set to the cluster with the closest exemplar.

Clustering.affinitypropFunction
affinityprop(S::AbstractMatrix; [maxiter=200], [tol=1e-6], [damp=0.5],
-             [display=:none]) -> AffinityPropResult

Perform affinity propagation clustering based on a similarity matrix S.

$S_{ij}$ ($i ≠ j$) is the similarity (or the negated distance) between the $i$-th and $j$-th points; $S_{ii}$ defines the availability of the $i$-th point as an exemplar.

Arguments

  • damp::Real: the dampening coefficient, $0 ≤ \mathrm{damp} < 1$. Larger values indicate slower (and probably more stable) update. $\mathrm{damp} = 0$ disables dampening.
  • maxiter, tol, display: see common options

References

Brendan J. Frey and Delbert Dueck. Clustering by Passing Messages Between Data Points. Science, vol 315, pages 972-976, 2007.

source
Clustering.AffinityPropResultType
AffinityPropResult <: ClusteringResult

The output of affinity propagation clustering (affinityprop).

Fields

  • exemplars::Vector{Int}: indices of exemplars (cluster centers)
  • assignments::Vector{Int}: cluster assignments for each data point
  • iterations::Int: number of iterations executed
  • converged::Bool: converged or not
source
+ [display=:none]) -> AffinityPropResult

Perform affinity propagation clustering based on a similarity matrix S.

$S_{ij}$ ($i ≠ j$) is the similarity (or the negated distance) between the $i$-th and $j$-th points; $S_{ii}$ defines the availability of the $i$-th point as an exemplar.
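As a rough illustration of the description above, the following sketch builds a similarity matrix from negated squared Euclidean distances using only the standard library. The toy data and the choice of the median off-diagonal similarity for the diagonal are assumptions made for this example, not something prescribed by the package.

```julia
using Statistics

# toy data (assumed for illustration): four 2-dimensional points stored as columns
X = [0.0 0.1 5.0 5.1;
     0.0 0.2 5.0 4.9]
n = size(X, 2)

# off-diagonal entries: negated squared Euclidean distance between points i and j
S = [-sum(abs2, X[:, i] - X[:, j]) for i in 1:n, j in 1:n]

# assumption: every point gets the same diagonal value,
# here the median of the off-diagonal similarities
pref = median([S[i, j] for i in 1:n for j in 1:n if i != j])
for i in 1:n
    S[i, i] = pref
end
```

A matrix built this way has the shape that affinityprop expects for S; a larger diagonal value makes a point more likely to be picked as an exemplar.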

Arguments

References

Brendan J. Frey and Delbert Dueck. Clustering by Passing Messages Between Data Points. Science, vol 315, pages 972-976, 2007.

source
Clustering.AffinityPropResultType
AffinityPropResult <: ClusteringResult

The output of affinity propagation clustering (affinityprop).

Fields

  • exemplars::Vector{Int}: indices of exemplars (cluster centers)
  • assignments::Vector{Int}: cluster assignments for each data point
  • iterations::Int: number of iterations executed
  • converged::Bool: converged or not
source
diff --git a/dev/algorithms.html b/dev/algorithms.html index 0d0dd805..04a6da34 100644 --- a/dev/algorithms.html +++ b/dev/algorithms.html @@ -1,3 +1,3 @@ -Basics · Clustering.jl

Basics

The package implements a variety of clustering algorithms:

Most of the clustering functions in the package have a similar interface, making it easy to switch between different clustering algorithms.

Inputs

A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:

  • Data matrix $X$ of size $d \times n$, the $i$-th column of $X$ (X[:, i]) is a data point (data sample) in $d$-dimensional space.
  • Distance matrix $D$ of size $n \times n$, where $D_{ij}$ is the distance between the $i$-th and $j$-th points, or the cost of assigning them to the same cluster.

Common Options

Many clustering algorithms are iterative procedures. The functions share the basic options for controlling the iterations:

  • maxiter::Integer: maximum number of iterations.
  • tol::Real: minimal allowed change of the objective during convergence. The algorithm is considered to be converged when the change of objective value between consecutive iterations drops below tol.
  • display::Symbol: the level of information to be displayed. It may take one of the following values:
    • :none: nothing is shown
    • :final: only shows a brief summary when the algorithm ends
    • :iter: shows the progress at each iteration

Results

A clustering function would return an object (typically, an instance of some ClusteringResult subtype) that contains both the resulting clustering (e.g. assignments of points to the clusters) and the information about the clustering algorithm (e.g. the number of iterations and whether it converged).

The following generic methods are supported by any subtype of ClusteringResult:

StatsBase.countsMethod
counts(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster sizes.

counts(R)[k] is the number of points assigned to the $k$-th cluster.

source
Clustering.wcountsMethod
wcounts(R::ClusteringResult) -> Vector{Float64}
-wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings assumes the weight of every data point is 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source
Clustering.assignmentsMethod
assignments(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster indices for each point.

assignments(R)[i] is the index of the cluster to which the $i$-th point is assigned.

source
+Basics · Clustering.jl

Basics

The package implements a variety of clustering algorithms:

Most of the clustering functions in the package have a similar interface, making it easy to switch between different clustering algorithms.

Inputs

A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:

  • Data matrix $X$ of size $d \times n$, the $i$-th column of $X$ (X[:, i]) is a data point (data sample) in $d$-dimensional space.
  • Distance matrix $D$ of size $n \times n$, where $D_{ij}$ is the distance between the $i$-th and $j$-th points, or the cost of assigning them to the same cluster.
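The two input forms are related: a distance matrix can be derived from a data matrix. The sketch below assumes Euclidean distance and a small toy data set; it only uses the LinearAlgebra standard library.

```julia
using LinearAlgebra

# toy data matrix (assumed for illustration): d = 2 dimensions, n = 3 points as columns
X = [0.0 3.0 0.0;
     0.0 4.0 1.0]
n = size(X, 2)

# D[i, j] is the Euclidean distance between the i-th and j-th columns of X
D = [norm(X[:, i] - X[:, j]) for i in 1:n, j in 1:n]
```

The resulting D is symmetric with a zero diagonal, which is what the distance-based algorithms expect.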

Common Options

Many clustering algorithms are iterative procedures. The functions share the basic options for controlling the iterations:

  • maxiter::Integer: maximum number of iterations.
  • tol::Real: minimal allowed change of the objective during convergence. The algorithm is considered to be converged when the change of objective value between consecutive iterations drops below tol.
  • display::Symbol: the level of information to be displayed. It may take one of the following values:
    • :none: nothing is shown
    • :final: only shows a brief summary when the algorithm ends
    • :iter: shows the progress at each iteration
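The interplay of maxiter and tol can be sketched with a generic loop. The helper name iterate_until_converged and the toy objective are hypothetical and not part of Clustering.jl; the sketch only shows the stopping rule described above.

```julia
# hypothetical helper (not part of Clustering.jl): run `step` until the objective
# changes by less than `tol` between consecutive iterations, or `maxiter` is reached
function iterate_until_converged(step, objective, state; maxiter=100, tol=1e-6)
    prev = objective(state)
    for t in 1:maxiter
        state = step(state)
        cur = objective(state)
        if abs(cur - prev) < tol
            return (state=state, iterations=t, converged=true)
        end
        prev = cur
    end
    return (state=state, iterations=maxiter, converged=false)
end

# toy problem: gradient descent on (x - 3)^2 with step size 0.25
res = iterate_until_converged(x -> x - 0.25 * 2 * (x - 3), x -> (x - 3)^2, 0.0)

# the same problem with too few iterations allowed
short = iterate_until_converged(x -> x - 0.25 * 2 * (x - 3), x -> (x - 3)^2, 0.0; maxiter=2)
```

The returned iterations and converged values mirror the kind of information that the result objects of the clustering functions report.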

Results

A clustering function would return an object (typically, an instance of some ClusteringResult subtype) that contains both the resulting clustering (e.g. assignments of points to the clusters) and the information about the clustering algorithm (e.g. the number of iterations and whether it converged).

The following generic methods are supported by any subtype of ClusteringResult:

StatsBase.countsMethod
counts(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster sizes.

counts(R)[k] is the number of points assigned to the $k$-th cluster.

source
Clustering.wcountsMethod
wcounts(R::ClusteringResult) -> Vector{Float64}
+wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings assumes the weight of every data point is 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source
Clustering.assignmentsMethod
assignments(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster indices for each point.

assignments(R)[i] is the index of the cluster to which the $i$-th point is assigned.

source
diff --git a/dev/dbscan.html b/dev/dbscan.html index 2ab94b62..febe39f5 100644 --- a/dev/dbscan.html +++ b/dev/dbscan.html @@ -4,4 +4,4 @@ [min_neighbors=1], [min_cluster_size=1], [nntree_kwargs...]) -> DbscanResult

Cluster points using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

Arguments

Optional keyword arguments to control the algorithm:

Example

points = randn(3, 10000)
 # DBSCAN clustering, clusters with less than 20 points will be discarded:
-clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)

References:

source
Clustering.DbscanResultType
DbscanResult <: ClusteringResult

The output of dbscan function.

Fields

  • clusters::Vector{DbscanCluster}: clusters, length K
  • seeds::Vector{Int}: indices of the first points of each cluster's core, length K
  • counts::Vector{Int}: cluster sizes (number of assigned points), length K
  • assignments::Vector{Int}: vector of cluster indices that each point was assigned to, length N
source
Clustering.DbscanClusterType
DbscanCluster

DBSCAN cluster, part of DbscanResult returned by dbscan function.

Fields

  • size::Int: number of points in a cluster (core + boundary)
  • core_indices::Vector{Int}: indices of points in the cluster core, a.k.a. seeds (have at least min_neighbors neighbors in the cluster)
  • boundary_indices::Vector{Int}: indices of the cluster points outside of core
source
+clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)

References:

source
Clustering.DbscanResultType
DbscanResult <: ClusteringResult

The output of dbscan function.

Fields

  • clusters::Vector{DbscanCluster}: clusters, length K
  • seeds::Vector{Int}: indices of the first points of each cluster's core, length K
  • counts::Vector{Int}: cluster sizes (number of assigned points), length K
  • assignments::Vector{Int}: vector of cluster indices that each point was assigned to, length N
source
Clustering.DbscanClusterType
DbscanCluster

DBSCAN cluster, part of DbscanResult returned by dbscan function.

Fields

  • size::Int: number of points in a cluster (core + boundary)
  • core_indices::Vector{Int}: indices of points in the cluster core, a.k.a. seeds (have at least min_neighbors neighbors in the cluster)
  • boundary_indices::Vector{Int}: indices of the cluster points outside of core
source
diff --git a/dev/fuzzycmeans.html b/dev/fuzzycmeans.html index ada98f55..4632450e 100644 --- a/dev/fuzzycmeans.html +++ b/dev/fuzzycmeans.html @@ -1,8 +1,8 @@ Fuzzy C-means · Clustering.jl

Fuzzy C-means

Fuzzy C-means is a clustering method that provides cluster membership weights instead of "hard" classification (e.g. K-means).

From a mathematical standpoint, fuzzy C-means solves the following optimization problem:

\[\arg\min_\mathcal{C} \ \sum_{i=1}^n \sum_{j=1}^C w_{ij}^\mu \| \mathbf{x}_i - \mathbf{c}_j \|^2, \ \text{where}\ w_{ij} = \left(\sum_{k=1}^{C} \left(\frac{\left\|\mathbf{x}_i - \mathbf{c}_j \right\|}{\left\|\mathbf{x}_i - \mathbf{c}_k \right\|}\right)^{\frac{2}{\mu-1}}\right)^{-1}\]

Here, $\mathbf{c}_j$ is the center of the $j$-th cluster, $w_{ij}$ is the membership weight of the $i$-th point in the $j$-th cluster, and $\mu > 1$ is a user-defined fuzziness parameter.
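The weight formula above can be evaluated directly for fixed centers. The function name membership_weights and the toy data are assumptions for this sketch; it also assumes that no data point coincides exactly with a center, since that would lead to a division by zero.

```julia
using LinearAlgebra

# hypothetical helper: membership weights w[i, j] for fixed cluster centers,
# following the formula above (μ > 1 is the fuzziness parameter)
function membership_weights(X::AbstractMatrix, centers::AbstractMatrix, μ::Real)
    n, C = size(X, 2), size(centers, 2)
    W = zeros(n, C)
    for i in 1:n, j in 1:C
        dij = norm(X[:, i] - centers[:, j])
        # sum over all centers k of (d_ij / d_ik)^(2 / (μ - 1)), then invert
        W[i, j] = 1 / sum((dij / norm(X[:, i] - centers[:, k]))^(2 / (μ - 1)) for k in 1:C)
    end
    return W
end

# toy data (assumed): three 2-dimensional points and two centers
X = [0.0 1.0 4.0;
     0.0 1.0 4.0]
centers = [0.5 3.5;
           0.5 3.5]
W = membership_weights(X, centers, 2.0)
```

Each row of W sums to one, and a point receives the larger weight for the center it is closer to.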

Clustering.fuzzy_cmeansFunction
fuzzy_cmeans(data::AbstractMatrix, C::Integer, fuzziness::Real;
-             [dist_metric::SemiMetric], [...]) -> FuzzyCMeansResult

Perform Fuzzy C-means clustering over the given data.

Arguments

  • data::AbstractMatrix: $d×n$ data matrix. Each column represents one $d$-dimensional data point.
  • C::Integer: the number of fuzzy clusters, $2 ≤ C < n$.
  • fuzziness::Real: cluster fuzziness ($μ$ in the mathematical formulation), $μ > 1$.

Optional keyword arguments:

  • dist_metric::SemiMetric (defaults to Euclidean): the SemiMetric object that defines the distance between the data points
  • maxiter, tol, display, rng: see common options
source
Clustering.FuzzyCMeansResultType
FuzzyCMeansResult{T<:AbstractFloat}

The output of fuzzy_cmeans function.

Fields

  • centers::Matrix{T}: the $d×C$ matrix with columns being the centers of resulting fuzzy clusters
  • weights::Matrix{Float64}: the $n×C$ matrix of assignment weights ($\mathrm{weights}_{ij}$ is the weight (probability) of assigning $i$-th point to the $j$-th cluster)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source
Clustering.wcountsFunction
wcounts(R::ClusteringResult) -> Vector{Float64}
-wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings assumes the weight of every data point is 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source

Examples

using Clustering
+             [dist_metric::SemiMetric], [...]) -> FuzzyCMeansResult

Perform Fuzzy C-means clustering over the given data.

Arguments

  • data::AbstractMatrix: $d×n$ data matrix. Each column represents one $d$-dimensional data point.
  • C::Integer: the number of fuzzy clusters, $2 ≤ C < n$.
  • fuzziness::Real: cluster fuzziness ($μ$ in the mathematical formulation), $μ > 1$.

Optional keyword arguments:

  • dist_metric::SemiMetric (defaults to Euclidean): the SemiMetric object that defines the distance between the data points
  • maxiter, tol, display, rng: see common options
source
Clustering.FuzzyCMeansResultType
FuzzyCMeansResult{T<:AbstractFloat}

The output of fuzzy_cmeans function.

Fields

  • centers::Matrix{T}: the $d×C$ matrix with columns being the centers of resulting fuzzy clusters
  • weights::Matrix{Float64}: the $n×C$ matrix of assignment weights ($\mathrm{weights}_{ij}$ is the weight (probability) of assigning $i$-th point to the $j$-th cluster)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source
Clustering.wcountsFunction
wcounts(R::ClusteringResult) -> Vector{Float64}
+wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings assumes the weight of every data point is 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source

Examples

using Clustering
 
 # make a random dataset with 1000 points
 # each point is a 5-dimensional vector
@@ -21,23 +21,23 @@
 # get the point memberships over all the clusters
 # memberships is a 1000x3 matrix
 memberships = R.weights
1000×3 Matrix{Float64}:
- 0.33406   0.334528  0.331412
- 0.332505  0.332657  0.334838
- 0.33481   0.33336   0.33183
- 0.332275  0.332665  0.335059
- 0.327807  0.33451   0.337683
- 0.334801  0.334398  0.330801
- 0.333061  0.334498  0.332442
- 0.333149  0.334944  0.331906
- 0.3305    0.332512  0.336988
- 0.331442  0.333599  0.334958
+ 0.335664  0.329882  0.334454
+ 0.332821  0.335165  0.332014
+ 0.33041   0.338552  0.331039
+ 0.332359  0.33499   0.332651
+ 0.334451  0.329962  0.335586
+ 0.334749  0.332705  0.332545
+ 0.331236  0.333163  0.3356
+ 0.335613  0.330398  0.333989
+ 0.330459  0.33789   0.331651
+ 0.333712  0.331966  0.334322
  ⋮                   
- 0.332512  0.332994  0.334494
- 0.333461  0.332953  0.333586
- 0.329862  0.334123  0.336016
- 0.332913  0.333622  0.333465
- 0.331471  0.333205  0.335324
- 0.330328  0.333264  0.336408
- 0.336638  0.333544  0.329819
- 0.335453  0.332626  0.331921
- 0.336177  0.333567  0.330256
+ 0.335377  0.331347  0.333276
+ 0.332275  0.334708  0.333017
+ 0.330707  0.336478  0.332815
+ 0.333728  0.33271   0.333562
+ 0.333414  0.332617  0.333969
+ 0.333544  0.334717  0.331739
+ 0.334073  0.331234  0.334693
+ 0.33159   0.336781  0.33163
+ 0.331509  0.33421   0.334281
diff --git a/dev/hclust.html b/dev/hclust.html index 54b31dbe..39d9daf6 100644 --- a/dev/hclust.html +++ b/dev/hclust.html @@ -1,5 +1,5 @@ -Hierarchical Clustering · Clustering.jl

Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The hclust function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the linkage parameter):

Clustering.hclustFunction
hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust

Perform hierarchical clustering using the distance matrix d and the cluster linkage function.

Returns the dendrogram as a Hclust object.

Arguments

  • d::AbstractMatrix: the pairwise distance matrix. $d_{ij}$ is the distance between $i$-th and $j$-th points.
  • linkage::Symbol: cluster linkage function to use. linkage defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:
    • :single (the default): use the minimum distance between any of the cluster members
    • :average: use the mean distance between any of the cluster members
    • :complete: use the maximum distance between any of the members
    • :ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
    • :ward_presquared: same as :ward, but assumes that the distances in d are already squared.
  • uplo::Symbol (optional): specifies whether the upper (:U) or the lower (:L) triangle of d should be used to get the distances. If not specified, the method expects d to be symmetric.
  • branchorder::Symbol (optional): algorithm to order leaves and branches. The valid choices are:
    • :r (the default): ordering based on the node heights and the original elements order (compatible with R's hclust)
    • :barjoseph (or :optimal): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et al., Bioinformatics (2001)
source
Clustering.HclustType
Hclust{T<:Real}

The output of hclust, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's hclust class.

Fields

  • merges::Matrix{Int}: $N×2$ matrix encoding subtree merges:
    • each row specifies the left and right subtrees (referenced by their $id$s) that are merged
    • negative subtree $id$ denotes the leaf node and corresponds to the data point at position $-id$
    • positive $id$ denotes nontrivial subtree (the row merges[id, :] specifies its left and right subtrees)
  • linkage::Symbol: the name of cluster linkage function used to construct the hierarchy (see hclust)
  • heights::Vector{T}: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified linkage
  • order::Vector{Int}: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.

See also: hclust.

source

Single-linkage clustering using distance matrix:

using Clustering
+Hierarchical Clustering · Clustering.jl

Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The hclust function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the linkage parameter):

Clustering.hclustFunction
hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust

Perform hierarchical clustering using the distance matrix d and the cluster linkage function.

Returns the dendrogram as a Hclust object.

Arguments

  • d::AbstractMatrix: the pairwise distance matrix. $d_{ij}$ is the distance between $i$-th and $j$-th points.
  • linkage::Symbol: cluster linkage function to use. linkage defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:
    • :single (the default): use the minimum distance between any of the cluster members
    • :average: use the mean distance between any of the cluster members
    • :complete: use the maximum distance between any of the members
    • :ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
    • :ward_presquared: same as :ward, but assumes that the distances in d are already squared.
  • uplo::Symbol (optional): specifies whether the upper (:U) or the lower (:L) triangle of d should be used to get the distances. If not specified, the method expects d to be symmetric.
  • branchorder::Symbol (optional): algorithm to order leaves and branches. The valid choices are:
    • :r (the default): ordering based on the node heights and the original elements order (compatible with R's hclust)
    • :barjoseph (or :optimal): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et al., Bioinformatics (2001)
source
Clustering.HclustType
Hclust{T<:Real}

The output of hclust, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's hclust class.

Fields

  • merges::Matrix{Int}: $N×2$ matrix encoding subtree merges:
    • each row specifies the left and right subtrees (referenced by their $id$s) that are merged
    • negative subtree $id$ denotes the leaf node and corresponds to the data point at position $-id$
    • positive $id$ denotes nontrivial subtree (the row merges[id, :] specifies its left and right subtrees)
  • linkage::Symbol: the name of cluster linkage function used to construct the hierarchy (see hclust)
  • heights::Vector{T}: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified linkage
  • order::Vector{Int}: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.
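The merges encoding described in the fields above can be decoded recursively. The helper subtree_leaves and the hand-written merges matrix are hypothetical and serve only to illustrate the sign convention for subtree ids.

```julia
# hypothetical helper: collect the data point indices under the subtree `id`,
# where negative ids are leaves and positive ids refer to rows of `merges`
function subtree_leaves(merges::AbstractMatrix{<:Integer}, id::Integer)
    id < 0 && return [-id]
    left, right = merges[id, 1], merges[id, 2]
    return vcat(subtree_leaves(merges, left), subtree_leaves(merges, right))
end

# hand-written example: points 1 and 2 are merged first, then points 3 and 4,
# and finally the two resulting subtrees (rows 1 and 2) are merged
merges = [-1 -2;
          -3 -4;
           1  2]

leaves_root = subtree_leaves(merges, 3)
leaves_first = subtree_leaves(merges, 1)
```

Here the last row of merges describes the root of the dendrogram, so decoding it yields all four points.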

See also: hclust.

source

Single-linkage clustering using distance matrix:

using Clustering
 D = rand(1000, 1000);
 D += D'; # symmetric distance matrix (optional)
-result = hclust(D, linkage=:single)
Hclust{Float64}([-690 -813; -554 -732; … ; -16 997; -195 998], [0.0035044882382555542, 0.004548714784891494, 0.004950483509694847, 0.005373743972241107, 0.005886254023239723, 0.0063835889565238, 0.0065741127822909196, 0.006700805844352842, 0.006809675623588918, 0.007200475155618946  …  0.09944908633303073, 0.1004498060257667, 0.10136046307512614, 0.10594225340631724, 0.10744167473659694, 0.10969608920802298, 0.11276149705686223, 0.11297511526379667, 0.12343594323017404, 0.12775784522521183], [195, 16, 735, 657, 987, 367, 339, 142, 8, 844  …  190, 603, 18, 903, 755, 944, 120, 148, 128, 995], :single)

The resulting dendrogram could be converted into disjoint clusters with the help of cutree function.

Clustering.cutreeFunction
cutree(hclu::Hclust; [k], [h]) -> Vector{Int}

Cut the hclu dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector $z$ ($z_i$ is the index of the cluster for the $i$-th data point).

Arguments

  • k::Integer (optional) the number of desired clusters.
  • h::Real (optional) the height at which the tree is cut.

If both k and h are specified, it's guaranteed that the number of clusters is not less than k and their height is not above h.

See also: hclust

source
+result = hclust(D, linkage=:single)
Hclust{Float64}([-25 -655; -43 -338; … ; -406 997; -320 998], [0.0024236034900564363, 0.002665958319813755, 0.0038603982170157813, 0.0038753826540584013, 0.004084505268348582, 0.005173840217677972, 0.005571644990888691, 0.005984996483560989, 0.006006900496367318, 0.0060073404883634884  …  0.096992187278233, 0.09966557655538466, 0.10026980491751936, 0.10102353867674474, 0.1011779425623146, 0.10219804440492952, 0.104764893691627, 0.11060523094641495, 0.1313882749229459, 0.13443246402322673], [320, 406, 246, 829, 195, 945, 414, 303, 827, 768  …  871, 72, 737, 142, 25, 655, 199, 716, 955, 987], :single)

The resulting dendrogram could be converted into disjoint clusters with the help of cutree function.

Clustering.cutreeFunction
cutree(hclu::Hclust; [k], [h]) -> Vector{Int}

Cut the hclu dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector $z$ ($z_i$ is the index of the cluster for the $i$-th data point).

Arguments

  • k::Integer (optional) the number of desired clusters.
  • h::Real (optional) the height at which the tree is cut.

If both k and h are specified, it's guaranteed that the number of clusters is not less than k and their height is not above h.

See also: hclust

source
diff --git a/dev/index.html b/dev/index.html index 337bf824..f6f72e1e 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Introduction · Clustering.jl
+Introduction · Clustering.jl
diff --git a/dev/init.html b/dev/init.html index 6548de37..54c4ff29 100644 --- a/dev/init.html +++ b/dev/init.html @@ -1,6 +1,6 @@ -Initialization · Clustering.jl

Initialization

A clustering algorithm usually requires initialization before it can be started.

Seeding

Seeding is a type of clustering initialization, which provides a few seeds – points from a data set that would serve as the initial cluster centers (one for each cluster).

Each seeding algorithm implemented by Clustering.jl is a subtype of SeedingAlgorithm:

Clustering.initseeds!Function
initseeds!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
-           X::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the X data matrix using the alg seeding algorithm.

source
Clustering.initseeds_by_costs!Function
initseeds_by_costs!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
-                    costs::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the costs matrix using the alg seeding algorithm.

Here, costs[i, j] is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

source

There are several seeding methods described in the literature. Clustering.jl implements three popular ones:

Clustering.KmppAlgType
KmppAlg <: SeedingAlgorithm

Kmeans++ seeding (:kmpp).

Chooses the seeds sequentially. The probability of a point to be chosen is proportional to the minimum cost of assigning it to the existing seeds.

References

D. Arthur and S. Vassilvitskii (2007). k-means++: the advantages of careful seeding. 18th Annual ACM-SIAM symposium on Discrete algorithms, 2007.

source
Clustering.KmCentralityAlgType
KmCentralityAlg <: SeedingAlgorithm

K-medoids initialization based on centrality (:kmcen).

Choose the $k$ points with the highest centrality as seeds.

References

Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. doi:10.1016/j.eswa.2008.01.039

source
Clustering.RandSeedAlgType
RandSeedAlg <: SeedingAlgorithm

Random seeding (:rand).

Chooses an arbitrary subset of $k$ data points as cluster seeds.

source

In practice, we have found that Kmeans++ is the most effective choice.

For convenience, the package defines the two wrapper functions that accept the short name of the seeding algorithm and the number of clusters and take care of allocating iseeds and applying the proper SeedingAlgorithm:

Clustering.initseedsFunction
initseeds(alg::Union{SeedingAlgorithm, Symbol},
-          X::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from a $d×n$ data matrix X using the alg algorithm.

alg could be either an instance of SeedingAlgorithm or a symbolic name of the algorithm.

Returns the vector of k seed indices.

source
Clustering.initseeds_by_costsFunction
initseeds_by_costs(alg::Union{SeedingAlgorithm, Symbol},
-                   costs::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from the $n×n$ costs matrix using algorithm alg.

Here, costs[i, j] is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

Returns the vector of k seed indices.

source
+Initialization · Clustering.jl

Initialization

A clustering algorithm usually requires initialization before it can be started.

Seeding

Seeding is a type of clustering initialization, which provides a few seeds – points from a data set that would serve as the initial cluster centers (one for each cluster).

Each seeding algorithm implemented by Clustering.jl is a subtype of SeedingAlgorithm:

Clustering.initseeds!Function
initseeds!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
+           X::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the X data matrix using the alg seeding algorithm.

source
Clustering.initseeds_by_costs!Function
initseeds_by_costs!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
+                    costs::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the costs matrix using the alg seeding algorithm.

Here, costs[i, j] is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

source

There are several seeding methods described in the literature. Clustering.jl implements three popular ones:

Clustering.KmppAlgType
KmppAlg <: SeedingAlgorithm

Kmeans++ seeding (:kmpp).

Chooses the seeds sequentially. The probability of a point to be chosen is proportional to the minimum cost of assigning it to the existing seeds.
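The sequential selection can be sketched as follows. This is a simplified illustration rather than the package's implementation: the name kmpp_sketch is hypothetical, the first seed is assumed to be drawn uniformly at random, and the costs matrix is assumed to have a zero diagonal.

```julia
using Random

# hypothetical sketch of sequential seeding over an n×n costs matrix
function kmpp_sketch(rng::AbstractRNG, costs::AbstractMatrix, k::Integer)
    n = size(costs, 1)
    seeds = [rand(rng, 1:n)]            # assumption: first seed is uniform
    mincosts = costs[:, seeds[1]]       # cost of assigning each point to its nearest seed
    while length(seeds) < k
        # draw an index with probability proportional to mincosts
        r = rand(rng) * sum(mincosts)
        next = something(findfirst(>(r), cumsum(mincosts)), argmax(mincosts))
        push!(seeds, next)
        mincosts = min.(mincosts, costs[:, next])
    end
    return seeds
end

# toy data (assumed): five points on a line, squared distance as the cost
pts = [0.0, 0.1, 5.0, 5.1, 10.0]
costs = [(a - b)^2 for a in pts, b in pts]
seeds = kmpp_sketch(MersenneTwister(42), costs, 3)
```

Points that are already seeds have a minimum cost of zero and therefore cannot be drawn again, so the selected indices are distinct.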

References

D. Arthur and S. Vassilvitskii (2007). k-means++: the advantages of careful seeding. 18th Annual ACM-SIAM symposium on Discrete algorithms, 2007.

source
Clustering.KmCentralityAlgType
KmCentralityAlg <: SeedingAlgorithm

K-medoids initialization based on centrality (:kmcen).

Choose the $k$ points with the highest centrality as seeds.

References

Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. doi:10.1016/j.eswa.2008.01.039

source
Clustering.RandSeedAlgType
RandSeedAlg <: SeedingAlgorithm

Random seeding (:rand).

Chooses an arbitrary subset of $k$ data points as cluster seeds.

source

In practice, we have found that Kmeans++ is the most effective choice.

For convenience, the package defines the two wrapper functions that accept the short name of the seeding algorithm and the number of clusters and take care of allocating iseeds and applying the proper SeedingAlgorithm:

Clustering.initseedsFunction
initseeds(alg::Union{SeedingAlgorithm, Symbol},
+          X::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from a $d×n$ data matrix X using the alg algorithm.

alg could be either an instance of SeedingAlgorithm or a symbolic name of the algorithm.

Returns the vector of k seed indices.

source
Clustering.initseeds_by_costsFunction
initseeds_by_costs(alg::Union{SeedingAlgorithm, Symbol},
+                   costs::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from the $n×n$ costs matrix using algorithm alg.

Here, costs[i, j] is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

Returns the vector of k seed indices.

source
diff --git a/dev/kmeans-ccfdcd40.svg b/dev/kmeans-7d3de746.svg similarity index 63% rename from dev/kmeans-ccfdcd40.svg rename to dev/kmeans-7d3de746.svg index 7b15f6ef..11927e05 100644 --- a/dev/kmeans-ccfdcd40.svg +++ b/dev/kmeans-7d3de746.svgdiff --git a/dev/kmeans.html b/dev/kmeans.html index 3726d012..45737154 100644 --- a/dev/kmeans.html +++ b/dev/kmeans.html @@ -1,5 +1,5 @@ -K-means · Clustering.jl

K-means

K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to a cluster with the nearest center.

From a mathematical standpoint, K-means is a coordinate descent algorithm that solves the following optimization problem:

\[\text{minimize} \ \sum_{i=1}^n \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \ \text{w.r.t.} \ (\boldsymbol{\mu}, z)\]

Here, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ is an index of the cluster for $i$-th point $\mathbf{x}_i$.

Clustering.kmeansFunction
kmeans(X, k, [...]) -> KmeansResult

K-means clustering of the $d×n$ data matrix X (each column of X is a $d$-dimensional data point) into k clusters.

Arguments

  • init (defaults to :kmpp): how cluster seeds should be initialized, could be one of the following:
    • a Symbol, the name of a seeding algorithm (see Seeding for a list of supported methods);
    • an instance of SeedingAlgorithm;
    • an integer vector of length $k$ that provides the indices of points to use as initial seeds.
  • weights: $n$-element vector of point weights (the cluster centers are the weighted means of cluster members)
  • maxiter, tol, display: see common options
source
Clustering.KmeansResultType
KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult

The output of kmeans and kmeans!.

Type parameters

  • C<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix
  • D<:Real: type of the assignment cost
  • WC<:Real: type of the cluster weight
source

If you already have a set of initial center vectors, kmeans! could be used:

Clustering.kmeans!Function
kmeans!(X, centers; [kwargs...]) -> KmeansResult

Update the current cluster centers ($d×k$ matrix, where $d$ is the dimension and $k$ the number of centroids) using the $d×n$ data matrix X (each column of X is a $d$-dimensional data point).

See kmeans for the description of optional kwargs.

source

Examples

using Clustering
+K-means · Clustering.jl

K-means

K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to a cluster with the nearest center.

From a mathematical standpoint, K-means is a coordinate descent algorithm that solves the following optimization problem:

\[\text{minimize} \ \sum_{i=1}^n \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \ \text{w.r.t.} \ (\boldsymbol{\mu}, z)\]

Here, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ is an index of the cluster for $i$-th point $\mathbf{x}_i$.
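
One coordinate-descent sweep on this objective can be sketched in plain Julia as follows. The names kmeans_objective and lloyd_step are made up for this illustration and are not exported by Clustering.jl. The sketch ignores point weights and stopping criteria, and it keeps the old center when a cluster becomes empty, which is an assumption of this example.

```julia
using Statistics

# Objective from the formula above: sum of squared distances between each
# column of X and the center it is assigned to.
kmeans_objective(X, centers, z) =
    sum(sum(abs2, view(X, :, i) .- view(centers, :, z[i])) for i in axes(X, 2))

# One sweep: update z with the centers fixed, then the centers with z fixed.
function lloyd_step(X::AbstractMatrix{<:Real}, centers::AbstractMatrix{<:Real})
    n = size(X, 2)
    k = size(centers, 2)
    # assignment step: nearest center for every point
    z = [argmin([sum(abs2, view(X, :, i) .- view(centers, :, c)) for c in 1:k])
         for i in 1:n]
    # update step: each center becomes the mean of its members
    newcenters = float.(centers)
    for c in 1:k
        members = findall(==(c), z)
        isempty(members) && continue      # assumption: keep the old center
        newcenters[:, c] = vec(mean(view(X, :, members); dims=2))
    end
    return newcenters, z
end
```

Neither step can increase the objective, which is why repeating the sweep converges to a local minimum.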

Clustering.kmeansFunction
kmeans(X, k, [...]) -> KmeansResult

K-means clustering of the $d×n$ data matrix X (each column of X is a $d$-dimensional data point) into k clusters.

Arguments

  • init (defaults to :kmpp): how cluster seeds should be initialized, could be one of the following:
    • a Symbol, the name of a seeding algorithm (see Seeding for a list of supported methods);
    • an instance of SeedingAlgorithm;
    • an integer vector of length $k$ that provides the indices of points to use as initial seeds.
  • weights: $n$-element vector of point weights (the cluster centers are the weighted means of cluster members)
  • maxiter, tol, display: see common options
source
Clustering.KmeansResultType
KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult

The output of kmeans and kmeans!.

Type parameters

  • C<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix
  • D<:Real: type of the assignment cost
  • WC<:Real: type of the cluster weight
source

If you already have a set of initial center vectors, kmeans! could be used:

Clustering.kmeans!Function
kmeans!(X, centers; [kwargs...]) -> KmeansResult

Update the current cluster centers ($d×k$ matrix, where $d$ is the dimension and $k$ the number of centroids) using the $d×n$ data matrix X (each column of X is a $d$-dimensional data point).

See kmeans for the description of optional kwargs.

source

Examples

using Clustering
 
 # make a random dataset with 1000 random 5-dimensional points
 X = rand(5, 1000)
@@ -12,11 +12,11 @@
 a = assignments(R) # get the assignments of points to clusters
 c = counts(R) # get the cluster sizes
 M = R.centers # get the cluster centers
5×20 Matrix{Float64}:
- 0.2117    0.491851  0.758457  0.602743  …  0.217038  0.852979  0.748562
- 0.770759  0.309261  0.347395  0.692755     0.5091    0.213795  0.752339
- 0.69697   0.730276  0.182796  0.707557     0.67861   0.692465  0.528834
- 0.319476  0.263525  0.726214  0.257265     0.789839  0.729913  0.220093
- 0.403965  0.148082  0.294128  0.788793     0.215303  0.208538  0.204926

Scatter plot of the K-means clustering results:

using RDatasets, Clustering, Plots
+ 0.244608  0.775184  0.600061  0.811659  …  0.78442   0.72628   0.23542
+ 0.745749  0.203884  0.793348  0.77787      0.463496  0.207723  0.723428
+ 0.273952  0.653845  0.780293  0.590436     0.290323  0.168559  0.766353
+ 0.693956  0.319781  0.772319  0.274935     0.177837  0.600826  0.690446
+ 0.744836  0.74821   0.676119  0.806888     0.349246  0.259978  0.234391

Scatter plot of the K-means clustering results:

using RDatasets, Clustering, Plots
 iris = dataset("datasets", "iris"); # load the data
 
 features = collect(Matrix(iris[:, 1:4])'); # features to use for clustering
@@ -24,4 +24,4 @@
 
 # plot with the point color mapped to the assigned cluster index
 scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
-        color=:lightrainbow, legend=false)
Example block output
+ color=:lightrainbow, legend=false)
Example block output
diff --git a/dev/kmedoids.html b/dev/kmedoids.html index 44f27e65..fe5f8710 100644 --- a/dev/kmedoids.html +++ b/dev/kmedoids.html @@ -1,3 +1,3 @@ -K-medoids · Clustering.jl

K-medoids

K-medoids is a clustering algorithm that works by finding $k$ data points (called medoids) such that the total distance between each data point and the closest medoid is minimal.

Clustering.kmedoidsFunction
kmedoids(dist::AbstractMatrix, k::Integer; ...) -> KmedoidsResult

Perform K-medoids clustering of $n$ points into k clusters, given the dist matrix ($n×n$, dist[i, j] is the distance between the j-th and i-th points).

Arguments

  • init (defaults to :kmpp): how medoids should be initialized, could be one of the following:
    • a Symbol indicating the name of a seeding algorithm (see Seeding for a list of supported methods).
    • an integer vector of length k that provides the indices of points to use as initial medoids.
  • maxiter, tol, display: see common options

Note

The function implements a K-means style algorithm instead of PAM (Partitioning Around Medoids). The K-means style algorithm converges in fewer iterations, but was shown to produce worse results (10-20% higher total costs); see e.g. Schubert & Rousseeuw (2019).

source
Clustering.kmedoids!Function
kmedoids!(dist::AbstractMatrix, medoids::Vector{Int};
-          [kwargs...]) -> KmedoidsResult

Update the current cluster medoids using the dist matrix.

The medoids field of the returned KmedoidsResult points to the same array as medoids argument.

See kmedoids for the description of optional kwargs.

source
Clustering.KmedoidsResultType
KmedoidsResult{T} <: ClusteringResult

The output of kmedoids function.

Fields

  • medoids::Vector{Int}: the indices of $k$ medoids
  • assignments::Vector{Int}: the indices of clusters the points are assigned to, so that medoids[assignments[i]] is the index of the medoid for the $i$-th point
  • costs::Vector{T}: assignment costs, i.e. costs[i] is the cost of assigning $i$-th point to its medoid
  • counts::Vector{Int}: cluster sizes
  • totalcost::Float64: total assignment cost (the sum of costs)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source

References

  1. Teitz, M.B. and Bart, P. (1968). Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph. Operations Research, 16(5), 955–961. doi:10.1287/opre.16.5.955
  2. Schubert, E. and Rousseeuw, P.J. (2019). Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP, 171-187. doi:10.1007/978-3-030-32047-8_16
+K-medoids · Clustering.jl

K-medoids

K-medoids is a clustering algorithm that works by finding $k$ data points (called medoids) such that the total distance between each data point and the closest medoid is minimal.
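
The quantity being minimized can be written down in a few lines. The sketch below is a plain-Julia illustration with a hypothetical name, medoid_cost_sketch, and is not a function of Clustering.jl: for a fixed set of medoids it assigns every point to its closest medoid and sums the resulting distances.

```julia
# Hypothetical helper for illustration only (not part of Clustering.jl).
# dist is an n×n distance matrix, medoids holds the indices of the chosen points.
function medoid_cost_sketch(dist::AbstractMatrix{<:Real}, medoids::AbstractVector{<:Integer})
    n = size(dist, 1)
    size(dist, 2) == n || throw(DimensionMismatch("dist must be square"))
    # assignments[i] indexes into medoids, so medoids[assignments[i]] is a point index
    assignments = [argmin(dist[medoids, i]) for i in 1:n]
    costs = [dist[medoids[assignments[i]], i] for i in 1:n]
    return assignments, costs, sum(costs)
end
```

K-medoids searches over the choice of medoids for the one with the smallest total returned here.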

Clustering.kmedoidsFunction
kmedoids(dist::AbstractMatrix, k::Integer; ...) -> KmedoidsResult

Perform K-medoids clustering of $n$ points into k clusters, given the dist matrix ($n×n$, dist[i, j] is the distance between the j-th and i-th points).

Arguments

  • init (defaults to :kmpp): how medoids should be initialized, could be one of the following:
    • a Symbol indicating the name of a seeding algorithm (see Seeding for a list of supported methods).
    • an integer vector of length k that provides the indices of points to use as initial medoids.
  • maxiter, tol, display: see common options

Note

The function implements a K-means style algorithm instead of PAM (Partitioning Around Medoids). The K-means style algorithm converges in fewer iterations, but was shown to produce worse results (10-20% higher total costs); see e.g. Schubert & Rousseeuw (2019).

source
Clustering.kmedoids!Function
kmedoids!(dist::AbstractMatrix, medoids::Vector{Int};
+          [kwargs...]) -> KmedoidsResult

Update the current cluster medoids using the dist matrix.

The medoids field of the returned KmedoidsResult points to the same array as medoids argument.

See kmedoids for the description of optional kwargs.

source
Clustering.KmedoidsResultType
KmedoidsResult{T} <: ClusteringResult

The output of kmedoids function.

Fields

  • medoids::Vector{Int}: the indices of $k$ medoids
  • assignments::Vector{Int}: the indices of clusters the points are assigned to, so that medoids[assignments[i]] is the index of the medoid for the $i$-th point
  • costs::Vector{T}: assignment costs, i.e. costs[i] is the cost of assigning $i$-th point to its medoid
  • counts::Vector{Int}: cluster sizes
  • totalcost::Float64: total assignment cost (the sum of costs)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source

References

  1. Teitz, M.B. and Bart, P. (1968). Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph. Operations Research, 16(5), 955–961. doi:10.1287/opre.16.5.955
  2. Schubert, E. and Rousseeuw, P.J. (2019). Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP, 171-187. doi:10.1007/978-3-030-32047-8_16
diff --git a/dev/mcl.html b/dev/mcl.html index 4700348d..b256dfc7 100644 --- a/dev/mcl.html +++ b/dev/mcl.html @@ -1,2 +1,2 @@ -MCL (Markov Cluster Algorithm) · Clustering.jl

MCL (Markov Cluster Algorithm)

Markov Cluster Algorithm works by simulating a stochastic (Markov) flow in a weighted graph, where each node is a data point, and the edge weights are defined by the adjacency matrix. ... When the algorithm converges, it produces the new edge weights that define the new connected components of the graph (i.e. the clusters).

Clustering.mclFunction
mcl(adj::AbstractMatrix; [kwargs...]) -> MCLResult

Perform MCL (Markov Cluster Algorithm) clustering using $n×n$ adjacency (points similarity) matrix adj.

Arguments

Keyword arguments to control the MCL algorithm:

  • add_loops::Bool (enabled by default): whether the edges of weight 1.0 from the node to itself should be appended to the graph
  • expansion::Number (defaults to 2): MCL expansion constant
  • inflation::Number (defaults to 2): MCL inflation constant
  • save_final_matrix::Bool (disabled by default): whether to save the final equilibrium state in the mcl_adj field of the result; could provide useful diagnostic if the method doesn't converge
  • prune_tol::Number: pruning threshold
  • display, maxiter, tol: see common options

References

Stijn van Dongen, "Graph clustering by flow simulation", 2001

Original MCL implementation.

source
Clustering.MCLResultType
MCLResult <: ClusteringResult

The output of mcl function.

Fields

  • mcl_adj::AbstractMatrix: the final MCL adjacency matrix (equilibrium state matrix if the algorithm converged), empty if save_final_matrix option is disabled
  • assignments::Vector{Int}: indices of the points' clusters. assignments[i] is the index of the cluster for the $i$-th point ($0$ if unassigned)
  • counts::Vector{Int}: the $k$-length vector of cluster sizes
  • nunassigned::Int: the number of standalone points not assigned to any cluster
  • iterations::Int: the number of elapsed iterations
  • rel_Δ::Float64: the final relative Δ
  • converged::Bool: whether the method converged
source
+MCL (Markov Cluster Algorithm) · Clustering.jl

MCL (Markov Cluster Algorithm)

Markov Cluster Algorithm works by simulating a stochastic (Markov) flow in a weighted graph, where each node is a data point, and the edge weights are defined by the adjacency matrix. ... When the algorithm converges, it produces the new edge weights that define the new connected components of the graph (i.e. the clusters).

Clustering.mclFunction
mcl(adj::AbstractMatrix; [kwargs...]) -> MCLResult

Perform MCL (Markov Cluster Algorithm) clustering using $n×n$ adjacency (points similarity) matrix adj.

Arguments

Keyword arguments to control the MCL algorithm:

  • add_loops::Bool (enabled by default): whether the edges of weight 1.0 from the node to itself should be appended to the graph
  • expansion::Number (defaults to 2): MCL expansion constant
  • inflation::Number (defaults to 2): MCL inflation constant
  • save_final_matrix::Bool (disabled by default): whether to save the final equilibrium state in the mcl_adj field of the result; could provide useful diagnostic if the method doesn't converge
  • prune_tol::Number: pruning threshold
  • display, maxiter, tol: see common options

References

Stijn van Dongen, "Graph clustering by flow simulation", 2001

Original MCL implementation.

source
Clustering.MCLResultType
MCLResult <: ClusteringResult

The output of mcl function.

Fields

  • mcl_adj::AbstractMatrix: the final MCL adjacency matrix (equilibrium state matrix if the algorithm converged), empty if save_final_matrix option is disabled
  • assignments::Vector{Int}: indices of the points' clusters. assignments[i] is the index of the cluster for the $i$-th point ($0$ if unassigned)
  • counts::Vector{Int}: the $k$-length vector of cluster sizes
  • nunassigned::Int: the number of standalone points not assigned to any cluster
  • iterations::Int: the number of elapsed iterations
  • rel_Δ::Float64: the final relative Δ
  • converged::Bool: whether the method converged
source
diff --git a/dev/validate.html b/dev/validate.html index 410fb651..a64f7f4e 100644 --- a/dev/validate.html +++ b/dev/validate.html @@ -1,8 +1,8 @@ Evaluation & Validation · Clustering.jl

Evaluation & Validation

Clustering.jl package provides a number of methods to evaluate the results of a clustering algorithm and/or to validate its correctness.

Cross tabulation

Cross tabulation, or contingency matrix, is the basis for many clustering quality measures. It shows how similar the two clusterings are at the cluster level.

Clustering.jl extends StatsBase.counts() with methods that accept ClusteringResult arguments:

StatsBase.countsMethod
counts(a::ClusteringResult, b::ClusteringResult) -> Matrix{Int}
 counts(a::ClusteringResult, b::AbstractVector{<:Integer}) -> Matrix{Int}
-counts(a::AbstractVector{<:Integer}, b::ClusteringResult) -> Matrix{Int}

Calculate the cross tabulation (aka contingency matrix) for the two clusterings of the same data points.

Returns the $n_a × n_b$ matrix C, where $n_a$ and $n_b$ are the numbers of clusters in a and b, respectively, and C[i, j] is the size of the intersection of i-th cluster from a and j-th cluster from b.

The clusterings could be specified either as ClusteringResult instances or as vectors of data point assignments.

source

Rand index

Rand index is a measure of the similarity between the two data clusterings. From a mathematical standpoint, Rand index is related to the prediction accuracy, but is applicable even when the original class labels are not used.

Clustering.randindexFunction
randindex(a, b) -> NTuple{4, Float64}

Compute the tuple of Rand-related indices between the clusterings a and b.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

Returns a tuple of indices:

  • Hubert & Arabie Adjusted Rand index
  • Rand index (agreement probability)
  • Mirkin's index (disagreement probability)
  • Hubert's index ($P(\mathrm{agree}) - P(\mathrm{disagree})$)

References

Lawrence Hubert and Phipps Arabie (1985). Comparing partitions. Journal of Classification 2 (1): 193-218

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173-187.

Steinley, Douglas (2004). Properties of the Hubert-Arabie Adjusted Rand Index. Psychological Methods, Vol. 9, No. 3: 386-396

source

Silhouettes

Silhouettes is a method for evaluating the quality of clustering. Particularly, it provides a quantitative way to measure how well each point lies within its cluster in comparison to the other clusters.

The Silhouette value for the $i$-th data point is:

\[s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \ \text{where}\]

  • $a_i$ is the average distance from the $i$-th point to the other points in the same cluster $z_i$,
  • $b_i ≝ \min_{k \ne z_i} b_{ik}$, where $b_{ik}$ is the average distance from the $i$-th point to the points in the $k$-th cluster.

Note that $-1 \le s_i \le 1$, and that $s_i$ is close to $1$ when the $i$-th point lies well within its own cluster. This property allows using mean(silhouettes(assignments, counts, X)) as a measure of clustering quality. Higher values indicate better separation of clusters w.r.t. point distances.

Clustering.silhouettesFunction
silhouettes(assignments::Union{AbstractVector, ClusteringResult}, point_dists::Matrix) -> Vector{Float64}
+counts(a::AbstractVector{<:Integer}, b::ClusteringResult) -> Matrix{Int}

Calculate the cross tabulation (aka contingency matrix) for the two clusterings of the same data points.

Returns the $n_a × n_b$ matrix C, where $n_a$ and $n_b$ are the numbers of clusters in a and b, respectively, and C[i, j] is the size of the intersection of i-th cluster from a and j-th cluster from b.

The clusterings could be specified either as ClusteringResult instances or as vectors of data point assignments.

source
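
The cross tabulation described above amounts to counting co-occurring labels. A minimal sketch in plain Julia, assuming both assignment vectors use labels 1, 2, ... without gaps (crosstab_sketch is a hypothetical name, not the package's counts method):

```julia
# Hypothetical helper for illustration only (not part of Clustering.jl).
# a and b are assignment vectors for the same points, with labels starting at 1.
function crosstab_sketch(a::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    length(a) == length(b) || throw(DimensionMismatch("a and b must have the same length"))
    C = zeros(Int, maximum(a), maximum(b))
    for (i, j) in zip(a, b)
        C[i, j] += 1      # one more point shared by cluster i of a and cluster j of b
    end
    return C
end
```

For a = [1, 1, 2, 2] and b = [1, 1, 1, 2], the first cluster of a lies entirely inside the first cluster of b, while the second cluster of a is split between the two clusters of b.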

Rand index

Rand index is a measure of the similarity between the two data clusterings. From a mathematical standpoint, Rand index is related to the prediction accuracy, but is applicable even when the original class labels are not used.

Clustering.randindexFunction
randindex(a, b) -> NTuple{4, Float64}

Compute the tuple of Rand-related indices between the clusterings a and b.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

Returns a tuple of indices:

  • Hubert & Arabie Adjusted Rand index
  • Rand index (agreement probability)
  • Mirkin's index (disagreement probability)
  • Hubert's index ($P(\mathrm{agree}) - P(\mathrm{disagree})$)

References

Lawrence Hubert and Phipps Arabie (1985). Comparing partitions. Journal of Classification 2 (1): 193-218

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173-187.

Steinley, Douglas (2004). Properties of the Hubert-Arabie Adjusted Rand Index. Psychological Methods, Vol. 9, No. 3: 386-396

source

Silhouettes

Silhouettes is a method for evaluating the quality of clustering. Particularly, it provides a quantitative way to measure how well each point lies within its cluster in comparison to the other clusters.

The Silhouette value for the $i$-th data point is:

\[s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \ \text{where}\]

  • $a_i$ is the average distance from the $i$-th point to the other points in the same cluster $z_i$,
  • $b_i ≝ \min_{k \ne z_i} b_{ik}$, where $b_{ik}$ is the average distance from the $i$-th point to the points in the $k$-th cluster.

Note that $-1 \le s_i \le 1$, and that $s_i$ is close to $1$ when the $i$-th point lies well within its own cluster. This property allows using mean(silhouettes(assignments, counts, X)) as a measure of clustering quality. Higher values indicate better separation of clusters w.r.t. point distances.
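
The definition can be followed step by step in a small plain-Julia sketch that works on a precomputed distance matrix. The name silhouettes_sketch is hypothetical, and the choice of $s_i = 0$ for a point that is alone in its cluster (or when there is no other cluster) is an assumption of this example; the package's silhouettes function may treat such cases differently.

```julia
# Hypothetical helper for illustration only (not part of Clustering.jl).
# z holds cluster labels 1..k, D is the n×n matrix of pairwise distances.
function silhouettes_sketch(z::AbstractVector{<:Integer}, D::AbstractMatrix{<:Real})
    n = length(z)
    size(D) == (n, n) || throw(DimensionMismatch("D must be n×n"))
    k = maximum(z)
    s = zeros(n)
    for i in 1:n
        sums = zeros(k)          # total distance from point i to each cluster
        cnts = zeros(Int, k)     # number of other points in each cluster
        for j in 1:n
            j == i && continue
            sums[z[j]] += D[i, j]
            cnts[z[j]] += 1
        end
        cnts[z[i]] == 0 && continue        # assumption: singleton cluster keeps s[i] = 0
        a = sums[z[i]] / cnts[z[i]]        # a_i: mean distance within the own cluster
        # b_i: smallest mean distance to any other non-empty cluster
        b = minimum((sums[c] / cnts[c] for c in 1:k if c != z[i] && cnts[c] > 0); init = Inf)
        isfinite(b) || continue            # assumption: no other cluster keeps s[i] = 0
        m = max(a, b)
        s[i] = m > 0 ? (b - a) / m : 0.0
    end
    return s
end
```

For four points on a line at 0, 1, 10 and 11 with assignments [1, 1, 2, 2], the first point has $a_1 = 1$ and $b_1 = 10.5$, so $s_1 = 9.5 / 10.5$, which is close to $1$ as expected for well-separated clusters.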

Clustering.silhouettesFunction
silhouettes(assignments::Union{AbstractVector, ClusteringResult}, point_dists::Matrix) -> Vector{Float64}
 silhouettes(assignments::Union{AbstractVector, ClusteringResult}, points::Matrix;
-            metric::SemiMetric, [batch_size::Integer]) -> Vector{Float64}

Compute silhouette values for individual points w.r.t. given clustering.

Returns the $n$-length vector of silhouette values for each individual point.

Arguments

  • assignments::Union{AbstractVector{Int}, ClusteringResult}: the vector of point assignments (cluster indices)
  • points::AbstractMatrix: if metric is nothing, it is an $n×n$ matrix of pairwise distances between the points; otherwise it is a $d×n$ matrix of $d$-dimensional clustered data points.
  • metric::Union{SemiMetric, Nothing}: an instance of Distances Metric object or nothing, indicating the distance metric used for calculating point distances.
  • batch_size::Union{Integer, Nothing}: if an integer is given, calculate silhouettes in batches of batch_size points each; throws DimensionMismatch if batched calculation is not supported by the given metric.

References

Peter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Computational and Applied Mathematics. 20: 53–65.

Marco Gaido (2023). Distributed Silhouette Algorithm: Evaluating Clustering on Big Data

source

Variation of Information

Variation of information (also known as shared information distance) is a measure of the distance between the two clusterings. It is devised from the mutual information, but it is a true metric, i.e. it is symmetric and satisfies the triangle inequality.

Clustering.varinfoFunction
varinfo(a, b) -> Float64

Compute the variation of information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

References

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173–187.

source

V-measure

V-measure can be used to compare the clustering results with the existing class labels of data points or with the alternative clustering. It is defined as the harmonic mean of homogeneity ($h$) and completeness ($c$) of the clustering:

\[V_{\beta} = (1+\beta)\frac{h \cdot c}{\beta \cdot h + c}.\]

Both $h$ and $c$ can be expressed in terms of the mutual information and entropy measures from information theory. Homogeneity ($h$) is maximized when each cluster contains elements of as few different classes as possible. Completeness ($c$) aims to put all elements of each class in a single cluster. The $\beta$ parameter ($\beta > 0$) can be used to control the weights of $h$ and $c$ in the final measure: if $\beta > 1$, completeness has more weight, and if $\beta < 1$, homogeneity does.

Clustering.vmeasureFunction
vmeasure(a, b; [β = 1.0]) -> Float64

V-measure between the two clusterings.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

The β parameter defines trade-off between homogeneity and completeness:

  • if $β > 1$, completeness is weighted more strongly,
  • if $β < 1$, homogeneity is weighted more strongly.

References

Andrew Rosenberg and Julia Hirschberg, 2007. V-Measure: A conditional entropy-based external cluster evaluation measure

source

Mutual information

Mutual information quantifies the "amount of information" obtained about one random variable through observing the other random variable. It is used in determining the similarity of two different clusterings of a dataset.

Clustering.mutualinfoFunction
mutualinfo(a, b; normed=true) -> Float64

Compute the mutual information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

If the normed parameter is true, the return value is the normalized mutual information (symmetric uncertainty); see "Data Mining: Practical Machine Learning Tools and Techniques", Witten & Frank 2005.

References

Vinh, Epps, and Bailey (2009). “Information theoretic measures for clusterings comparison”. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09.

source

Confusion matrix

Pair confusion matrix arising from two clusterings is a 2×2 contingency table representation of the partition co-occurrence, see counts.

Clustering.confusionFunction
confusion([T = Int],
+            metric::SemiMetric, [batch_size::Integer]) -> Vector{Float64}

Compute silhouette values for individual points w.r.t. given clustering.

Returns the $n$-length vector of silhouette values for each individual point.

Arguments

  • assignments::Union{AbstractVector{Int}, ClusteringResult}: the vector of point assignments (cluster indices)
  • points::AbstractMatrix: if metric is nothing, it is an $n×n$ matrix of pairwise distances between the points; otherwise it is a $d×n$ matrix of $d$-dimensional clustered data points.
  • metric::Union{SemiMetric, Nothing}: an instance of Distances Metric object or nothing, indicating the distance metric used for calculating point distances.
  • batch_size::Union{Integer, Nothing}: if an integer is given, calculate silhouettes in batches of batch_size points each; throws DimensionMismatch if batched calculation is not supported by the given metric.

References

Peter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Computational and Applied Mathematics. 20: 53–65.

Marco Gaido (2023). Distributed Silhouette Algorithm: Evaluating Clustering on Big Data

source

Variation of Information

Variation of information (also known as shared information distance) is a measure of the distance between the two clusterings. It is devised from the mutual information, but it is a true metric, i.e. it is symmetric and satisfies the triangle inequality.

Clustering.varinfoFunction
varinfo(a, b) -> Float64

Compute the variation of information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

References

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173–187.

source

V-measure

V-measure can be used to compare the clustering results with the existing class labels of data points or with the alternative clustering. It is defined as the harmonic mean of homogeneity ($h$) and completeness ($c$) of the clustering:

\[V_{\beta} = (1+\beta)\frac{h \cdot c}{\beta \cdot h + c}.\]

Both $h$ and $c$ can be expressed in terms of the mutual information and entropy measures from information theory. Homogeneity ($h$) is maximized when each cluster contains elements of as few different classes as possible. Completeness ($c$) aims to put all elements of each class in a single cluster. The $\beta$ parameter ($\beta > 0$) can be used to control the weights of $h$ and $c$ in the final measure: if $\beta > 1$, completeness has more weight, and if $\beta < 1$, homogeneity does.
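
To make the formula concrete, the sketch below computes $h$, $c$ and $V_{\beta}$ from two label vectors in plain Julia, using the definitions of Rosenberg and Hirschberg: $h = 1 - H(C|K)/H(C)$ and $c = 1 - H(K|C)/H(K)$, where $C$ are the class labels and $K$ the cluster labels. The function names are hypothetical, and treating the first argument as the classes and the second as the clusters is an assumption of this example rather than a statement about the argument order of vmeasure.

```julia
# Hypothetical helpers for illustration only (not part of Clustering.jl).

# Count how often each distinct value occurs.
function tally(xs)
    d = Dict{eltype(xs),Int}()
    for x in xs
        d[x] = get(d, x, 0) + 1
    end
    return d
end

# Shannon entropy (natural log) of a labelling given its value counts.
label_entropy(cnts, n) = -sum(c / n * log(c / n) for c in values(cnts))

function vmeasure_sketch(classes::AbstractVector{<:Integer},
                         clusters::AbstractVector{<:Integer}; β::Real = 1.0)
    n = length(classes)
    n == length(clusters) || throw(DimensionMismatch("label vectors must have the same length"))
    Hc = label_entropy(tally(classes), n)
    Hk = label_entropy(tally(clusters), n)
    Hjoint = label_entropy(tally(collect(zip(classes, clusters))), n)
    # conditional entropies: H(C|K) = H(C,K) - H(K) and H(K|C) = H(C,K) - H(C)
    h = Hc == 0 ? 1.0 : 1 - (Hjoint - Hk) / Hc
    c = Hk == 0 ? 1.0 : 1 - (Hjoint - Hc) / Hk
    return h + c == 0 ? 0.0 : (1 + β) * h * c / (β * h + c)
end
```

With classes [1, 1, 2, 2] and every point in its own cluster, the clustering is perfectly homogeneous ($h = 1$) but only half complete ($c = 0.5$), giving $V_1 = 2/3$; raising $\beta$ moves the value towards $c$.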

Clustering.vmeasureFunction
vmeasure(a, b; [β = 1.0]) -> Float64

V-measure between the two clusterings.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

The β parameter defines trade-off between homogeneity and completeness:

  • if $β > 1$, completeness is weighted more strongly,
  • if $β < 1$, homogeneity is weighted more strongly.

References

Andrew Rosenberg and Julia Hirschberg, 2007. V-Measure: A conditional entropy-based external cluster evaluation measure

source

Mutual information

Mutual information quantifies the "amount of information" obtained about one random variable through observing the other random variable. It is used in determining the similarity of two different clusterings of a dataset.

Clustering.mutualinfoFunction
mutualinfo(a, b; normed=true) -> Float64

Compute the mutual information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

If the normed parameter is true, the return value is the normalized mutual information (symmetric uncertainty); see "Data Mining: Practical Machine Learning Tools and Techniques", Witten & Frank 2005.

References

Vinh, Epps, and Bailey (2009). “Information theoretic measures for clusterings comparison”. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09.

source

Confusion matrix

Pair confusion matrix arising from two clusterings is a 2×2 contingency table representation of the partition co-occurrence, see counts.

Clustering.confusionFunction
confusion([T = Int],
           a::Union{ClusteringResult, AbstractVector},
-          b::Union{ClusteringResult, AbstractVector}) -> Matrix{T}

Calculate the confusion matrix of the two clusterings.

Returns the 2×2 confusion matrix C of type T (Int by default) that represents partition co-occurrence or similarity matrix between two clusterings a and b by considering all pairs of samples and counting pairs that are assigned into the same or into different clusters.

Treating a pair of samples assigned to the same cluster as a positive pair and a pair assigned to different clusters as a negative pair, the number of true positives is C₁₁, false negatives is C₁₂, false positives is C₂₁, and true negatives is C₂₂:

            Positive  Negative
Positive    C₁₁       C₁₂
Negative    C₂₁       C₂₂
source

Other packages

  • ClusteringBenchmarks.jl provides benchmark datasets and implements additional methods for evaluating clustering performance.
+ b::Union{ClusteringResult, AbstractVector}) -> Matrix{T}

Calculate the confusion matrix of the two clusterings.

Returns the 2×2 confusion matrix C of type T (Int by default) that represents partition co-occurrence or similarity matrix between two clusterings a and b by considering all pairs of samples and counting pairs that are assigned into the same or into different clusters.

Treating a pair of samples assigned to the same cluster as a positive pair and a pair assigned to different clusters as a negative pair, the number of true positives is C₁₁, false negatives is C₁₂, false positives is C₂₁, and true negatives is C₂₂:

            Positive  Negative
Positive    C₁₁       C₁₂
Negative    C₂₁       C₂₂
source
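
The pair counting behind this table can be spelled out with a brute-force sketch in plain Julia. The name pair_confusion_sketch is hypothetical, and the orientation used here (rows refer to clustering a, columns to clustering b) is an assumption of this example. The loop visits every pair of points, so it costs $O(n^2)$ time and is meant for illustration only.

```julia
# Hypothetical helper for illustration only (not part of Clustering.jl).
# Row 1 / column 1: the pair shares a cluster; row 2 / column 2: it does not.
function pair_confusion_sketch(a::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    n = length(a)
    n == length(b) || throw(DimensionMismatch("a and b must have the same length"))
    C = zeros(Int, 2, 2)
    for i in 1:n-1, j in i+1:n
        row = a[i] == a[j] ? 1 : 2     # positive or negative pair according to a
        col = b[i] == b[j] ? 1 : 2     # positive or negative pair according to b
        C[row, col] += 1
    end
    return C
end
```

For a = [1, 1, 2, 2] and b = [1, 1, 1, 2] there are six pairs, and the two clusterings agree on three of them (C₁₁ + C₂₂), so the agreement probability is 0.5.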

Other packages