---
title: 'CluSim: a python package for calculating clustering similarity'
tags:
authors:
affiliations:
date: 13 March 2019
bibliography: paper.bib
---
Clustering is a primary method to reveal the structure of data [@Jain1999clustering]. To understand, evaluate, and leverage data clusterings, we need to quantitatively compare them. Clustering comparison is the basis for method evaluation, consensus clustering, and tracking the temporal evolution of clusters, among many other tasks. For instance, the evaluation of a clustering method is usually achieved by comparing the method's result to a planted reference clustering, assuming that the more similar the method's solution is to the reference clustering, the better the method. Despite the importance of clustering comparison, no consensus has been reached for a standardized assessment; each similarity measure rewards and penalizes different criteria, sometimes producing contradictory conclusions.
Clustering similarity measures can be classified by the type of clustering they compare: i) partitions, which group elements into non-overlapping clusters; ii) hierarchical clusterings, which group elements into a nested series of partitions (a.k.a. a dendrogram); or iii) overlapping clusterings, in which elements can belong to multiple clusters. One approach to aid the interpretation of a similarity score is to establish a baseline in the context of a random ensemble of clusterings. Such a correction procedure requires two choices: a model for random clusterings, and how clusterings are drawn from that model. With few exceptions, similarity measures are only designed to compare clusterings of the same type, and the decisions required for the correction procedure are usually ignored or relegated to the status of technical trivialities [@Gates2017impact].
Here, we introduce CluSim, a python package providing a unified library of over 20 clustering similarity measures for partitions, dendrograms, and overlapping clusterings. To our knowledge, this package constitutes the first collection of clustering similarity measures for all three clustering types, together with extended access to random models of clusterings [@Gates2018element]. We illustrate the use of the package through two examples: an evaluation of measure behavior under variation in three clustering properties (membership, cluster sizes, and number of clusters), and a comparison of clusterings of gene expression data in the context of different random models.
The basic class in the CluSim package is a Clustering: an assignment of labeled elements (e.g., data points or network vertices) into clusters (the groups). Hierarchical Clusterings additionally contain a dendrogram, or more generally an acyclic graph, capturing the nested structure of the clusters. In CluSim, a Clustering can be instantiated from seven common formats, including full support for the scipy, scikit-learn, and dendropy clustering formats [@scikitlearn; @Sukumaran2010dendropy].
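For instance, a Clustering can be built directly from a membership mapping. The sketch below follows the constructor style shown in the package's README; the `elm2clu_dict` keyword and the `print_clustering` helper are taken from there, but should be checked against the installed version:

```python
from clusim.clustering import Clustering, print_clustering

# Map each labeled element to the list of clusters containing it;
# a list with more than one entry encodes overlapping membership.
elm2clu_dict = {'e1': [0], 'e2': [0], 'e3': [0],
                'e4': [1], 'e5': [2], 'e6': [2]}
clu = Clustering(elm2clu_dict=elm2clu_dict)
print_clustering(clu)
```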
CluSim provides more than 20 clustering similarity and distance measures for the comparison between two Clusterings. All similarity measures produce a score in the range $[0, 1]$, where $1$ indicates identical clusterings and $0$ indicates maximally dissimilar clusterings.
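A minimal comparison of two partitions might look like the following sketch. The measure functions shown (`jaccard_index`, `rand_index`, `nmi`, and `element_sim`) follow the `clusim.sim` module as we understand it from the package documentation; other measures use the same two-argument calling pattern:

```python
from clusim.clustering import Clustering
import clusim.sim as sim

c1 = Clustering(elm2clu_dict={0: [0], 1: [0], 2: [0], 3: [1], 4: [1], 5: [1]})
c2 = Clustering(elm2clu_dict={0: [0], 1: [0], 2: [1], 3: [1], 4: [2], 5: [2]})

# Pair-counting and information-theoretic measures, all scores in [0, 1]
print(sim.jaccard_index(c1, c2))
print(sim.rand_index(c1, c2))
print(sim.nmi(c1, c2))          # normalized mutual information
print(sim.element_sim(c1, c2))  # element-centric similarity [@Gates2018element]
```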
The clustering similarity measures presented here differ in how each evaluates the trade-offs between variation in three primary characteristics of clusterings: the grouping of elements into clusters, the number of clusters, and the size distribution of those clusters [@Gates2018element]. To illustrate these trade-offs, we present three simple examples in Fig. 1. In the first example, 1,024 elements are grouped into 32 clusters of equal size and compared against a similar clustering in which a fraction of the elements is randomly exchanged between the clusters, keeping the same cluster sizes. As seen in Fig. 1a), all similarity measures decrease as the fraction of shuffled elements increases, but the measures differ in whether they can differentiate between clusterings that are completely random or exhibit a discontinuous jump in the similarity value. In the second example, 1,024 elements are grouped into 32 clusters of equal size and compared against a similar clustering with increasing cluster size skewness. As seen in Fig. 1b), some similarity measures decrease as the cluster size heterogeneity increases, while others increase. Finally, in the third example, 1,024 elements are grouped into 8 clusters of equal size and compared against a similar clustering with an increasing number of equal-sized clusters. As seen in Fig. 1c), most similarity measures decrease as the number of clusters increases, but the normalized mutual information increases. For more details and an extended interpretation of these experiments, see the analysis in Gates et al. (2019) [@Gates2018element]. Ultimately, the practitioner should choose a clustering similarity measure that is sensitive to the relevant features of the clusterings for the problem at hand.
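The first experiment is straightforward to reproduce in outline. In the sketch below, the partition construction and membership shuffle are written in plain NumPy as our own illustration, not as part of the CluSim API; only `Clustering` and `sim.jaccard_index` are assumed from the package:

```python
import numpy as np
from clusim.clustering import Clustering
import clusim.sim as sim

rng = np.random.default_rng(42)

def equal_clustering(n_elements=1024, n_clusters=32):
    """Assign elements to equal-sized clusters, as in the Fig. 1a setup."""
    return np.repeat(np.arange(n_clusters), n_elements // n_clusters)

def shuffle_fraction(labels, fraction, rng):
    """Permute the memberships of a random fraction of elements among
    themselves, which preserves the overall cluster sizes."""
    labels = labels.copy()
    idx = rng.choice(labels.size, size=int(fraction * labels.size),
                     replace=False)
    labels[idx] = rng.permutation(labels[idx])
    return labels

def to_clustering(labels):
    return Clustering(elm2clu_dict={i: [int(c)] for i, c in enumerate(labels)})

ref = equal_clustering()
for frac in [0.0, 0.25, 0.5, 0.75, 1.0]:
    shuffled = shuffle_fraction(ref, frac, rng)
    print(frac, sim.jaccard_index(to_clustering(ref), to_clustering(shuffled)))
```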
To facilitate comparisons within a set of clusterings, it is often argued that clustering similarity should be considered in the context of a random baseline [@Hubert1985adjrand; @Vinh2009nmicorrection; @Gates2017impact]. The CluSim package provides both analytic and statistical-sampling methods for calculating such a correction for chance. Analytic solutions are available for the Rand index and normalized mutual information under five random models: the permutation model, both one-sided and two-sided models for clusterings with a fixed number of clusters, and both one-sided and two-sided models for all random clusterings. See Gates & Ahn (2017) [@Gates2017impact] for detailed derivations and explanations of the differences between these random models. For all other similarity measures, the correction for chance is estimated by randomly sampling from the random ensemble of Clusterings using the provided random Clustering generators.
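For measures without an analytic solution, the correction takes the familiar form $(S - \langle S \rangle) / (S_{\max} - \langle S \rangle)$, where $\langle S \rangle$ is the expected similarity under the chosen random model. The sketch below estimates $\langle S \rangle$ for the permutation model by sampling; the helper functions are our own illustration of the procedure (CluSim ships its own random Clustering generators), and we assume the Clustering object exposes its `elm2clu_dict` mapping:

```python
import random
from statistics import mean

from clusim.clustering import Clustering
import clusim.sim as sim

random.seed(0)  # make the sampled expectation reproducible

def permute_memberships(clu):
    """Permutation model: reassign memberships to elements uniformly at
    random while keeping the cluster-size sequence fixed."""
    elements = list(clu.elm2clu_dict.keys())
    memberships = list(clu.elm2clu_dict.values())
    random.shuffle(memberships)
    return Clustering(elm2clu_dict=dict(zip(elements, memberships)))

def corrected_sim(c1, c2, measure=sim.rand_index, n_samples=100):
    """Estimate (S - <S>) / (S_max - <S>) with S_max = 1, sampling <S>
    under the permutation model applied to the first clustering."""
    expected = mean(measure(permute_memberships(c1), c2)
                    for _ in range(n_samples))
    return (measure(c1, c2) - expected) / (1.0 - expected)
```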
A typical comparison using a correction for chance is illustrated in Fig. 2. Agglomerative hierarchical clustering was applied to gene expression data from [@deSouto2008geneclustering] and compared to the true classification of cancer types using the Rand index, corrected for chance under each of the random models discussed above.
The authors would like to thank Ian Wood for thoughtful discussions, and Andre David for suggestions which significantly improved the presentation of this work.