doc variation of info
lindahua committed Aug 10, 2014
1 parent 9d25d68 commit c72d19c
Showing 7 changed files with 38 additions and 5 deletions.
2 changes: 1 addition & 1 deletion doc/source/affprop.rst
@@ -1,7 +1,7 @@
Affinity Propagation
======================

*Affinity propagation* is a clustering algorithm based on *message passing* between data points. Similar to *K-medoids*, it finds a subset of points as *exemplars* based on (dis)similarities, and assigns each point in the given data set to the closest exemplar.
`Affinity propagation <http://en.wikipedia.org/wiki/Affinity_propagation>`_ is a clustering algorithm based on *message passing* between data points. Similar to *K-medoids*, it finds a subset of points as *exemplars* based on (dis)similarities, and assigns each point in the given data set to the closest exemplar.

This package implements the affinity propagation algorithm based on the following paper:

2 changes: 1 addition & 1 deletion doc/source/dbscan.rst
@@ -1,7 +1,7 @@
DBSCAN
=========

*Density-based Spatial Clustering of Applications with Noise (DBSCAN)* is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm is proposed by:
`Density-based Spatial Clustering of Applications with Noise (DBSCAN) <http://en.wikipedia.org/wiki/DBSCAN>`_ is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm is proposed by:

Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu
*A density-based algorithm for discovering clusters in large spatial databases with noise.*
2 changes: 1 addition & 1 deletion doc/source/kmeans.rst
@@ -1,7 +1,7 @@
K-means
==========

*K-means* is a classic method for clustering or vector quantization. The K-means algorithm produces a fixed number of clusters, each associated with a *center* (also known as a *prototype*), and each sample belongs to the cluster with the nearest center.
`K-means <http://en.wikipedia.org/wiki/K_means>`_ is a classic method for clustering or vector quantization. The K-means algorithm produces a fixed number of clusters, each associated with a *center* (also known as a *prototype*), and each sample belongs to the cluster with the nearest center.

From a mathematical standpoint, K-means is a coordinate descent algorithm that solves the following optimization problem:

2 changes: 1 addition & 1 deletion doc/source/kmedoids.rst
@@ -1,7 +1,7 @@
K-medoids
===========

*K-medoids* is a clustering algorithm that seeks a subset of points out of a given set such that the total cost or distance from each point to the closest point in the chosen subset is minimal. The points in this chosen subset are called *medoids*.
`K-medoids <http://en.wikipedia.org/wiki/K-medoids>`_ is a clustering algorithm that seeks a subset of points out of a given set such that the total cost or distance from each point to the closest point in the chosen subset is minimal. The points in this chosen subset are called *medoids*.

This package implements a K-means style algorithm instead of PAM, which is considered to be much more efficient and reliable. Particularly, the algorithm is implemented by the ``kmedoids`` function.

2 changes: 1 addition & 1 deletion doc/source/silhouette.rst
@@ -1,7 +1,7 @@
Silhouettes
=============

*Silhouettes* is a method for validating clusters of data. Particularly, it provides a quantitative way to measure how well each item lies within its cluster as opposed to others. The *Silhouette* value of a data point is defined as:
`Silhouettes <http://en.wikipedia.org/wiki/Silhouette_(clustering)>`_ is a method for validating clusters of data. Particularly, it provides a quantitative way to measure how well each item lies within its cluster as opposed to others. The *Silhouette* value of a data point is defined as:

.. math::
1 change: 1 addition & 0 deletions doc/source/validate.rst
@@ -6,3 +6,4 @@ This package provides a variety of ways to validate or evaluate clustering resul
.. toctree::

    silhouette.rst
    varinfo.rst
32 changes: 32 additions & 0 deletions doc/source/varinfo.rst
@@ -0,0 +1,32 @@
Variation of Information
==========================

`Variation of information <http://en.wikipedia.org/wiki/Variation_of_information>`_ (also known as *shared information distance*) is a measure of the distance between two clusterings. It is derived from mutual information, but it is a true metric, *i.e.* it satisfies symmetry and the triangle inequality.
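
As a reminder of the standard definition (the symbols here are illustrative notation, not taken from the package), let :math:`H` denote the entropy of a clustering's label distribution and :math:`I` the mutual information between two clusterings :math:`C_1` and :math:`C_2`:

.. math::

    VI(C_1, C_2) = H(C_1) + H(C_2) - 2 I(C_1; C_2) = H(C_1 \mid C_2) + H(C_2 \mid C_1)

The value is zero exactly when the two clusterings are identical up to a relabeling of the clusters.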

**References:**

Meila, Marina (2003).
*Comparing Clusterings by the Variation of Information.*
Learning Theory and Kernel Machines: 173–187.

This package provides the ``varinfo`` function that implements this metric:

.. function:: varinfo(k1, a1, k2, a2)

    Compute the variation of information between two assignments.

    :param k1: The number of clusters in the first clustering.
    :param a1: The assignment vector for the first clustering.
    :param k2: The number of clusters in the second clustering.
    :param a2: The assignment vector for the second clustering.

    :return: the value of variation of information.

.. function:: varinfo(R, k0, a0)

    This method takes ``R``, an instance of ``ClusteringResult``, and computes the variation of information between the clustering it represents and the one given by ``(k0, a0)``, where ``k0`` is the number of clusters in the other clustering and ``a0`` is the corresponding assignment vector.

.. function:: varinfo(R1, R2)

    This method takes ``R1`` and ``R2``, both instances of ``ClusteringResult``, and computes the variation of information between them.
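
To make the quantity concrete, here is a minimal, language-neutral sketch in Python that computes the variation of information directly from two assignment vectors. It is not the package's implementation: the name ``variation_of_information`` is hypothetical, natural logarithms are assumed, and the cluster counts ``k1`` and ``k2`` are not needed because the labels are counted from the vectors themselves.

```python
from collections import Counter
from math import log


def variation_of_information(a1, a2):
    """Variation of information between two assignment vectors.

    Hypothetical helper for illustration only; uses natural logarithms.
    a1[i] and a2[i] are the cluster labels of sample i in each clustering.
    """
    if len(a1) != len(a2):
        raise ValueError("assignment vectors must have the same length")
    n = len(a1)
    if n == 0:
        raise ValueError("assignment vectors must not be empty")

    size1 = Counter(a1)            # cluster sizes in the first clustering
    size2 = Counter(a2)            # cluster sizes in the second clustering
    joint = Counter(zip(a1, a2))   # sizes of the pairwise intersections

    # VI = H(C1 | C2) + H(C2 | C1)
    #    = -sum_ij p_ij * (log(p_ij / p_i) + log(p_ij / p_j))
    vi = 0.0
    for (i, j), nij in joint.items():
        vi -= (nij / n) * (log(nij / size1[i]) + log(nij / size2[j]))
    return vi
```

For example, ``variation_of_information([1, 1, 2, 2], [2, 2, 1, 1])`` is zero because the two clusterings differ only in labeling, while two statistically independent two-way splits such as ``[1, 1, 2, 2]`` and ``[1, 2, 1, 2]`` give ``2 * log(2)``.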
