
k-Means

The most popular clustering algorithm, based on alternating minimization

Brief Description

k-Means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

Quick View

| Category | Usage | Mathematics | Application Field |
| --- | --- | --- | --- |
| Unsupervised Learning | Clustering | | |

Steps

  1. Initialize the cluster centers (pick centroids $\mathbf{\mu}$) => k points randomly chosen from the input data
  2. Alternating Optimization
    1. Assign each data point to the closest cluster (calculate distance)
      • Optimize $S_i$. Each $\mathbf{x}_n$ is "optimally partitioned" using its closest $\mu_i$
    2. Set the position of each cluster center to the mean of all data points belonging to that cluster (move to the new cluster center)
      • Optimize $\mu_i$. Each $\mu_i$ is "optimally computed" as the consensus (mean) within $S_i$
  3. Repeat STEP 2 until convergence
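The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation, not a production one (for that, see Scikit Learn's `KMeans`); the function name and defaults are chosen here for the example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means via alternating minimization (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # STEP 1: pick k centroids randomly from the input data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # STEP 2.1: assign each point to its closest centroid (optimize S_i)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # STEP 2.2: move each centroid to the mean of its points (optimize mu_i)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # STEP 3: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Each iteration can only decrease the total within-cluster squared distance, which is why the alternation converges (possibly to a local optimum).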

Deciding the number of clusters

  • An incorrect choice of the number of clusters will invalidate the whole process
  • Try k-means clustering with different numbers of clusters and measure the resulting sum of squared distances (the elbow method)
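One way to do this sweep is with `scipy.cluster.vq.kmeans`, which returns the distortion (mean distance of observations to their centroid) directly. The toy blob data below is hypothetical; plotting `distortions` against k and looking for the "elbow" where the curve flattens suggests a good k.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Hypothetical toy data: three well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])

# Whiten (normalize per-feature variance) as scipy recommends,
# then record the distortion for each candidate k
Xw = whiten(X)
distortions = {}
for k in range(1, 7):
    _, distortion = kmeans(Xw, k)
    distortions[k] = distortion
```

For this data the distortion drops sharply up to k = 3 and only slowly after, which is the elbow.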

When to use k-means clustering

  • Best used when the number of cluster centers can be specified up front, because the data shows a well-defined list of types

  • Don't use

    • Overlapping data => Euclidean distance does not capture the underlying structure well when clusters overlap
    • If data is noisy or full of outliers

Pros and Cons

  • Advantages
    • K-means speed > hierarchical clustering speed (if k is small)
    • K-means may produce tighter clusters than hierarchical clustering
  • Disadvantages
    • Difficulty in comparing quality of the clusters produced
    • Fixed number of clusters can make it difficult to predict what k should be (e.g. K-means Trap)
    • Strong sensitivity to outliers and noise
    • Doesn't work well with non-circular cluster shapes (non-globular clusters)
    • Often gets stuck in a local optimum

Terminology

  • Centroid
  • Elbow method - find the best K

Concepts

Distance

Distance Calculation Library

  • numpy.linalg.norm(a-b)
  • scipy.spatial.distance.euclidean(a, b)
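The two library calls above are interchangeable for pairs of points; both compute the Euclidean (L2) distance:

```python
import numpy as np
from scipy.spatial.distance import euclidean

a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])

# Euclidean distance: sqrt((0-4)^2 + (3-0)^2) = 5.0
d1 = np.linalg.norm(a - b)   # norm of the difference vector
d2 = euclidean(a, b)         # scipy's direct pairwise distance
```

`numpy.linalg.norm` is the more convenient choice inside k-means, since it vectorizes over whole arrays of points at once.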

Links

  • Tutorial
  • Wikipedia
  • Scikit Learn