
k-Means

The most popular clustering algorithm, based on alternating minimization

Brief Description

k-Means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

Quick View

| Category | Usage | Mathematics | Application Field |
| --- | --- | --- | --- |
| Unsupervised Learning | Clustering | | |

Steps

  1. Initialize the cluster centers (pick centroids $\mathbf{\mu}$) => k points randomly chosen from the input data
  2. Alternating Optimization
    1. Assign each data point to the closest cluster (calculate distance)
      • Optimize $S_i$. Each $\mathbf{x}_n$ is "optimally partitioned" using its closest $\mu_i$
    2. Set the position of each cluster center to the mean of all data points belonging to that cluster (move to the new cluster center)
      • Optimize $\mu_i$. Each $\mu_i$ is "optimally computed" as the consensus (mean) within $S_i$
  3. Repeat STEP 2 until convergence
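The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation, not a production one (for that, see Scikit Learn's `KMeans`); the function name and defaults are chosen here for the example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means via alternating minimization (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # STEP 1: pick k centroids randomly from the input data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # STEP 2.1: assign each point to its closest centroid (optimize S_i)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # STEP 2.2: move each centroid to the mean of its points (optimize mu_i)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # STEP 3: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Each iteration can only decrease the total within-cluster squared distance, which is why the alternation converges (possibly to a local optimum).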

Deciding the number of clusters

  • An incorrect choice of the number of clusters will invalidate the whole process
  • Try k-means clustering with different numbers of clusters and measure the resulting sum of squared distances (the elbow method)
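One way to do this sweep is with `scipy.cluster.vq.kmeans`, which returns the distortion (mean distance of observations to their centroid) directly. The toy blob data below is hypothetical; plotting `distortions` against k and looking for the "elbow" where the curve flattens suggests a good k.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Hypothetical toy data: three well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])

# Whiten (normalize per-feature variance) as scipy recommends,
# then record the distortion for each candidate k
Xw = whiten(X)
distortions = {}
for k in range(1, 7):
    _, distortion = kmeans(Xw, k)
    distortions[k] = distortion
```

For this data the distortion drops sharply up to k = 3 and only slowly after, which is the elbow.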

When to use k-means clustering

  • Best used when the number of cluster centers can be specified up front, because the data shows a well-defined list of types

  • Don't use

    • Overlapping data => Euclidean distance does not capture the underlying structure well when clusters overlap
    • If data is noisy or full of outliers

Pros and Cons

  • Advantages
    • K-means speed > hierarchical clustering speed (if k is small)
    • K-means may produce tighter clusters than hierarchical clustering
  • Disadvantages
    • Difficulty in comparing quality of the clusters produced
    • Fixed number of clusters can make it difficult to predict what k should be (e.g. K-means Trap)
    • Strong sensitivity to outliers and noise
    • Doesn't work well with non-circular cluster shapes (non-globular clusters)
    • Often gets stuck in a local optimum

Terminology

  • Centroid
  • Elbow method - find the best K

Concepts

Distance

Distance Calculation Library

  • numpy.linalg.norm(a-b)
  • scipy.spatial.distance.euclidean(a, b)
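The two library calls above are interchangeable for pairs of points; both compute the Euclidean (L2) distance:

```python
import numpy as np
from scipy.spatial.distance import euclidean

a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])

# Euclidean distance: sqrt((0-4)^2 + (3-0)^2) = 5.0
d1 = np.linalg.norm(a - b)   # norm of the difference vector
d2 = euclidean(a, b)         # scipy's direct pairwise distance
```

`numpy.linalg.norm` is the more convenient choice inside k-means, since it vectorizes over whole arrays of points at once.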

Links

  • Tutorial
  • Wikipedia
  • Scikit Learn