The most popular clustering algorithm, solved through alternating minimization
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
Category | Usage | Mathematics | Application Field |
---|---|---|---|
Unsupervised Learning | Clustering | | |
- Initialize the cluster centers: pick k centroids $\mathbf{\mu}$ randomly from the input data
- Alternating optimization
    - Optimize $S_i$: attribute each data point to the closest cluster (calculate distances); each $\mathbf{x}_n$ is "optimally partitioned" using its closest $\mu_i$
    - Optimize $\mu_i$: set the position of each cluster to the mean of all data points belonging to that cluster (move to the new cluster center); each $\mu_i$ is "optimally computed" as the consensus within $S_i$
- Repeat step 2 until convergence
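The alternating steps above can be sketched with NumPy (a minimal illustration under the notes' assumptions; the function name and defaults are my own, not a library API):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means via alternating minimization (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k centroids randomly from the input data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2a: optimize S_i -- assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: optimize mu_i -- move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Repeat until convergence (centroids stop moving).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs, `kmeans(X, 2)` recovers the two groups; in general the result depends on the random initialization, which is why multiple restarts are common.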
- An incorrect choice of the number of clusters will invalidate the whole process
    - Try k-means clustering with different numbers of clusters and measure the resulting sum of squares
- Best used when the number of cluster centers is specified, thanks to a well-defined list of types shown in the data
- Don't use when
    - The data overlaps: Euclidean distance doesn't capture the underlying structure well
    - The data is noisy or full of outliers
- Advantages
- K-means is faster than hierarchical clustering (if k is small)
- K-means may produce tighter clusters than hierarchical clustering
- Disadvantages
- Difficulty in comparing quality of the clusters produced
- Fixed number of clusters can make it difficult to predict what k should be (e.g. K-means Trap)
- Strong sensitivity to outliers and noise
- Doesn't work well with non-circular cluster shapes (non-globular clusters)
- Weak ability to escape a local optimum
- Centroid
- Elbow method - find the best K
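A possible sketch of the elbow method on toy data (illustrative only; the helper function and its deterministic farthest-first initialization are my own assumptions, not part of these notes):

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50):
    # Farthest-first initialization: start at X[0], then repeatedly add the
    # point farthest from the centroids chosen so far (deterministic for the demo).
    centroids = [X[0]]
    while len(centroids) < k:
        d = np.linalg.norm(X[:, None] - np.array(centroids)[None], axis=2).min(axis=1)
        centroids.append(X[d.argmax()])
    c = np.array(centroids)
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else c[i]
                      for i in range(k)])
    # Within-cluster sum of squares ("inertia"): the quantity the elbow plot tracks.
    return float(((X - c[labels]) ** 2).sum())

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.1, size=(20, 2)) for loc in (0.0, 5.0, 10.0)])
inertias = {k: kmeans_inertia(X, k) for k in range(1, 6)}
# The inertia drops sharply up to k=3 (the true number of blobs), then flattens:
# that bend is the "elbow".
```

Plotting `inertias` against k makes the bend visible; past the elbow, extra clusters only shave off small amounts of within-cluster variance.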
Two equivalent ways to compute the Euclidean distance between two points `a` and `b` (NumPy arrays):

```python
import numpy
import scipy.spatial.distance

numpy.linalg.norm(a - b)                # Euclidean (L2) distance
scipy.spatial.distance.euclidean(a, b)  # same result via SciPy
```