Calinski Harabasz Index

NannyML · Jan 17, 2025 · d8c8b9e
1 parent 9edd30f
commit d8c8b9e
Showing 1 changed file with 31 additions and 5 deletions.
diff --git a/book/4-clustering.tex b/book/4-clustering.tex
@@ -41,8 +41,8 @@ \subsection{Mutual Info Score}
 % ---------- rand index ----------
 \clearpage
 \thispagestyle{clusteringstyle}
-\section{Rand index}
-\subsection{Rand index}
+\section{Rand Index}
+\subsection{Rand Index}
 
 The Rand Index (RI) is a clustering metric that measures the similarity between two clusterings using the predicted labels generated by an algorithm
 and the true labels or labels coming from a reference clustering.
@@ -77,11 +77,37 @@ \subsection{Rand index}
     RI scores even when the clusterings are significantly different.
 }
 
-% ---------- calinski harabasz score ----------
+% ---------- calinski harabasz index ----------
 \clearpage
 \thispagestyle{clusteringstyle}
-\section{CH Score}
-\subsection{Calinski Harabasz Score}
+\section{CH Index}
+\subsection{Calinski Harabasz Index}
+
+The Calinski–Harabasz Index (CH Index), also known as the Variance Ratio Criterion, is a clustering evaluation metric that does
+not require ground-truth labels. It measures the quality of clustering by comparing the dispersion between clusters to the
+dispersion within clusters.
+
+\begin{center}
+    $\displaystyle \mathrm{CH} = \frac{\mathrm{BCSS} / (k - 1)}{\mathrm{WCSS} / (n - k)}$
+\end{center}
+
+The CH Index is the ratio of the between-cluster dispersion (BCSS) to the within-cluster dispersion (WCSS), each divided by its
+degrees of freedom: $k - 1$ for BCSS and $n - k$ for WCSS, where $k$ is the number of clusters and $n$ is the number of samples.
+This normalization keeps scores comparable across different values of $k$ and avoids artificially inflating the score for higher cluster counts. Higher values indicate better-separated, more compact clusters.
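+
+As a sketch of how the two dispersions are commonly written, assuming squared Euclidean distances to centroids (the symbols
+$C_j$, $n_j$, $c_j$, and $c$ are notation introduced here for illustration): let $C_j$ be the $j$-th of the $k$ clusters,
+$n_j$ its size, $c_j$ its centroid, and $c$ the mean of all samples. Then
+\[
+    \mathrm{BCSS} = \sum_{j=1}^{k} n_j \, \| c_j - c \|^2,
+    \qquad
+    \mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \| x - c_j \|^2 .
+\]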
+
+\textbf{When to use the Calinski–Harabasz Index?}
+
+Use the CH Index to validate clustering quality when no ground-truth labels are available. It can also guide the choice of the
+number of clusters: compute the index for several candidate values of $k$ and pick the one with the highest score.
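+
+As a toy illustration with made-up one-dimensional data (not taken from any real dataset), consider the $n = 6$ points
+$1, 2, 3, 7, 8, 9$, whose overall mean is $5$. Each between-cluster term below is a cluster size times the squared distance
+from the cluster mean to the overall mean; each within-cluster term is a squared distance from a point to its cluster mean.
+For $k = 2$ with clusters $\{1, 2, 3\}$ and $\{7, 8, 9\}$ (means $2$ and $8$):
+\[
+    \mathrm{BCSS} = 3(2 - 5)^2 + 3(8 - 5)^2 = 54, \qquad
+    \mathrm{WCSS} = (1 + 0 + 1) + (1 + 0 + 1) = 4, \qquad
+    \mathrm{CH} = \frac{54 / (2 - 1)}{4 / (6 - 2)} = 54 .
+\]
+For $k = 3$ with clusters $\{1, 2, 3\}$, $\{7, 8\}$, and $\{9\}$ (means $2$, $7.5$, and $9$):
+\[
+    \mathrm{BCSS} = 3(2 - 5)^2 + 2(7.5 - 5)^2 + 1(9 - 5)^2 = 55.5, \qquad
+    \mathrm{WCSS} = 2 + 0.5 + 0 = 2.5, \qquad
+    \mathrm{CH} = \frac{55.5 / (3 - 1)}{2.5 / (6 - 3)} = 33.3 .
+\]
+The index is higher for $k = 2$, so under this criterion the two-cluster solution would be preferred.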
+
+\coloredboxes{
+    \item The CH Index does not rely on labeled data.
+    \item Normalizing by degrees of freedom makes scores more comparable across different values of $k$ and sample sizes.
+}
+{
+    \item The calculation assumes a Euclidean distance metric, which may limit its applicability for non-Euclidean data.
+}
+
 
 
 % ---------- contingency matrix ----------