Calinski Harabasz Index
santiviquez committed Jan 17, 2025
1 parent 9edd30f commit d8c8b9e
Showing 1 changed file with 31 additions and 5 deletions.
36 changes: 31 additions & 5 deletions book/4-clustering.tex
@@ -41,8 +41,8 @@ \subsection{Mutual Info Score}
% ---------- rand index ----------
\clearpage
\thispagestyle{clusteringstyle}
\section{Rand index}
\subsection{Rand index}
\section{Rand Index}
\subsection{Rand Index}

The Rand Index (RI) is a clustering metric that measures the similarity between two clusterings: the labels predicted by an algorithm
and either the true labels or the labels coming from a reference clustering.
@@ -77,11 +77,37 @@ \subsection{Rand index}
RI scores even when the clusterings are significantly different.
}

% ---------- calinski harabasz score ----------
% ---------- calinski harabasz index ----------
\clearpage
\thispagestyle{clusteringstyle}
\section{CH Score}
\subsection{Calinski Harabasz Score}
\section{CH Index}
\subsection{Calinski Harabasz Index}

The Calinski–Harabasz Index (CH Index), also known as the Variance Ratio Criterion, is a clustering evaluation metric that does
not require ground-truth labels. It measures the quality of clustering by comparing the dispersion between clusters to the
dispersion within clusters.

\begin{center}
FORMULA GOES HERE
\end{center}

The CH Index is defined as the ratio of the between-cluster sum of squares (BCSS) to the within-cluster sum of squares (WCSS),
each divided by its degrees of freedom. Without this normalization, WCSS tends to shrink as clusters are added, which would
inflate the ratio; dividing by the degrees of freedom keeps scores comparable across different values of $k$.
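
As a sketch of this definition, assume $n$ samples split into $k$ clusters $C_1, \dots, C_k$, where $n_i$ is the size of
cluster $C_i$, $c_i$ is its centroid, and $c$ is the centroid of the whole dataset. These symbols are introduced here for
illustration only. Under the usual squared Euclidean formulation, the index can be written as
\[
\mathrm{CH} = \frac{\mathrm{BCSS} / (k - 1)}{\mathrm{WCSS} / (n - k)},
\]
with
\[
\mathrm{BCSS} = \sum_{i=1}^{k} n_i \, \| c_i - c \|^2
\qquad \text{and} \qquad
\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - c_i \|^2 .
\]

For a small one-dimensional illustration (made-up data), take the clusters $\{1, 3\}$ and $\{7, 9\}$, so $n = 4$ and $k = 2$.
The centroids are $c_1 = 2$, $c_2 = 8$ and $c = 5$, which gives
\[
\mathrm{BCSS} = 2 (2 - 5)^2 + 2 (8 - 5)^2 = 36,
\qquad
\mathrm{WCSS} = 1 + 1 + 1 + 1 = 4,
\qquad
\mathrm{CH} = \frac{36 / 1}{4 / 2} = 18 .
\]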

\textbf{When to use the Calinski-Harabasz Index?}

Use the CH Index when no ground-truth labels are available to validate clustering quality. It can also help choose the
number of clusters: compute the index for several values of $k$ and keep the one with the highest score.
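
One way to write this selection rule, as a sketch: let $\mathrm{CH}(k)$ denote the index of the clustering obtained with $k$
clusters, and let $K_{\max}$ be a user-chosen upper bound smaller than the number of samples (both are notation assumed here
for illustration). The selected number of clusters is then
\[
\hat{k} = \arg\max_{2 \le k \le K_{\max}} \mathrm{CH}(k) .
\]
The search starts at $k = 2$ because the index is undefined for a single cluster, where the between-cluster degrees of
freedom $k - 1$ are zero.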

\coloredboxes{
\item The CH Index does not rely on labeled data.
\item Normalizing by degrees of freedom makes scores more comparable across different values of $k$ and sample sizes.
}
{
\item The calculation assumes a Euclidean distance metric, which may limit its applicability for non-Euclidean data.
}



% ---------- contingency matrix ----------