This project focuses on performing Hierarchical Agglomerative Clustering on a dataset containing district names and their corresponding component percentages for soil. The goal is to group districts based on the similarity of their soil component percentages using different linkage methods. The project also includes plotting a dendrogram to visualize the clustering results and identifying the break line for clustering.
The dataset consists of district names and their corresponding component percentages for soil. It is assumed that the dataset has been preprocessed and contains no missing data.
-
Data Cleaning: Remove any rows with missing data from the dataset to ensure that only complete data points are considered for clustering.
-
Data Sampling: Randomly choose 155 data points from the dataset for the clustering task. This sampling step ensures that the clustering analysis is performed on a representative subset of the data.
-
Hierarchical Agglomerative Clustering: Apply Hierarchical Agglomerative Clustering to the sampled data using different linkage methods. Some commonly used linkage methods include:
- Single linkage
- Complete linkage
- Average linkage
- Ward linkage
-
Clustering Comparison: Compare the outputs of the clustering analysis using the different linkage methods. Evaluate the resulting clusters based on their similarity and dissimilarity in terms of the soil component percentages.
-
Dendrogram Visualization: Plot a dendrogram to visualize the hierarchical clustering results. The dendrogram will illustrate the hierarchical structure of the clusters and the distances between data points. Include a break line in the dendrogram to indicate the optimal number of clusters or the desired level of clustering.