Agglomerative and Divisive Hierarchical Clustering

Course Assignment for CS F415 - Data Mining @ BITS Pilani, Hyderabad Campus.

Done under the guidance of Dr. Aruna Malapati, Assistant Professor, BITS Pilani, Hyderabad Campus.

Introduction

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Agglomerative: A "bottom-up" approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: A "top-down" approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
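The greedy bottom-up procedure can be sketched in a few lines of Python. This is a minimal illustration on toy one-dimensional points, not the code in agglomerative.py; the function name `agglomerative`, the `dist` callback, and the sample data are assumptions made for the example.

```python
from itertools import combinations

def agglomerative(points, dist, linkage=min):
    """Naive agglomerative clustering; returns the merges in the order they happen.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    # Every observation starts in its own cluster (a tuple of point indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Greedy step: pick the pair of clusters with the smallest linkage distance.
        best = None
        for a, b in combinations(clusters, 2):
            d = linkage(dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two chosen clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical toy data: four points on a line.
points = [0.0, 1.0, 5.0, 6.5]
merges = agglomerative(points, dist=lambda x, y: abs(x - y))
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

The recorded merge order and merge distances are exactly the information a dendrogram draws. This version rescans every pair of clusters at each step, which is fine for illustration but slow for larger datasets.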

The main purpose of this project is to gain an in-depth understanding of how the divisive and agglomerative hierarchical clustering algorithms work.
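For the top-down direction, one possible sketch is shown here. It assumes a DIANA-style heuristic: split the cluster with the largest diameter, seed a splinter group with the point that is farthest on average from the rest, and move over any point that is closer on average to the splinter group. Whether divisive.py uses this exact rule is an assumption; the names `split`, `divisive`, and the toy distance matrix `D` are hypothetical.

```python
def avg_dist(i, group, D):
    """Average distance from point i to the members of group other than i."""
    others = [j for j in group if j != i]
    return sum(D[i][j] for j in others) / len(others) if others else 0.0

def diameter(cluster, D):
    """Largest pairwise distance inside a cluster."""
    return max((D[i][j] for i in cluster for j in cluster), default=0.0)

def split(cluster, D):
    """Split one cluster in two using a DIANA-style splinter group."""
    rest = list(cluster)
    # Seed the splinter group with the point farthest, on average, from the others.
    seed = max(rest, key=lambda i: avg_dist(i, rest, D))
    rest.remove(seed)
    splinter = [seed]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for i in list(rest):
            # Move i if it is closer, on average, to the splinter group.
            if len(rest) > 1 and avg_dist(i, splinter, D) < avg_dist(i, rest, D):
                rest.remove(i)
                splinter.append(i)
                moved = True
    return splinter, rest

def divisive(D):
    """Recursively split until every cluster is a single point; returns the splits."""
    clusters = [list(range(len(D)))]
    splits = []
    while any(len(c) > 1 for c in clusters):
        # Always split the cluster with the largest diameter next.
        target = max((c for c in clusters if len(c) > 1), key=lambda c: diameter(c, D))
        clusters.remove(target)
        a, b = split(target, D)
        clusters += [a, b]
        splits.append((sorted(a), sorted(b)))
    return splits

# Hypothetical toy data: pairwise distances between four points on a line.
points = [0.0, 1.0, 5.0, 6.5]
D = [[abs(x - y) for y in points] for x in points]
splits = divisive(D)
for a, b in splits:
    print(a, "|", b)
```

On this toy input the first split separates the two points near 0 from the two points near 6, which mirrors the last merge an agglomerative run would make on the same data.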

More on Hierarchical clustering

Data

We used the Human Gene DNA Sequence dataset, which can be found here. The dataset contains 311 gene sequences. The data can be found in the folder 'data'.

Instructions to run the scripts

Run the following commands:

Divisive clustering

python divisive.py

Agglomerative clustering

python agglomerative.py

Equations used

Maximum or complete-linkage clustering -> Max(d(a,b))
Minimum or single-linkage clustering -> Min(d(a,b))
Mean or average linkage clustering -> sum of all d(a,b)/(|A|+|B|)
Diameter of a cluster -> Max(d(x,y))

where x and y are points in the same cluster, a belongs to cluster A, and b belongs to cluster B.
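As a rough illustration, the four quantities above can be written as small Python functions. The function names and the toy clusters are assumptions for the example. The average linkage follows the formula as stated in this README, which divides by |A|+|B|; the more common textbook definition divides by |A|*|B| instead.

```python
from itertools import combinations, product

def complete_linkage(A, B, d):
    """Largest distance between a point of A and a point of B."""
    return max(d(a, b) for a, b in product(A, B))

def single_linkage(A, B, d):
    """Smallest distance between a point of A and a point of B."""
    return min(d(a, b) for a, b in product(A, B))

def average_linkage(A, B, d):
    """Sum of all cross-cluster distances divided by |A| + |B|, as in this README.

    The usual textbook form divides by |A| * |B| instead.
    """
    return sum(d(a, b) for a, b in product(A, B)) / (len(A) + len(B))

def diameter(C, d):
    """Largest distance between two points of the same cluster (0 for one point)."""
    return max((d(x, y) for x, y in combinations(C, 2)), default=0)

# Hypothetical toy clusters of one-dimensional points.
def d(x, y):
    return abs(x - y)

A, B = [0, 1], [4, 6]
print(complete_linkage(A, B, d), single_linkage(A, B, d),
      average_linkage(A, B, d), diameter(A + B, d))
```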

Pre-processing done

The input file was read sequence by sequence and stored in a dictionary whose keys are the gene sequence names and whose values are the full gene strings.

Each unique gene sequence in the dataset was then mapped to a unique integer.

Working with these integer codes instead of the raw strings reduces the storage and computational requirements.
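A minimal sketch of these pre-processing steps follows. It assumes the input is FASTA-style, with a header line starting with ">" followed by one or more lines of sequence; the actual file layout is not described here, so this format, the function names `read_sequences` and `encode`, and the sample records are all hypothetical.

```python
import io

def read_sequences(handle):
    """Read FASTA-style text into a dictionary {sequence name: full sequence}."""
    sequences = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:]
            sequences[name] = ""
        elif name is not None:
            # Sequence lines are appended to the current record.
            sequences[name] += line
    return sequences

def encode(sequences):
    """Map every distinct sequence string to its own integer ID."""
    ids = {}
    encoded = {}
    for name, seq in sequences.items():
        # setdefault assigns the next free integer the first time seq is seen.
        encoded[name] = ids.setdefault(seq, len(ids))
    return encoded, ids

# Hypothetical sample input standing in for the real data file.
sample = io.StringIO(">geneA\nACGT\nTT\n>geneB\nGGCC\n>geneC\nACGTTT\n")
sequences = read_sequences(sample)
encoded, ids = encode(sequences)
print(encoded)
```

Identical sequences receive the same integer, so pairwise distances only need to be computed once per distinct sequence and can be looked up by ID afterwards.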

Machine specs

Processor: i7-7500U

RAM: 16 GB DDR4

OS: Ubuntu 16.04 LTS

Results

Clustering was performed using the agglomerative and divisive methods, and the following dendrograms were obtained:

Agglomerative

Divisive

Group Members

Shubham Jha

Praneet Mehta

Abhinav Jain
