An implementation of the k-means algorithm using Hadoop and HDFS written in Java.
The program was developed and tested on a Windows 10 machine using hadoop-3.3.5 and a Maven project structure, with K = 3.
This project implements the k-means clustering algorithm on Hadoop using synthetic data as a sample. The data can be found at src/main/resources/data.txt and was generated by the DataGenerator.java component, biased towards 3 initial centers located at src/main/resources/centroid.txt. A visual representation of the data can be obtained by running the DataPlotter.java file.
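To illustrate what "biased towards 3 initial centers" can mean in practice, here is a minimal sketch of a generator that scatters points around a set of centers with Gaussian noise. It is not the repository's DataGenerator.java: the class name SyntheticPointsSketch, the center coordinates, the spread, the seed and the "x,y" line format are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, NOT the repository's DataGenerator.java.
class SyntheticPointsSketch {

    // Generates pointsPerCenter 2D points around each center, adding
    // Gaussian noise with the given spread (standard deviation) per axis.
    static List<double[]> generate(double[][] centers, int pointsPerCenter,
                                   double spread, long seed) {
        Random rng = new Random(seed); // fixed seed makes the output reproducible
        List<double[]> points = new ArrayList<>();
        for (double[] c : centers) {
            for (int i = 0; i < pointsPerCenter; i++) {
                double x = c[0] + rng.nextGaussian() * spread;
                double y = c[1] + rng.nextGaussian() * spread;
                points.add(new double[] {x, y});
            }
        }
        return points;
    }

    public static void main(String[] args) {
        // Assumed centers; the real ones are in src/main/resources/centroid.txt.
        double[][] centers = { {0, 0}, {10, 10}, {-10, 10} };
        for (double[] p : generate(centers, 5, 1.5, 42L)) {
            // Assumed "x,y" line format; the real data.txt format may differ.
            System.out.println(p[0] + "," + p[1]);
        }
    }
}
```

A smaller spread produces tighter, more clearly separated clusters, which generally makes convergence faster.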
- Watch this Video and follow the steps closely.
- Open the Windows cmd as an administrator.
- Navigate to the folder where you installed Hadoop, e.g. C:\hadoop-3.3.5
- Navigate to hadoop/sbin
- Type start-all.cmd to start all the Hadoop services (daemons).
- To confirm that it is working, open your browser and go to http://localhost:9870/. Keep this tab open; it will come in handy later.
- When setting environment variables, make sure JAVA_HOME and HADOOP_HOME don't contain any spaces in the path.
- Hadoop runs on Java 8 or later.
- If you are still getting errors, especially Java exceptions, try searching for them on the web.
You can install Hadoop on Ubuntu by following this article.
Put the data.txt and centroid.txt files from the resources folder in the same directory in HDFS. You can do that by opening a terminal and running:
$ hdfs dfs -copyFromLocal <path-to-data.txt> <destination-folder-in-hdfs>
$ hdfs dfs -copyFromLocal <path-to-centroid.txt> <destination-folder-in-hdfs>
1. Clone this repository and navigate to the folder:
$ git clone https://github.com/nickkatsios/MapReduce-Kmeans.git
$ cd MapReduce-Kmeans
2. Build the project using Maven:
$ mvn install
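A target folder should be generated with a MapReduce-Kmeans-1.0-SNAPSHOT.jar file inside. If the generated jar has a different name (the run command in the next step refers to KmeansTest-1.0-SNAPSHOT.jar), use the name of the file that was actually built.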
3. Run the k-means algorithm using:
$ cd target
$ hadoop jar KmeansTest-1.0-SNAPSHOT.jar gr.aueb.dmst.nickkatsios.KMeans <input-hdfs-directory> <output-hdfs-directory>
The input directory is the HDFS directory where you put your data.txt and centroid.txt files. The output directory is the base name from which the per-iteration output folders are named.
You are done. With the example data and centroid files, convergence should be reached after ~10 iterations.
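For readers who want to see what each iteration computes, the following is a single-machine sketch of the usual way k-means is split across map and reduce: the map step assigns every point to its nearest centroid, the reduce step averages the points of each cluster, and a driver repeats this until the centroids stop moving. It is plain Java with no Hadoop dependencies and is not the repository's KMeans class; the class name KMeansIterationSketch, the convergence threshold and the iteration cap are assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical single-machine sketch of the map/reduce split in k-means.
// It does not use Hadoop and is not the repository's implementation.
class KMeansIterationSketch {

    // "Map" step: index of the centroid nearest to p (squared Euclidean distance).
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centroids.length; k++) {
            double dx = p[0] - centroids[k][0];
            double dy = p[1] - centroids[k][1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) { bestDist = d; best = k; }
        }
        return best;
    }

    // "Reduce" step: each new centroid is the mean of the points assigned to it.
    static double[][] update(double[][] points, double[][] centroids) {
        int k = centroids.length;
        double[][] sums = new double[k][2];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearest(p, centroids);
            sums[c][0] += p[0];
            sums[c][1] += p[1];
            counts[c]++;
        }
        double[][] next = new double[k][2];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                next[c] = centroids[c].clone(); // empty cluster: keep the old centroid
            } else {
                next[c][0] = sums[c][0] / counts[c];
                next[c][1] = sums[c][1] / counts[c];
            }
        }
        return next;
    }

    // Driver: iterate until no centroid moves more than epsilon (assumed criterion)
    // or maxIterations is reached.
    static double[][] run(double[][] points, double[][] initial,
                          double epsilon, int maxIterations) {
        double[][] current = initial;
        for (int i = 0; i < maxIterations; i++) {
            double[][] next = update(points, current);
            double maxShift = 0;
            for (int c = 0; c < next.length; c++) {
                maxShift = Math.max(maxShift,
                        Math.hypot(next[c][0] - current[c][0], next[c][1] - current[c][1]));
            }
            current = next;
            if (maxShift < epsilon) break;
        }
        return current;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {0, 2}, {10, 10}, {10, 12} };
        double[][] initial = { {1, 1}, {9, 9} };
        for (double[] c : run(points, initial, 1e-6, 20)) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

In a Hadoop job each pass of the driver loop would correspond to one MapReduce run, which is consistent with one output folder being written per iteration.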
- In the browser tab where http://localhost:9870/ (the namenode) is running, navigate to Utilities --> Browse the file system.
- After convergence, a number of folders should have been generated, one per iteration, each containing the output of that iteration and named after the output path specified in the jar execution. Navigate to the most recent one.
- Download the part-r-00000 file and open it with a text editor. It should contain the final centers (x,y).
This project was made as an assignment for the Big Data Management Systems course at DMST AUEB.
Team members
Nikolaos Katsios 8200071
Theodoros Skondras Mexis 8200156