This is a small Hadoop job for filtering data from a large data set.
The code uses Maven, and the job can be built with `mvn package`.
Commands and the like for running the job are given below.
There is a Vagrant box available which has been set up as a pseudo-distributed cluster for running the Hadoop jobs.
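To give an idea of what the filtering involves, below is a minimal sketch of the kind of mapper such a job typically uses. The class name `FilterMapper` and the `FILTER_TERM` constant are illustrative only and are not taken from the actual `FilterData` source.

```java
// Hypothetical sketch of a filtering mapper -- not the actual FilterData source.
// Emits only the input lines that contain a given term; everything else is dropped.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // The term to filter on; the real job may read this from the configuration instead.
    private static final String FILTER_TERM = "example";

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Keep the line only if it contains the filter term.
        if (value.toString().contains(FILTER_TERM)) {
            context.write(value, NullWritable.get());
        }
    }
}
```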
- Install Eclipse
- Install m2e
- Get the repo
- Import the project
- Convert to a Maven project (Configure > Convert to Maven Project)
- Run as a Maven build (enter `package` as the goal)
- [local machine] cp target/hadoop-0.0.1-SNAPSHOT.jar vagrant/
- [in vagrant] cp /vagrant/hadoop-0.0.1-SNAPSHOT.jar /opt/hadoop/hadoop-filter.jar
*Note: the version number in the jar name may differ; adjust the commands accordingly.
First, log into the Vagrant box.
Next, become hduser:
- sudo -s
- su - hduser
Go to the Hadoop directory:
- cd /opt/hadoop
Start Hadoop:
- ./hadoop/bin/start-all.sh
Run the Hadoop job:
- ./hadoop/bin/hadoop jar hadoop-filter.jar FilterData /home/hduser/dataset/ /home/hduser/data-output
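In the command above, `FilterData` is the job's main (driver) class, and the two paths are the HDFS input and output directories. A minimal sketch of what such a driver might look like follows; the job name and the `FilterMapper` class (from the sketch earlier) are assumptions, not taken from the actual source.

```java
// Hypothetical sketch of the FilterData driver -- the real source may differ.
// args[0] is the HDFS input directory, args[1] the HDFS output directory,
// exactly as passed on the `hadoop jar ... FilterData <in> <out>` command line.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterData {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "filter-data");
        job.setJarByClass(FilterData.class);

        // Map-only job: the hypothetical FilterMapper drops unwanted records.
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Input and output paths come straight from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Setting the reduce task count to zero is one common way to structure a pure filtering job, since the mapper output can be written straight to the output directory; the real job may of course use reducers as well.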
A bit about Hadoop
Hadoop makes use of HDFS, the Hadoop Distributed File System. This means that Hadoop does not read and write the local filesystem; it reads and writes its own HDFS.
This is not really a problem, though, as we can interact with the filesystem using `hadoop fs` or `hadoop dfs`. These take standard bash-style commands:
- ./hadoop/bin/hadoop fs -ls /home/hduser
This will give you an `ls` of the /home/hduser directory on HDFS.
- ./hadoop/bin/hadoop fs -cat /home/hduser/blah.txt
This will cat the /home/hduser/blah.txt file on the HDFS cluster.
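If you need to read files on HDFS from inside Java code rather than from the shell, the same filesystem is reachable through Hadoop's `FileSystem` API. A small sketch, assuming the blah.txt file from the example above:

```java
// Minimal sketch of reading a file from HDFS in Java code.
// The path below is the example file from the `hadoop fs -cat` command above.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Resolves to HDFS when run on the cluster, because the default
        // filesystem in the configuration points at the namenode.
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/home/hduser/blah.txt");
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```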
Obviously, you will need to push some actual data to HDFS for the job to run on:
- ./hadoop/bin/hadoop dfs -put <local> <hdfs>
On occasion, you may want to make a file available through the distributed cache. In that case, put the file on HDFS and configure it as part of the cache.
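A rough sketch of how that can look with Hadoop's `DistributedCache` API is below; the lookup.txt file name and the mapper are illustrative, not part of this job.

```java
// Hypothetical sketch of wiring an HDFS file into the distributed cache.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    // In the driver: register a file already on HDFS as part of the cache.
    public static void configureCache(Job job) throws Exception {
        DistributedCache.addCacheFile(new URI("/home/hduser/lookup.txt"),
                job.getConfiguration());
    }

    // In the mapper: the framework copies the cached file to local disk on
    // each node, and the localized paths can be read during setup().
    public static class CacheAwareMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            // Load lookup data from cached[0] here, before any map() calls run.
        }
    }
}
```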