Introduction to Big Data Analytics using Hadoop MapReduce with Python

This full-day workshop aims to give participants a detailed understanding of:

The concepts of the MapReduce programming paradigm
How this programming paradigm can leverage the characteristics of the Hadoop Distributed File System (HDFS) to process massive amounts of data
How to integrate Hadoop-based data analytics components into standard HPC workflows on Palmetto
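
As a rough illustration of the first point, the map, shuffle, and reduce phases can be sketched in plain Python without Hadoop. This is a minimal in-memory word count, not code from the workshop materials; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases over an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

On a real cluster the map and reduce functions run in parallel on the nodes that hold the data blocks, and the shuffle step is handled by the framework rather than by user code.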

By the end of the workshop, participants are expected to be able to:

Interact with data stored in HDFS
Write MapReduce programs in Python and run them on data stored in HDFS
Interact with YARN, the job scheduler in Hadoop, to find information about running MapReduce programs
Analyze YARN log information to debug and optimize MapReduce programs
Write PBS scripts to launch Hadoop jobs and extract results
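
To give a flavor of the second objective: one common way to run Python on Hadoop is Hadoop Streaming, where the mapper and reducer read lines from standard input and write tab-separated key/value lines to standard output. The sketch below assumes that approach; the workshop notebooks may organize the code differently, and the function names and the `map`/`reduce` command-line argument are hypothetical conventions used only here.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Turn each input line into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum counts per word.

    The input must already be sorted by key, which Hadoop guarantees
    for the records delivered to each reducer.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical convention: pass "map" or "reduce" as the only argument.
    stage = {"map": mapper, "reduce": reducer}[sys.argv[1]]
    for out in stage(sys.stdin):
        print(out)
```

Saved under a hypothetical name such as `wordcount.py`, the script can be checked locally before submitting a job by piping a text file through `python wordcount.py map`, then `sort`, then `python wordcount.py reduce`, which mimics the sort that Hadoop performs between the two phases.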

Having an account on Palmetto is required for this workshop.

If you need to register for a new account, please make sure that you specify the following in the new account registration form:

Account Type: Educational
Course Information: Introduction to Big Data Analytics using Hadoop MapReduce with Python
Check the box for "Jobs that require distributed large-scale storage (Hadoop)"

Information about how to get authenticated in order to interact with Clemson University's Hadoop cluster, Cypress, can be found at: https://www.palmetto.clemson.edu/hadoop/pages/userguide.html#access
