This full-day workshop aims to provide participants with detailed knowledge of:
- The concepts of the MapReduce programming paradigm
- How this programming paradigm can leverage the characteristics of the Hadoop Distributed File System (HDFS) to process massive amounts of data
- How to integrate Hadoop-based data analytics components into a standard HPC workflow on Palmetto
By the end of the workshop, participants are expected to be able to:
- Interact with data stored in HDFS
- Write MapReduce programs in Python and run them on data stored in HDFS (a minimal sketch follows this list)
- Interact with YARN, Hadoop's resource manager and job scheduler, to find information about running MapReduce programs
- Analyze YARN log information to debug and optimize MapReduce programs
- Write PBS scripts to launch Hadoop jobs and extract results
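To give a flavor of the programming model covered in the workshop, below is a minimal word-count sketch written as a Hadoop Streaming mapper and reducer in Python. The file names (mapper.py, reducer.py) are illustrative only; the exact way these scripts are launched on Cypress is covered during the workshop.

```python
#!/usr/bin/env python
# ---- mapper.py (illustrative) ----
# Emit one (word, 1) pair per word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Hadoop Streaming expects tab-separated key/value pairs on stdout.
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# ---- reducer.py (illustrative) ----
# Sum the counts for each word. Hadoop sorts mapper output by key before
# it reaches the reducer, so all lines for a given word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Scripts like these are typically handed to the Hadoop Streaming jar through its -mapper and -reducer options, with input and output paths pointing into HDFS.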
An account on Palmetto is required for this workshop.
If you need to register for a new account, please make sure that you specify the following in the new account registration form:
- Account Type: Educational
- Course Information: Introduction to Big Data Analytics using Hadoop MapReduce with Python
- Check the box for "Jobs that require distributed large-scale storage (Hadoop)"
A new account can be requested at https://citi.sites.clemson.edu/new-account
Instructions for authenticating to Clemson University's Hadoop cluster, Cypress, are available at https://www.palmetto.clemson.edu/hadoop/pages/userguide.html#access