This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

clemsonciti/workshop-python-intro-to-hadoop

Introduction to Big Data Analytics using Hadoop MapReduce with Python

This full-day workshop aims to give participants detailed knowledge of:

  • The concepts of the MapReduce programming paradigm
  • How this paradigm leverages the characteristics of the Hadoop Distributed File System (HDFS) to process massive amounts of data
  • How to integrate Hadoop-based data analytics components into a standard HPC workflow on Palmetto

By the end of the workshop, participants are expected to be able to:

  • Interact with data stored in HDFS
  • Write MapReduce programs in Python and run them on data stored in HDFS
  • Interact with YARN, the Hadoop job scheduler, to find information about running MapReduce programs
  • Analyze YARN log information to debug and optimize MapReduce programs
  • Write PBS scripts to launch Hadoop jobs and extract results
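
To give a flavor of what a MapReduce program in Python can look like, here is a minimal word-count sketch. It is not taken from the workshop materials: the function names `mapper`, `reducer`, and `run_local` are illustrative, and the shuffle/sort step that Hadoop performs between the two phases is simulated in a single process with `sorted`, so the example runs without a cluster.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts for each word.

    Expects pairs already sorted by key, which is what the shuffle step
    between map and reduce provides.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce locally (hypothetical helper)."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    print(run_local(sample))
    # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

On a real cluster, the mapper and reducer would more likely be separate scripts that read standard input and write tab-separated key/value lines, so that a tool such as Hadoop Streaming can run them over data in HDFS; that arrangement is an assumption here, not something this README specifies.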

A Palmetto account is required for this workshop.

If you need to register for a new account, please make sure that you specify the following in the new account registration form:

  • Account Type: Educational
  • Course Information: Introduction to Big Data Analytics using Hadoop MapReduce with Python
  • Check the box for "Jobs that require distributed large-scale storage (Hadoop)"

A new account can be requested at https://citi.sites.clemson.edu/new-account

Instructions for authenticating to Cypress, Clemson University's Hadoop cluster, are available at: https://www.palmetto.clemson.edu/hadoop/pages/userguide.html#access