As part of the Cloud Computing course, we completed a series of bi-weekly projects on topics including Hadoop & MapReduce and Apache Spark, among others.
We developed Spark programs to perform data analytics on the 'hetrec2011-lastfm-2k' dataset, which contains social networking, tagging, and music artist listening information from a set of 2K users of the Last.fm online music system. We also developed Spark programs to perform real-time log analysis and to report the execution time of data processing with and without a cached RDD (resilient distributed dataset).
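The comparison of execution time with and without a cached RDD can be sketched roughly as follows in PySpark. This is a minimal illustration under stated assumptions, not the project's actual code: it assumes the dataset's `user_artists.dat` file is tab-separated with a header row and `userID`, `artistID`, `weight` columns, and the helper names (`parse_user_artist`, `time_action`, `main`) and the application name are hypothetical.

```python
import sys
import time


def parse_user_artist(line):
    """Parse one 'userID<TAB>artistID<TAB>weight' line.

    Returns (artist_id, listen_count), or None for the header row
    and for malformed lines. The column layout is an assumption.
    """
    fields = line.strip().split("\t")
    if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
        return None
    return int(fields[1]), int(fields[2])


def time_action(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def main(path):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="lastfm-cache-timing")  # hypothetical app name
    try:
        plays = (
            sc.textFile(path)
            .map(parse_user_artist)
            .filter(lambda record: record is not None)
        )

        def total_plays():
            return plays.map(lambda kv: kv[1]).sum()

        # Without caching, the action re-reads and re-parses the input file.
        _, uncached_seconds = time_action(total_plays)

        # cache() is lazy; the first action afterwards fills the cache.
        plays.cache()
        plays.count()
        _, cached_seconds = time_action(total_plays)

        # Example analytic: the ten most-listened artist IDs.
        top_artists = (
            plays.reduceByKey(lambda a, b: a + b)
            .takeOrdered(10, key=lambda kv: -kv[1])
        )

        print(f"uncached: {uncached_seconds:.3f}s, cached: {cached_seconds:.3f}s")
        print(top_artists)
    finally:
        sc.stop()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

The measured times depend on the input size and the cluster, so on a small file the gap between the two runs may be modest.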
We configured a Spark distribution on top of a two-node Hadoop cluster and used YARN to schedule and run Spark applications on this setup.
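Submitting an application to such a setup might look like the small launcher below. It is only a sketch: `spark-submit` with `--master yarn`, `--deploy-mode`, and `--num-executors` are standard Spark options, but the function names, the default of two executors (chosen here to mirror the two nodes), and the client deploy mode are assumptions rather than the project's actual configuration.

```python
import os
import subprocess
import sys


def build_submit_command(app_path, app_args=(), deploy_mode="client", num_executors=2):
    """Build a spark-submit command line that targets a YARN cluster."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--num-executors", str(num_executors),
        app_path,
        *app_args,
    ]


def submit(app_path, app_args=()):
    """Run spark-submit; Spark locates YARN through the Hadoop client config."""
    if "HADOOP_CONF_DIR" not in os.environ and "YARN_CONF_DIR" not in os.environ:
        raise RuntimeError("Set HADOOP_CONF_DIR or YARN_CONF_DIR before submitting to YARN")
    return subprocess.run(build_submit_command(app_path, app_args), check=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    submit(sys.argv[1], sys.argv[2:])
```

Client mode keeps the driver on the submitting machine, which is convenient for reading printed timings; cluster mode would run the driver inside YARN instead.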