pankajkoti/data-lake-with-spark


Summary of the Project

This project applies data lake principles using Spark and builds an ETL pipeline in Python. The ETL pipeline is mainly responsible for the following tasks:

  1. Read the raw data from the data lake in S3
  2. Process the raw data with Spark running on an AWS EMR cluster, extracting it into the modelled fact and dimension tables
  3. Once the Spark jobs finish, load the extracted tables back to S3 in Parquet format, a columnar format optimised for analytical queries

The data belongs to a demo startup called Sparkify, whose analysts want to run queries on users' song-play activity. The source files consist of song metadata and user activity logs.
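The sketch below illustrates these three steps end to end for one of the tables, assuming a PySpark job along the lines of etl.py: it reads the raw song JSON from the data lake, extracts the songs dimension table, and writes it back to S3 as Parquet. The helper names, application name, and anything not stated elsewhere in this README are illustrative assumptions, not the exact code in the repository.

    from pyspark.sql import SparkSession

    def create_spark_session():
        # EMR ships with Spark preinstalled; this simply gets or creates a session
        return SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

    def main():
        spark = create_spark_session()
        input_path = "s3a://udacity-dend/"       # raw data lake (see Project structure below)
        output_path = "s3a://sparkify-output/"   # output bucket created in the steps below

        # 1. read the raw JSON song data from the data lake
        song_df = spark.read.json(input_path + "song_data/*/*/*/*.json")

        # 2. extract one of the modelled dimension tables (songs)
        songs_table = (song_df
                       .select("song_id", "title", "artist_id", "year", "duration")
                       .dropDuplicates(["song_id"]))

        # 3. write it back to S3 as Parquet, partitioned for analytical queries
        (songs_table.write
                    .mode("overwrite")
                    .partitionBy("year", "artist_id")
                    .parquet(output_path + "songs/"))

    if __name__ == "__main__":
        main()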

Steps for running the ETL

  1. Launch an EMR cluster in AWS with Spark installed
  2. Provide your AWS credentials in the dl.cfg file (a sketch of its expected layout follows these steps); the credentials are user specific and can be generated from the AWS console
  3. In your account, create an S3 bucket named 'sparkify-output' to store the output Parquet files
  4. Run the ETL script:

    python etl.py

  5. You can verify that the data has been loaded correctly by inspecting the files in the output bucket mentioned above
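Since credential files are typically not committed, the snippet below is only a sketch of what dl.cfg might contain and how etl.py could load it; the section and key names are assumptions and may differ from the actual file.

    # Hypothetical sketch of loading credentials from dl.cfg; dl.cfg is
    # expected to look roughly like:
    #
    #   [AWS]
    #   AWS_ACCESS_KEY_ID = <your access key id>
    #   AWS_SECRET_ACCESS_KEY = <your secret access key>
    import configparser
    import os

    config = configparser.ConfigParser()
    config.read("dl.cfg")

    # Exporting the keys as environment variables lets the Hadoop S3 connector
    # used by Spark pick them up automatically when reading from / writing to S3.
    os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]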

Project structure

  1. Data directories: These files are stored in the S3 data lake at s3://udacity-dend/

    • song_data: The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. The contents of one such file look like the following:

      {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

    • log_data: The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations. The log files in this dataset are partitioned by year and month. Following is a snapshot of log_data:

    [Log Data preview image]

  2. dl.cfg - Configuration file holding the AWS credentials that the ETL uses to access AWS services

  3. etl.py - ETL script responsible for all the tasks listed in the Summary of the Project above; a sketch of how it can read the datasets described in point 1 follows this list
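Given the directory layouts above, the raw files can be read with wildcard paths, and the song dataset can be read with an explicit schema derived from the sample record shown earlier, which avoids a costly schema-inference pass over many small files. The field types and variable names below are assumptions and may not match etl.py exactly.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, IntegerType)

    spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

    # schema matching the sample song record shown above (types are assumptions)
    song_schema = StructType([
        StructField("num_songs", IntegerType()),
        StructField("artist_id", StringType()),
        StructField("artist_latitude", DoubleType()),
        StructField("artist_longitude", DoubleType()),
        StructField("artist_location", StringType()),
        StructField("artist_name", StringType()),
        StructField("song_id", StringType()),
        StructField("title", StringType()),
        StructField("duration", DoubleType()),
        StructField("year", IntegerType()),
    ])

    # song files sit three directory levels deep (track-ID prefix letters);
    # log files are partitioned by year and month
    song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json",
                              schema=song_schema)
    log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")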

Database Design

Data Modeling:

Using the song and log datasets, we create a star schema optimized for queries on song play analysis.

Following is how we modelled the data into one fact table and four dimension tables (a Spark sketch of deriving these tables is given after the list):

Fact Table:

  1. songplays - records in the log data associated with song plays, i.e. records with page NextSong
    • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables:

  1. users - users in the app
    • user_id, first_name, last_name, gender, level
  2. songs - songs in the music database
    • song_id, title, artist_id, year, duration
  3. artists - artists in the music database
    • artist_id, name, location, latitude, longitude
  4. time - timestamps of records in songplays broken down into specific units
    • start_time, hour, day, week, month, year, weekday
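Below is a sketch of how the time and songplays tables above can be derived with Spark from the raw datasets: the time columns are broken out of the log timestamp (ts, in milliseconds), and the songplays fact table resolves song_id and artist_id by joining log events with the song metadata on title and artist name. The join condition, column renames, and output path here are assumptions and may not match etl.py exactly.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

    songs_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
    logs_df = (spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
                    .where(F.col("page") == "NextSong"))   # keep only song plays

    # time dimension: 'ts' is a Unix timestamp in milliseconds
    time_table = (logs_df
                  .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
                  .select("start_time",
                          F.hour("start_time").alias("hour"),
                          F.dayofmonth("start_time").alias("day"),
                          F.weekofyear("start_time").alias("week"),
                          F.month("start_time").alias("month"),
                          F.year("start_time").alias("year"),
                          F.dayofweek("start_time").alias("weekday"))
                  .dropDuplicates(["start_time"]))

    # songplays fact table: join log events to song metadata to resolve the
    # song_id and artist_id foreign keys
    songplays_table = (logs_df
                       .join(songs_df,
                             (logs_df.song == songs_df.title) &
                             (logs_df.artist == songs_df.artist_name),
                             "left")
                       .withColumn("songplay_id", F.monotonically_increasing_id())
                       .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
                       .select("songplay_id", "start_time",
                               F.col("userId").alias("user_id"), "level",
                               "song_id", "artist_id",
                               F.col("sessionId").alias("session_id"),
                               "location",
                               F.col("userAgent").alias("user_agent")))

    # write back to S3 as Parquet, partitioned by year and month for efficient queries
    (songplays_table
     .withColumn("year", F.year("start_time"))
     .withColumn("month", F.month("start_time"))
     .write.mode("overwrite")
     .partitionBy("year", "month")
     .parquet("s3a://sparkify-output/songplays/"))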
