This project applies data lake principles using Spark and builds an ETL pipeline in Python. The ETL pipeline is responsible for the following tasks:
- Read raw data from the data lake in S3
- Process the raw data with Spark running on an AWS EMR cluster, extracting it into the modelled fact and dimension tables
- Load the extracted tables back into S3 in Parquet format, a columnar format optimised for analytical queries (a minimal sketch of this flow follows this list)
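Below is a minimal sketch of what this read/transform/write flow can look like in PySpark. The exact S3 key layout under the input path and the partitioning of the output are assumptions; the bucket names come from the data description and setup steps below.

```python
from pyspark.sql import SparkSession

# Build a Spark session (on EMR, S3 connectivity is already configured)
spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# 1. Read raw JSON data from the S3 data lake
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# 2. Extract one of the modelled tables (see the schema section below)
songs_table = song_df.select("song_id", "title", "artist_id", "year", "duration") \
                     .dropDuplicates(["song_id"])

# 3. Write the result back to S3 in Parquet format
#    (partitioning by year and artist_id is one common layout, not a requirement)
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://sparkify-output/songs/")
```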
The data belongs to a demo startup called Sparkify, whose analysts want to run queries on users' song-play activity. The data files consist of song metadata and user activity logs.
To run the project:
- Launch an EMR cluster in AWS with Spark installed on it
- Provide your AWS credentials in the dl.cfg file; these are user specific and can be generated from the AWS console (see the example after this list)
- In your account, create an S3 bucket named 'sparkify-output' to store the output Parquet files
- Run the ETL:
python etl.py
- Verify that the data has loaded correctly by inspecting the Parquet files in the output bucket mentioned above
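One possible layout for dl.cfg is shown below. The section header and key names are assumptions; they should match whatever etl.py expects when it reads the file (typically via Python's configparser).

```
[AWS]
AWS_ACCESS_KEY_ID=<your access key id>
AWS_SECRET_ACCESS_KEY=<your secret access key>
```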
Data directories: The input files are stored in the S3 data lake at:
s3://udacity-dend/
song_data: The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and that song's artist. The files are partitioned by the first three letters of each song's track ID. The contents of one such file look like the following:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
log_data: The second dataset consists of JSON log files generated by an event simulator based on the songs in the dataset above; they simulate activity logs from a music streaming app according to specified configurations. The log files you'll be working with are partitioned by year and month. The sketch below shows how they can be read and transformed.
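The following sketch reads the log files and derives the users and time tables, reusing the Spark session from the earlier sketch. The field names used here (ts, page, userId, firstName, lastName, gender, level) are assumptions about the event-log schema.

```python
from pyspark.sql import functions as F

# Read the activity logs and keep only song plays (page == 'NextSong')
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json") \
              .filter(F.col("page") == "NextSong")

# users dimension table
users_table = log_df.selectExpr("userId as user_id", "firstName as first_name",
                                "lastName as last_name", "gender", "level") \
                    .dropDuplicates(["user_id"])

# time dimension table: convert the epoch-millisecond 'ts' field to a timestamp
time_table = (log_df
              .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
              .select("start_time",
                      F.hour("start_time").alias("hour"),
                      F.dayofmonth("start_time").alias("day"),
                      F.weekofyear("start_time").alias("week"),
                      F.month("start_time").alias("month"),
                      F.year("start_time").alias("year"),
                      F.dayofweek("start_time").alias("weekday"))
              .dropDuplicates(["start_time"]))
```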
Project files:
dl.cfg - Configuration file holding the AWS credentials that the ETL uses to access AWS services
etl.py - ETL script responsible for all the tasks described in the project summary above
Using the song and log datasets, we create a star schema optimized for song play analysis queries.
The data is modelled as one fact table and four dimension tables:
- songplays - records in the log data associated with song plays, i.e. records where page is NextSong (see the sketch after this list for how this table is built)
- songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users in the app
- user_id, first_name, last_name, gender, level
- songs - songs in the music database
- song_id, title, artist_id, year, duration
- artists - artists in the music database
- artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units
- start_time, hour, day, week, month, year, weekday
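The songplays fact table can be assembled by joining the filtered log events with the song metadata from the earlier sketches, then writing the result to S3 partitioned by year and month. The log field names used for the join and selection (song, artist, sessionId, userAgent, location) are assumptions about the event-log schema.

```python
from pyspark.sql import functions as F

# Match each play event to its song/artist metadata (a left join keeps unmatched plays)
songplays_table = (log_df
    .join(song_df, (log_df.song == song_df.title) &
                   (log_df.artist == song_df.artist_name), "left")
    .withColumn("songplay_id", F.monotonically_increasing_id())
    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
    .selectExpr("songplay_id", "start_time", "userId as user_id", "level",
                "song_id", "artist_id", "sessionId as session_id",
                "location", "userAgent as user_agent"))

# Partition the output by year and month for efficient time-based queries
(songplays_table
    .withColumn("year", F.year("start_time"))
    .withColumn("month", F.month("start_time"))
    .write.mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3a://sparkify-output/songplays/"))
```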