pankajkoti/data-lake-with-spark


Summary of the Project

This project applies data lake principles using Spark and builds an ETL pipeline in Python. The ETL pipeline is mainly responsible for the following tasks:

  1. Read the raw data from the data lake in S3
  2. Process the raw data with Spark running on an AWS EMR cluster, extracting it into the modelled fact and dimension tables
  3. Once the Spark jobs finish, load the extracted tables back to S3 in Parquet format, a columnar format optimised for analytical queries

The data belongs to a demo startup called Sparkify, whose analysts want to run queries on users' song-play activity. The source files consist of song metadata and user activity logs.
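The sketch below illustrates these three steps end to end for one of the tables, assuming a PySpark job along the lines of etl.py: it reads the raw song JSON from the data lake, extracts the songs dimension table, and writes it back to S3 as Parquet. The helper names, application name, and anything not stated elsewhere in this README are illustrative assumptions, not the exact code in the repository.

    from pyspark.sql import SparkSession

    def create_spark_session():
        # EMR ships with Spark preinstalled; this simply gets or creates a session
        return SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

    def main():
        spark = create_spark_session()
        input_path = "s3a://udacity-dend/"       # raw data lake (see Project structure below)
        output_path = "s3a://sparkify-output/"   # output bucket created in the steps below

        # 1. read the raw JSON song data from the data lake
        song_df = spark.read.json(input_path + "song_data/*/*/*/*.json")

        # 2. extract one of the modelled dimension tables (songs)
        songs_table = (song_df
                       .select("song_id", "title", "artist_id", "year", "duration")
                       .dropDuplicates(["song_id"]))

        # 3. write it back to S3 as Parquet, partitioned for analytical queries
        (songs_table.write
                    .mode("overwrite")
                    .partitionBy("year", "artist_id")
                    .parquet(output_path + "songs/"))

    if __name__ == "__main__":
        main()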

Steps for running the ETL

  1. Launch an EMR cluster in AWS with Spark installed
  2. Provide your AWS credentials in the dl.cfg file (a sketch of its expected layout follows these steps); the credentials are user specific and can be generated from the AWS console
  3. In your account, create an S3 bucket named 'sparkify-output' to store the output Parquet files
  4. Run the ETL script:

    python etl.py

  5. You can verify that the data has been loaded correctly by inspecting the files in the output bucket mentioned above
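Since credential files are typically not committed, the snippet below is only a sketch of what dl.cfg might contain and how etl.py could load it; the section and key names are assumptions and may differ from the actual file.

    # Hypothetical sketch of loading credentials from dl.cfg; dl.cfg is
    # expected to look roughly like:
    #
    #   [AWS]
    #   AWS_ACCESS_KEY_ID = <your access key id>
    #   AWS_SECRET_ACCESS_KEY = <your secret access key>
    import configparser
    import os

    config = configparser.ConfigParser()
    config.read("dl.cfg")

    # Exporting the keys as environment variables lets the Hadoop S3 connector
    # used by Spark pick them up automatically when reading from / writing to S3.
    os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]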

Project structure

  1. Data directories: These files are stored in the S3 data lake at s3://udacity-dend/

    • song_data: The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. The contents of one such file look like the following:

      {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

    • log_data: The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations. The log files in this dataset are partitioned by year and month. Following is a snapshot of log_data:

    [Log Data preview image]

  2. dl.cfg - Configuration file holding the AWS credentials that the ETL uses to access AWS services

  3. etl.py - ETL script responsible for all the tasks listed in the Summary of the Project above; a sketch of how it can read the datasets described in point 1 follows this list
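Given the directory layouts above, the raw files can be read with wildcard paths, and the song dataset can be read with an explicit schema derived from the sample record shown earlier, which avoids a costly schema-inference pass over many small files. The field types and variable names below are assumptions and may not match etl.py exactly.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, IntegerType)

    spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

    # schema matching the sample song record shown above (types are assumptions)
    song_schema = StructType([
        StructField("num_songs", IntegerType()),
        StructField("artist_id", StringType()),
        StructField("artist_latitude", DoubleType()),
        StructField("artist_longitude", DoubleType()),
        StructField("artist_location", StringType()),
        StructField("artist_name", StringType()),
        StructField("song_id", StringType()),
        StructField("title", StringType()),
        StructField("duration", DoubleType()),
        StructField("year", IntegerType()),
    ])

    # song files sit three directory levels deep (track-ID prefix letters);
    # log files are partitioned by year and month
    song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json",
                              schema=song_schema)
    log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")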

Database Design

Data Modeling:

Using the song and log datasets, we create a star schema optimized for queries on song play analysis.

Following is how we modelled the data into one fact table and four dimension tables (a Spark sketch of deriving these tables is given after the list):

Fact Table:

  1. songplays - records in the log data associated with song plays, i.e. records with page NextSong
    • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables:

  1. users - users in the app
    • user_id, first_name, last_name, gender, level
  2. songs - songs in the music database
    • song_id, title, artist_id, year, duration
  3. artists - artists in the music database
    • artist_id, name, location, latitude, longitude
  4. time - timestamps of records in songplays broken down into specific units
    • start_time, hour, day, week, month, year, weekday
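Below is a sketch of how the time and songplays tables above can be derived with Spark from the raw datasets: the time columns are broken out of the log timestamp (ts, in milliseconds), and the songplays fact table resolves song_id and artist_id by joining log events with the song metadata on title and artist name. The join condition, column renames, and output path here are assumptions and may not match etl.py exactly.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

    songs_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
    logs_df = (spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
                    .where(F.col("page") == "NextSong"))   # keep only song plays

    # time dimension: 'ts' is a Unix timestamp in milliseconds
    time_table = (logs_df
                  .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
                  .select("start_time",
                          F.hour("start_time").alias("hour"),
                          F.dayofmonth("start_time").alias("day"),
                          F.weekofyear("start_time").alias("week"),
                          F.month("start_time").alias("month"),
                          F.year("start_time").alias("year"),
                          F.dayofweek("start_time").alias("weekday"))
                  .dropDuplicates(["start_time"]))

    # songplays fact table: join log events to song metadata to resolve the
    # song_id and artist_id foreign keys
    songplays_table = (logs_df
                       .join(songs_df,
                             (logs_df.song == songs_df.title) &
                             (logs_df.artist == songs_df.artist_name),
                             "left")
                       .withColumn("songplay_id", F.monotonically_increasing_id())
                       .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
                       .select("songplay_id", "start_time",
                               F.col("userId").alias("user_id"), "level",
                               "song_id", "artist_id",
                               F.col("sessionId").alias("session_id"),
                               "location",
                               F.col("userAgent").alias("user_agent")))

    # write back to S3 as Parquet, partitioned by year and month for efficient queries
    (songplays_table
     .withColumn("year", F.year("start_time"))
     .withColumn("month", F.month("start_time"))
     .write.mode("overwrite")
     .partitionBy("year", "month")
     .parquet("s3a://sparkify-output/songplays/"))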
