
DATA WAREHOUSE ON AWS REDSHIFT

Author: Tianlin He

Date: 19 Sep 2020

Tag: #Udacity #Data Engineering #AWS #RedShift #Python SDK

Project Overview

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in its app.

As Sparkify's data engineer, you are tasked with building an ETL pipeline that

  1. extracts the data from S3,
  2. stages it in Redshift, and
  3. transforms it into a set of dimensional tables for business analytics.

You can test the database and ETL pipeline by running queries given to you by Sparkify's analytics team and comparing your results with their expected results.
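The heart of the resulting star schema is a songplays fact table. For orientation, a minimal sketch of its DDL is below, written as a query string in the style of sql_queries.py; the column names come from the test queries at the end of this README, while the types and the IDENTITY surrogate key are assumptions (the authoritative definitions live in sql_queries.py):

```python
# Sketch of the songplays fact table. Column names are taken from the
# test output below; types and the IDENTITY key are assumptions.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```

Incidentally, the stride of 64 in the songplay_id values visible in the test output is characteristic of a Redshift IDENTITY column, whose generated values are unique and monotonic but not necessarily consecutive.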

Dataset

You'll be working with two datasets that reside in S3, at the following links:

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data
  • Log data JSON path: s3://udacity-dend/log_json_path.json, used to parse the log data when loading it.

Song Data

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Data

The second dataset consists of JSON log files generated by an event simulator based on the songs in the dataset above. The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset:

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

And below is an example of what the data in a log file, 2018-11-12-events.json, looks like:

[Image: preview of the event records in 2018-11-12-events.json]
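The log data is loaded with the log_json_path.json JSONPaths file listed above, which tells COPY how to map each event's fields onto the staging table's columns. A sketch of the two staging loads, again in the style of sql_queries.py; the staging table names appear in the test section below, while the {iam_role_arn} placeholder (filled from dwh.cfg) is an assumption:

```python
# Sketch of the staging COPY queries; the IAM role ARN placeholder is
# filled in from dwh.cfg at runtime.
staging_events_copy = """
COPY staging_events
FROM 's3://udacity-dend/log_data'
IAM_ROLE '{iam_role_arn}'
REGION 'us-west-2'
FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""

staging_songs_copy = """
COPY staging_songs
FROM 's3://udacity-dend/song_data'
IAM_ROLE '{iam_role_arn}'
REGION 'us-west-2'
FORMAT AS JSON 'auto';
"""
```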

Project Development

The project was developed in a local environment (macOS) and is based purely on Python SDK code.

All steps, including creating a cluster, creating and loading the tables, and testing, are pipelined in the Jupyter notebook etl.ipynb.

Overview of the steps

  1. Define queries in sql_queries.py to create empty tables, copy data from S3, and insert data into tables (see the query sketches in the sections above)
  2. Set up the AWS environment by:
    1. Creating an IAM user and taking note of its key and secret (the only step done on the AWS web interface)
    2. Copying the key and secret to dwh.cfg
    3. Creating an IAM role and a Redshift cluster with the pre-defined DWH parameters in dwh.cfg (see the boto3 sketch after this list)
    4. Copying the DWH_ENDPOINT and IAM role ARN displayed to dwh.cfg
    5. Connecting to the cluster with a command-line SQL tool
  3. Run create_tables.py to create the empty tables
  4. Run etl.py (no modification is needed) to 1) copy data from S3 into the staging tables and 2) insert data from staging into the analytics tables
  5. Test the created tables with SQL queries
  6. Clean up resources by deleting the cluster and IAM role
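Steps 2.3 and 2.4 come down to a handful of boto3 calls. A minimal sketch, assuming the dwh.cfg layout sketched in the Files section below (all section and key names there are illustrative) and that the IAM role from step 2.3 already exists with S3 read access:

```python
import configparser

import boto3

# Read the IAM user's credentials and the cluster parameters; the
# section/key names follow the assumed dwh.cfg layout shown below.
config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",
    aws_access_key_id=config.get("AWS", "KEY"),
    aws_secret_access_key=config.get("AWS", "SECRET"),
)

# Step 2.3: create the cluster from the pre-defined DWH parameters.
redshift.create_cluster(
    ClusterType=config.get("DWH", "DWH_CLUSTER_TYPE"),
    NodeType=config.get("DWH", "DWH_NODE_TYPE"),
    NumberOfNodes=int(config.get("DWH", "DWH_NUM_NODES")),
    ClusterIdentifier=config.get("DWH", "DWH_CLUSTER_IDENTIFIER"),
    DBName=config.get("DWH", "DWH_DB"),
    MasterUsername=config.get("DWH", "DWH_DB_USER"),
    MasterUserPassword=config.get("DWH", "DWH_DB_PASSWORD"),
    IamRoles=[config.get("IAM_ROLE", "ARN")],
)

# Step 2.4: check the status; once it is 'available', note the endpoint.
props = redshift.describe_clusters(
    ClusterIdentifier=config.get("DWH", "DWH_CLUSTER_IDENTIFIER")
)["Clusters"][0]
print(props["ClusterStatus"], props.get("Endpoint", {}).get("Address"))
```

With the endpoint copied into dwh.cfg, step 2.5 is a one-liner inside the notebook using ipython-sql, e.g. %sql postgresql://<user>:<password>@<DWH_ENDPOINT>:5439/<db> (all placeholders). Step 6's cleanup is the mirror image: redshift.delete_cluster(ClusterIdentifier=..., SkipFinalClusterSnapshot=True).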

Files

  • etl.ipynb
  • sql_queries.py
  • create_tables.py (no modification is needed)
  • etl.py (no modification is needed)
  • sample_dwh.cfg, a copy of dwh.cfg without the key and secret, because they ought NOT to be accessible to the public
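For orientation, a plausible layout of sample_dwh.cfg is sketched below. Every section name, key name, and value is an assumption; the README only pins down that the file holds the IAM user's key and secret, the DWH parameters, the cluster endpoint, and the IAM role ARN:

```ini
; Assumed layout of dwh.cfg / sample_dwh.cfg; all names and values are
; illustrative, not copied from the repository.
[AWS]
KEY    = <IAM user access key, step 2.1>
SECRET = <IAM user secret, step 2.1>

[DWH]
DWH_CLUSTER_TYPE       = multi-node
DWH_NUM_NODES          = 4
DWH_NODE_TYPE          = dc2.large
DWH_CLUSTER_IDENTIFIER = dwhCluster
DWH_DB                 = dwh
DWH_DB_USER            = dwhuser
DWH_DB_PASSWORD        = <password>
DWH_PORT               = 5439
DWH_ENDPOINT           = <filled in after cluster creation, step 2.4>

[IAM_ROLE]
ARN = <IAM role ARN, step 2.4>
```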

Test

1. How many rows are there in the staging tables?

```
%sql SELECT COUNT(*) FROM staging_songs
```

| count |
| ----- |
| 14896 |

```
%sql SELECT COUNT(*) FROM staging_events
```

| count |
| ----- |
| 8056 |

2. Display the first 5 rows of the songplays table

```
%sql SELECT * FROM songplays LIMIT 5
```

| songplay_id | start_time | user_id | level | song_id | artist_id | session_id | location | user_agent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2018-11-03 01:04:33 | None | free | None | None | 52 | None | None |
| 64 | 2018-11-03 17:39:13 | 15 | paid | None | None | 199 | Chicago-Naperville-Elgin, IL-IN-WI | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36" |
| 128 | 2018-11-03 18:36:39 | 15 | paid | None | None | 199 | Chicago-Naperville-Elgin, IL-IN-WI | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36" |
| 192 | 2018-11-05 06:32:52 | 49 | free | None | None | 224 | San Francisco-Oakland-Hayward, CA | Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0 |
| 256 | 2018-11-05 10:09:06 | 95 | paid | None | None | 222 | Winston-Salem, NC | "Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53" |

3. Find the most recently played record

```
%%sql
SELECT *
FROM songplays
WHERE start_time = (SELECT MAX(start_time) FROM songplays);
```

| songplay_id | start_time | user_id | level | song_id | artist_id | session_id | location | user_agent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6266 | 2018-11-30 19:54:24 | 5 | free | None | None | 985 | Detroit-Warren-Dearborn, MI | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.77.4 (KHTML, like Gecko) Version/7.0.5 Safari/537.77.4" |

4. Count the number of free and paid users

```
%sql SELECT level, COUNT(*) FROM users GROUP BY level
```

| level | count |
| ----- | ----- |
| free  | 83    |
| paid  | 22    |

About

Data Engineer (Udacity): Project 3 Data Warehouse (DWH) with Redshift on Amazon Web Services (AWS)
