This project builds an ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables (i.e. a star schema, using distribution strategies for optimal query performance).
There are two data sources for this project, both residing on S3. Here are the S3 links for each:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from an imaginary music streaming app, based on configuration settings.
The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.
```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```
The data in a log file such as 2018-11-12-events.json consists of one JSON event record per line, describing the user, session, and song played for each action.
Using the song and event datasets, a star schema is created that is optimized for queries on song play analysis. It includes the following tables.
- songplays - records in event data associated with song plays i.e. records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users in the app
user_id, first_name, last_name, gender, level
- songs - songs in the music database
song_id, title, artist_id, year, duration
- artists - artists in the music database
artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units
start_time, hour, day, week, month, year, weekday
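To illustrate the distribution strategies mentioned above, here is a hedged sketch of how the songplays fact table might be declared in sql_queries.py, distributing on song_id and sorting on start_time. The column types and key choices are assumptions rather than the project's final design.

```python
# Hypothetical DDL sketch for the fact table; the DISTKEY/SORTKEY choices
# and column types are assumptions, not the project's final design.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR DISTKEY,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```

Distributing the fact table and a dimension on the same key keeps matching rows on the same slice, which avoids data shuffling when the two are joined.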
You'll need an AWS account with the resources to provision a Redshift cluster (recommended: 4 x dc2.large nodes).
You'll also need Python 3 and Jupyter installed on your system.
In addition, you'll need the PostgreSQL Python driver, which can be obtained via pip:
```
pip install psycopg2
```
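As a quick sanity check that the driver works, the sketch below connects to the cluster using values from 'dwh.cfg' and runs a trivial query. The section and key names inside the config are assumptions based on this README.

```python
# Minimal connection sketch; the dwh.cfg section and key names are assumptions.
import configparser

import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],              # assumed key names
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
with conn.cursor() as cur:
    cur.execute("SELECT 1;")
    print(cur.fetchone())
conn.close()
```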
The project includes four files:
- create_tables.py - creates the fact and dimension tables for the star schema in Redshift.
- etl.py - loads data from S3 into staging tables on Redshift and then processes that data into the analytics (star schema) tables on Redshift.
- sql_queries.py - defines the SQL statements, which are imported into the two files above.
- dwh.cfg - a configuration file containing AWS credentials, cluster details, IAM role details, S3 paths, and data warehouse details.
- Provision a Redshift cluster within AWS using either the Quick Launch wizard, the AWS CLI, or one of the AWS SDKs (e.g. boto3 for Python); a boto3 sketch follows this step.
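If you take the SDK route, a cluster matching the recommended size can be provisioned with boto3 roughly as sketched below; the identifier, database name, credentials, and IAM role ARN are placeholders you would take from 'dwh.cfg'.

```python
# Hypothetical provisioning sketch; identifiers, credentials, and the IAM
# role ARN are placeholders, not the project's actual values.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    ClusterIdentifier="sparkify-cluster",     # placeholder
    DBName="sparkifydb",                      # placeholder
    MasterUsername="awsuser",                 # placeholder
    MasterUserPassword="<your-password>",     # placeholder
    IamRoles=["<your-iam-role-arn>"],         # placeholder
)
```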
- Update the credentials and configuration details in 'dwh.cfg' for the [AWS] and [DWH] sections.
- Execute the Jupyter notebook for creating the Redshift cluster.
- Once the cluster is created, record the cluster details in 'dwh.cfg' under the remaining sections, [CLUSTER] and [IAM_ROLE]. (Do not change the S3 settings, as they contain the source data locations.)
- Create the database tables by running the 'create_tables.py' script:
```
python create_tables.py
```
- Execute 'etl.py' to perform the data loading; the fact-table transform it runs is sketched after this step:
```
python etl.py
```
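The second half of 'etl.py' turns the staged rows into the analytics tables. As a rough sketch (the staging table and column names are assumptions about how the staging tables might be laid out), the fact table is filled with an INSERT ... SELECT that joins the event and song staging tables and keeps only NextSong events:

```python
# Hypothetical fact-table load, in the style of sql_queries.py; staging table
# and column names are assumptions.
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT
        TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
        e.user_id,
        e.level,
        s.song_id,
        s.artist_id,
        e.session_id,
        e.location,
        e.user_agent
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""
```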
- You can use the Query Editor in the AWS Redshift console to check the table schemas in your Redshift database.
- Optionally, a PostgreSQL client (or psycopg2) can be used to connect to the Sparkify database and run analytical queries afterwards. Alternatively, you can use the Redshift Query Editor to run the analytical queries.
E.g.:
i. Top 5 users ranked by their app usage:
```sql
SELECT u.first_name, u.last_name, COUNT(*) AS usage_frequency
FROM songplays sp
JOIN users u ON u.user_id = sp.user_id
GROUP BY u.first_name, u.last_name
ORDER BY usage_frequency DESC
LIMIT 5;
```
ii. Top 5 artists whose songs users listen to most frequently:
```sql
SELECT a.name AS artist_name, COUNT(*) AS hits
FROM songplays sp
JOIN artists a ON a.artist_id = sp.artist_id
GROUP BY artist_name
ORDER BY hits DESC
LIMIT 5;
```
- Delete your Redshift cluster when finished; a boto3 teardown sketch follows.
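Teardown can also be scripted with boto3; a minimal sketch (the cluster identifier is a placeholder):

```python
# Hypothetical teardown sketch; the cluster identifier is a placeholder.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")
redshift.delete_cluster(
    ClusterIdentifier="sparkify-cluster",     # placeholder
    SkipFinalClusterSnapshot=True,
)
```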