This project builds an ELT pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables, so the analytics team can keep finding insights into what songs their users are listening to.
Cloud Data Warehouse
|____create_tables.py # database/table creation script
|____etl.py # ELT pipeline builder
|____sql_queries.py # SQL query collections
|____dwh.cfg # AWS configuration file
|____test.ipynb # notebook for testing the pipeline
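The scripts read their cluster and S3 settings from dwh.cfg. A minimal sketch of how the file might be consumed, assuming the usual CLUSTER/IAM_ROLE/S3 section layout (the section and key names below are assumptions, not confirmed by the repo):

    import configparser

    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    # Assumed layout of dwh.cfg:
    host      = config['CLUSTER']['HOST']     # Redshift cluster endpoint
    role_arn  = config['IAM_ROLE']['ARN']     # IAM role Redshift assumes for COPY
    log_data  = config['S3']['LOG_DATA']      # s3:// prefix of the event log JSON
    song_data = config['S3']['SONG_DATA']     # s3:// prefix of the song metadata JSON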
ELT pipeline builder (etl.py)
load_staging_tables
- Load raw data from S3 buckets to Redshift staging tables
insert_tables
- Transform staging-table data into the dimensional tables used for analysis
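Both functions simply loop over query lists exported by sql_queries.py. A minimal sketch of etl.py, assuming those lists are named copy_table_queries and insert_table_queries and that the [CLUSTER] section of dwh.cfg holds the connection parameters in order:

    import configparser
    import psycopg2
    from sql_queries import copy_table_queries, insert_table_queries

    def load_staging_tables(cur, conn):
        # Each query is a COPY that bulk-loads JSON from S3 into a staging table.
        for query in copy_table_queries:
            cur.execute(query)
            conn.commit()

    def insert_tables(cur, conn):
        # Each query is an INSERT ... SELECT that reshapes staging rows
        # into the fact and dimension tables.
        for query in insert_table_queries:
            cur.execute(query)
            conn.commit()

    def main():
        config = configparser.ConfigParser()
        config.read('dwh.cfg')
        conn = psycopg2.connect(
            "host={} dbname={} user={} password={} port={}".format(
                *config['CLUSTER'].values()))
        cur = conn.cursor()
        load_staging_tables(cur, conn)
        insert_tables(cur, conn)
        conn.close()

    if __name__ == "__main__":
        main()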
Staging, fact, and dimension table schema creation (create_tables.py)
drop_tables
- Drop each table if it exists, so the script can be rerun from a clean slate
create_tables
- Create the staging, fact, and dimension tables
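A sketch of the two functions, assuming sql_queries.py exports drop_table_queries and create_table_queries lists (the same loop-and-commit pattern as etl.py):

    from sql_queries import create_table_queries, drop_table_queries

    def drop_tables(cur, conn):
        # DROP TABLE IF EXISTS for every table, so reruns start clean.
        for query in drop_table_queries:
            cur.execute(query)
            conn.commit()

    def create_tables(cur, conn):
        # Create the staging, fact, and dimension tables listed below.
        for query in create_table_queries:
            cur.execute(query)
            conn.commit()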
SQL query statement collections for create_tables.py and etl.py (sql_queries.py)
*_table_drop
- DROP statements for every table
*_table_create
- CREATE statements defining the schemas listed below
staging_*_copy
- COPY statements that load the S3 data into the staging tables
*_table_insert
- INSERT statements that populate the fact and dimension tables
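A sketch of what a staging_*_copy entry might look like; the bucket paths, region, and role ARN below are placeholders, not values from this repo. staging_events typically needs a JSONPath file to map log fields onto columns, while staging_songs can usually use FORMAT AS JSON 'auto' because its field names already match the column names:

    # Placeholder S3 paths and IAM role ARN; substitute the values from dwh.cfg.
    staging_events_copy = """
        COPY staging_events
        FROM 's3://<bucket>/log_data'
        IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
        REGION 'us-west-2'
        FORMAT AS JSON 's3://<bucket>/log_json_path.json';
    """

The column definitions behind each *_table_create statement are listed below.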
staging_events
artist VARCHAR,
auth VARCHAR,
firstName VARCHAR,
gender CHAR(1),
itemInSession INT,
lastName VARCHAR,
length FLOAT,
level VARCHAR,
location TEXT,
method VARCHAR,
page VARCHAR,
registration VARCHAR,
sessionId INT,
song VARCHAR,
status INT,
ts BIGINT,
userAgent TEXT,
userId INT
staging_songs
artist_id VARCHAR,
artist_latitude FLOAT,
artist_location TEXT,
artist_longitude FLOAT,
artist_name VARCHAR,
duration FLOAT,
num_songs INT,
song_id VARCHAR,
title VARCHAR,
year INT
songplays
songplay_id INT IDENTITY(0,1),
start_time TIMESTAMP,
user_id INT,
level VARCHAR,
song_id VARCHAR,
artist_id VARCHAR,
session_id INT,
location TEXT,
user_agent TEXT
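songplay_id is an IDENTITY column, so the insert omits it and supplies the remaining columns. A sketch of the songplay_table_insert transform, assuming the usual join of events to songs on title and artist name, plus the standard Redshift idiom for converting the epoch-millisecond ts into a TIMESTAMP:

    # The join keys and the LEFT JOIN choice are assumptions about the transform.
    songplay_table_insert = """
        INSERT INTO songplays (start_time, user_id, level, song_id,
                               artist_id, session_id, location, user_agent)
        SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
               e.userId, e.level, s.song_id, s.artist_id,
               e.sessionId, e.location, e.userAgent
        FROM staging_events e
        LEFT JOIN staging_songs s
               ON e.song = s.title AND e.artist = s.artist_name
        WHERE e.page = 'NextSong';
    """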
users
user_id INT,
first_name VARCHAR,
last_name VARCHAR,
gender CHAR(1),
level VARCHAR
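The dimension loads are mostly SELECT DISTINCT projections of the staging tables. A sketch for users (songs and artists follow the same pattern against staging_songs):

    # A minimal dedup sketch; a production load might instead keep only
    # each user's most recent level.
    user_table_insert = """
        INSERT INTO users (user_id, first_name, last_name, gender, level)
        SELECT DISTINCT userId, firstName, lastName, gender, level
        FROM staging_events
        WHERE page = 'NextSong' AND userId IS NOT NULL;
    """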
songs
song_id VARCHAR,
title VARCHAR,
artist_id VARCHAR,
year INT,
duration FLOAT
artists
artist_id VARCHAR,
name VARCHAR,
location TEXT,
latitude FLOAT,
longitude FLOAT
time
start_time TIMESTAMP,
hour INT,
day INT,
week INT,
month INT,
year INT,
weekday VARCHAR
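The time table fans start_time out into date parts: EXTRACT covers the numeric columns, and since weekday is a VARCHAR, TO_CHAR can store the day name. A sketch, assuming rows are sourced from songplays so the two tables stay consistent:

    time_table_insert = """
        INSERT INTO time (start_time, hour, day, week, month, year, weekday)
        SELECT DISTINCT start_time,
               EXTRACT(hour  FROM start_time),
               EXTRACT(day   FROM start_time),
               EXTRACT(week  FROM start_time),
               EXTRACT(month FROM start_time),
               EXTRACT(year  FROM start_time),
               TO_CHAR(start_time, 'Day')
        FROM songplays;
    """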