A data warehouse built in Amazon Redshift using a star schema. An ETL pipeline populates the warehouse for a fictional Sparkify analytics team.
- If the required Python libraries are not already installed, run pip install -r requirements.txt in a terminal.
- The first time you run the ETL, you need to create an AWS Identity and Access Management (IAM) user. Save the key and secret to enter in the configuration file dwh.cfg. The user should have the following policies:
- AmazonRedshiftReadOnlyAccess
- AdministratorAccess (AWS administration privileges). IMPORTANT NOTE: Once the Redshift cluster has been created, it is wise to remove the IAM secret from your dwh.cfg configuration file.
- Enter the IAM key and secret into the KEY and SECRET variables in dwh.cfg. Fill out the rest of dwh.cfg with the information needed to create your Redshift cluster. Don't worry about the DB_ENDPOINT and ARN variables; those values will be entered later.
- Once the dwh.cfg file is complete, in a terminal run: python 1_create_redshift.py. This creates the Redshift cluster (a rough sketch of what that involves follows this list). When the script finishes, record the Endpoint and ARN values in the dwh.cfg file.
- In a terminal run: python 2_create_tables.py. This creates the Redshift tables that are populated in the next step.
- In a terminal run: python 3_etl.py. This loads the data from the source files in AWS S3 into the Redshift analytics tables.
- In a terminal run: python 4_qa_check.py. This runs checks on the data to confirm the ETL process worked correctly.
- When you are done with the Redshift database, in a terminal run: python 5_delete_redshift.py.
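For orientation only: the actual logic lives in 1_create_redshift.py, but creating a Redshift cluster with boto3 generally looks something like the sketch below. The region, node type, cluster identifier, credentials, and role ARN shown here are placeholder assumptions, not the project's real settings.

```python
# Rough sketch only -- 1_create_redshift.py is the authoritative version.
# All identifiers, sizes, and credentials below are placeholders.
import configparser
import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")
key = config["AWS"]["KEY"]
secret = config["AWS"]["SECRET"]

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",              # assumed region
    aws_access_key_id=key,
    aws_secret_access_key=secret,
)

redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",                 # placeholder node type and count
    NumberOfNodes=4,
    DBName="dwh",                         # placeholder database name
    ClusterIdentifier="sparkify-dwh",     # placeholder identifier
    MasterUsername="dwhuser",             # placeholder credentials
    MasterUserPassword="Passw0rd123",
    IamRoles=["arn:aws:iam::123456789012:role/dwhRole"],  # placeholder role ARN
)

# Once the cluster reaches the "available" state, its endpoint (the value for
# DB_ENDPOINT in dwh.cfg) can be read back with describe_clusters.
cluster = redshift.describe_clusters(ClusterIdentifier="sparkify-dwh")["Clusters"][0]
print(cluster["Endpoint"]["Address"])
```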
The project has four main types of files:
- Python Scripts
- Jupyter Notebooks
- Configuration File
- Bash file
In the main directory are six Python scripts. Below is a breakdown of the scripts:
- 1_create_redshift.py: Creates a Redshift cluster on AWS
- 2_create_tables.py: Creates Redshift tables
- 3_etl.py: Runs the ETL process to transfer the data from the S3 source files into the Redshift tables.
- 4_qa_check.py: Runs basic QA checks on the ETL results to make sure the data looks correct
- 5_delete_redshift.py: Deletes the Redshift database and the policies assigned to it.
- sql_queries.py: Where all the SQL queries used in the Python scripts above are stored (a sketch of the general pattern follows this list).
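The authoritative statements live in sql_queries.py; as a general illustration of the pattern, the queries in a project like this usually fall into two groups: COPY statements that stage the raw S3 files, and INSERT ... SELECT statements that populate the star-schema tables. The table, column, and placeholder names below are made up for the example.

```python
# Illustrative only -- the real queries are in sql_queries.py.
# Table, column, and placeholder names here are assumptions for the example.

# Stage raw JSON files from S3 into a staging table. Redshift's COPY command
# reads straight from the bucket using an IAM role ARN.
staging_events_copy = """
    COPY staging_events
    FROM {s3_path}
    IAM_ROLE {role_arn}
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
"""

# Populate a star-schema fact table from the staged data.
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id)
    SELECT e.ts, e.user_id, e.level, s.song_id, s.artist_id
    FROM staging_events e
    JOIN staging_songs s
      ON e.song_title = s.title AND e.artist_name = s.artist_name;
"""
```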
In the subdirectory Notebooks are Jupyter Notebooks that break the Python scripts down into steps for easier debugging and for testing future process improvements.
In the main directory is the file dwh.cfg, a configuration file whose sections supply the information the Python scripts need (a sketch of how a script reads it follows this list):
- AWS: Defines the AWS Key and Secret to use for Redshift cluster creation
- CLUSTER: Holds the information for Redshift cluster creation
- IAM_ROLE: For accessing the Redshift cluster with appropriate privileges.
- S3: The paths in AWS to access the S3 source files for the ETL process.
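The section names above map directly onto how a script can read dwh.cfg with Python's standard configparser module, roughly as in the sketch below. Only the section names and the KEY, SECRET, DB_ENDPOINT, and ARN options come from this README; which section holds DB_ENDPOINT and the names of the S3 options are assumptions.

```python
# Minimal sketch of reading dwh.cfg. Option names not mentioned in this README
# (and the section that holds DB_ENDPOINT) are assumptions.
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

key = config["AWS"]["KEY"]
secret = config["AWS"]["SECRET"]
endpoint = config["CLUSTER"]["DB_ENDPOINT"]   # filled in after the cluster is created
role_arn = config["IAM_ROLE"]["ARN"]          # filled in after the cluster is created
log_data = config["S3"]["LOG_DATA"]           # placeholder option name
song_data = config["S3"]["SONG_DATA"]         # placeholder option name

print(f"Cluster endpoint: {endpoint}, role: {role_arn}")
```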
The Bash file job_create.sh runs the Python scripts used to create the Redshift tables, so you do not have to run each script manually. A sketch of what such a script typically looks like is below.
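The exact steps job_create.sh chains together should be checked in the file itself; a minimal version of such a script, assuming it runs the table-creation, ETL, and QA scripts in order, might look like this:

```bash
#!/bin/bash
# Illustrative sketch only -- see job_create.sh for the actual steps it runs.
set -e   # stop at the first failing step

python 2_create_tables.py
python 3_etl.py
python 4_qa_check.py
```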