- Clone this repository on your local machine and push your solution into a new one in your own GitHub account
- Please consider your git history as this will be reviewed
- Commit & push code regularly
- Use the Netflix csv file as your data source
- Once complete, please send us a link to your repository
Step 1 : Create a database to store the data using a Dimensional Modeling Design. (MSSQL / MySQL / Postgres)
Output - SQL Scripts
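
As a starting point for Step 1, here is a minimal sketch of a star-style layout. All table and column names (`fact_show`, `dim_person`, `bridge_show_person`) are hypothetical. SQLite is used only so the sketch runs without a server, so the DDL would need adapting for MSSQL, MySQL or Postgres.

```python
import sqlite3

# Hypothetical schema: names and column choices are illustrative assumptions,
# not a required design.
SCHEMA = """
CREATE TABLE dim_person (
    person_id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL UNIQUE,
    gender    TEXT
);
CREATE TABLE fact_show (
    show_id      TEXT PRIMARY KEY,
    type         TEXT NOT NULL,
    title        TEXT NOT NULL,
    date_added   TEXT,
    release_year INTEGER,
    rating       TEXT
);
-- Bridge table resolving the many-to-many link between shows and cast members.
CREATE TABLE bridge_show_person (
    show_id   TEXT    NOT NULL REFERENCES fact_show(show_id),
    person_id INTEGER NOT NULL REFERENCES dim_person(person_id),
    PRIMARY KEY (show_id, person_id)
);
"""


def create_schema(conn):
    """Run the DDL script against an open connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

The bridge table is one common way to model a multi-valued column such as `Cast`. Similar bridges could serve `Director`, `Country` and `Listed_in`.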
Step 2 : Create a Python programme that runs the above SQL scripts and loads (ETL) the data from the csv file into the database.
Output – Python programme
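
The transform part of Step 2 could look roughly like the sketch below. It assumes lowercase CSV headers and a `date_added` format such as `September 9, 2019`. Both should be checked against the real file, and the sample row is made up for illustration.

```python
import csv
import io
from datetime import datetime


def split_multi(value):
    """Split a comma-separated cell into trimmed, non-empty items."""
    return [item.strip() for item in (value or "").split(",") if item.strip()]


def parse_date_added(value):
    """Parse e.g. 'September 9, 2019'; return None when blank or unparseable."""
    text = (value or "").strip()
    if not text:
        return None
    try:
        # Assumed format; month names are parsed using the current locale.
        return datetime.strptime(text, "%B %d, %Y").date()
    except ValueError:
        return None


def transform(row):
    """Turn one raw CSV row (a dict) into typed values ready for loading."""
    year = (row.get("release_year") or "").strip()
    return {
        "show_id": row["show_id"],
        "type": row["type"],
        "title": row["title"].strip(),
        "cast": split_multi(row.get("cast")),
        "date_added": parse_date_added(row.get("date_added")),
        "release_year": int(year) if year else None,
    }


# Made-up sample data standing in for the real CSV file.
SAMPLE = (
    "show_id,type,title,cast,date_added,release_year\n"
    's1,Movie,Example Film,"Actor One, Actor Two","September 9, 2019",2017\n'
    "s2,TV Show,Example Series,,,2020\n"
)
rows = [transform(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(rows[0]["cast"], rows[0]["date_added"])
```

Returning `None` for blank or unparseable values keeps the load running and leaves those rows visible to the validation reports in Step 4.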
Step 3 : Enhance the data by adding the cast members' gender (Male / Female). Use https://www.aminer.cn/gender/api or any other source you prefer.
Output – Python programme
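
One possible shape for Step 3 is to keep the gender source behind a plain function and cache results per first name, so each name is looked up only once. The `lookup` callable and `demo_lookup` below are hypothetical stand-ins. The request and response format of whichever service you choose is not shown here.

```python
def first_name(full_name):
    """Return the first whitespace-separated token of a name, or '' if empty."""
    parts = full_name.strip().split()
    return parts[0] if parts else ""


def enrich_gender(names, lookup):
    """Map each full name to a gender using `lookup(first_name) -> str | None`.

    Results are cached per lower-cased first name so repeated names
    trigger only one lookup.
    """
    cache = {}
    result = {}
    for name in names:
        key = first_name(name).lower()
        if key not in cache:
            cache[key] = lookup(key) if key else None
        result[name] = cache[key]
    return result


# Hypothetical in-memory source used only to make the sketch runnable.
calls = []


def demo_lookup(first):
    calls.append(first)
    return {"alice": "Female", "bob": "Male"}.get(first)


genders = enrich_gender(["Alice Example", "Bob Sample", "Alice Other", "Zed Unknown"], demo_lookup)
print(genders)
```

Injecting the lookup also makes the enrichment straightforward to unit test without network access.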
Step 4 : Write SQL Scripts to validate the data loaded.
Output - Missing data report
Output - Invalid / strange data report
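
The two reports in Step 4 might be driven by queries along these lines. The `shows` table, its columns, the ISO date strings and the sample rows are all assumptions made for this sketch. It runs on SQLite for convenience, so the SQL may need small changes for your chosen database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat table with made-up rows, standing in for the loaded data.
conn.execute(
    "CREATE TABLE shows (show_id TEXT, type TEXT, director TEXT, "
    "date_added TEXT, release_year INTEGER)"
)
conn.executemany(
    "INSERT INTO shows VALUES (?, ?, ?, ?, ?)",
    [
        ("s1", "Movie", "Director A", "2019-09-09", 2017),
        ("s2", "TV Show", None, "2020-01-01", 2019),
        ("s3", "Movie", "", None, 2021),
        ("s4", "Movie", "Director B", "2015-05-01", 2018),
    ],
)

# Missing data report: count NULL or blank values per column.
MISSING_SQL = """
SELECT
    SUM(CASE WHEN director IS NULL OR TRIM(director) = '' THEN 1 ELSE 0 END),
    SUM(CASE WHEN date_added IS NULL OR TRIM(date_added) = '' THEN 1 ELSE 0 END)
FROM shows
"""

# Strange data report: titles added to Netflix before their release year.
STRANGE_SQL = """
SELECT show_id
FROM shows
WHERE date_added IS NOT NULL
  AND CAST(SUBSTR(date_added, 1, 4) AS INTEGER) < release_year
ORDER BY show_id
"""

missing = conn.execute(MISSING_SQL).fetchone()
strange = [row[0] for row in conn.execute(STRANGE_SQL)]
print(missing, strange)
```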
Step 5 : Write SQL Scripts to return the following:
- What is the most common first name among actors and actresses?
- Which Movie had the longest timespan from release to appearing on Netflix?
- Which Month of the year had the most new releases historically?
- Which year had the largest increase year on year (percentage wise) for TV Shows?
- List the actresses that have appeared in a movie with Woody Harrelson more than once.
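
For the year-on-year question, the underlying arithmetic is `(current - previous) / previous * 100`, compared across consecutive years. The sketch below shows that calculation in Python with made-up counts. In SQL the same logic is often expressed with the `LAG` window function. Whether "year" means release year or the year added to Netflix is a choice left to you.

```python
def largest_yoy_increase(counts):
    """Return (year, percent_increase) for the largest year-on-year rise.

    `counts` maps year -> number of TV Shows. Pairs of years that are not
    consecutive, or whose previous count is zero, are skipped because the
    percentage is undefined or misleading there.
    """
    best_year, best_pct = None, None
    years = sorted(counts)
    for prev, cur in zip(years, years[1:]):
        if cur != prev + 1 or counts[prev] == 0:
            continue
        pct = (counts[cur] - counts[prev]) / counts[prev] * 100
        if best_pct is None or pct > best_pct:
            best_year, best_pct = cur, pct
    return best_year, best_pct


# Made-up counts; 2018 is deliberately missing to show the gap handling.
sample = {2015: 10, 2016: 25, 2017: 40, 2019: 400}
result = largest_yoy_increase(sample)
print(result)
```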
Step 6 : Combine all the previous steps into a solid Python programme that has unit testing. Feel free to create a main file that can be run from the command line with Python.
Output – Python programme
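
A command-line entry point for Step 6 could be organised as in this sketch. The step names, the `--csv` and `--steps` options and the `plan` helper are hypothetical. The actual work of each step is left out.

```python
import argparse
import unittest

# Hypothetical step names, listed in the order they should run.
STEPS = ("schema", "load", "enrich", "validate", "report")


def build_parser():
    parser = argparse.ArgumentParser(description="Run the Netflix pipeline (sketch).")
    parser.add_argument("--csv", required=True, help="Path to the source CSV file")
    parser.add_argument(
        "--steps", nargs="+", choices=STEPS, default=list(STEPS),
        help="Subset of steps to run (default: all)",
    )
    return parser


def plan(argv):
    """Return the requested steps in pipeline order, whatever order was typed."""
    args = build_parser().parse_args(argv)
    return [step for step in STEPS if step in args.steps]


class PlanTest(unittest.TestCase):
    def test_default_runs_every_step(self):
        self.assertEqual(plan(["--csv", "data.csv"]), list(STEPS))

    def test_subset_keeps_pipeline_order(self):
        self.assertEqual(
            plan(["--csv", "data.csv", "--steps", "report", "load"]),
            ["load", "report"],
        )
```

If this lived in a module, the tests could be run with `python -m unittest <module>`. Keeping argument parsing in its own function lets it be tested without touching the database.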
Column | Type | Description |
---|---|---|
show_id | String | Unique ID for every Movie / TV Show |
type | String | Identifier - A Movie or TV Show |
Title | String | Title of the Movie / TV Show |
Director | String | Director of the Movie |
Cast | String | Actors involved in the movie / show |
Country | String | Country where the movie / show was produced |
Date_added | Date | Date it was added on Netflix |
Release_year | Integer | Actual release year of the movie / show |
Rating | String | TV Rating of the movie / show |
Duration | String | Total Duration - in minutes or number of seasons |
Listed_in | String | Genre |
description | String | The summary description |
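
Because `Duration` mixes minutes and season counts in one string column, it may help to split it into a number and a unit during loading. The sketch below assumes values shaped like `90 min`, `1 Season` or `2 Seasons`. Verify this against the actual file before relying on it.

```python
import re

# Assumed shapes: "<number> min", "<number> Season", "<number> Seasons".
_DURATION = re.compile(r"^\s*(\d+)\s*(min|seasons?)\s*$", re.IGNORECASE)


def parse_duration(value):
    """Return (amount, unit) with unit 'minutes' or 'seasons', or None if unrecognised."""
    match = _DURATION.match(value or "")
    if not match:
        return None
    unit = "minutes" if match.group(2).lower() == "min" else "seasons"
    return int(match.group(1)), unit


print(parse_duration("90 min"), parse_duration("2 Seasons"))
```

Unrecognised values return `None` instead of raising, so they can be listed in the invalid data report.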