Skip to content

Latest commit

 

History

History
311 lines (231 loc) · 11 KB

README.md

File metadata and controls

311 lines (231 loc) · 11 KB

Steven's Star Schema Project Implementation

GitHub last commit GitHub commit activity

Star Schema DesignRaw DataCreating the SchemaDestroy SchemasRebuilding SchemasLoad Data in BronzeTransform and Load Data in GoldBusiness OutcomesAutomated TestsSetting up the WorkflowPictures

This project is based around a bike sharing program, that allows riders to purchase a pass at a kiosk or use a mobile application to unlock a bike at stations around the city and use the bike for a specified amount of time. The bikes can be returned to the same station or another station.

We created a star schema that ingests data into a Bronze layer. The incoming data data in the Bronze layer is then manipulated by enfocing a given schema and saved into the Silver layer. The data in the Silver layer is then transformed into the fact and dimension tables as laid out in the physical database design.

Star schema project files are located within this repository. This repository is designed to run in conjunction with Databricks, and hence, will use the Databricks File System (DBFS) to store each of the medallion layers.

Star Schema Design

The various stages of the design of the star schema is shown in the various stages below.

Conceptual Database Design

Size Limit comment in pull request about bundle size changes

Logical Database Design

Size Limit comment in pull request about bundle size changes

Physical Database Design

Size Limit comment in pull request about bundle size changes

Raw Data

Within the repository, there is a folder called zips, contianing the relevant .zip files within. These zips contain the raw data in csv files.

Zip Files Contents
payments.zip payments.csv
riders.zip riders.csv
stations.zip stations.csv
trips.zip trips.csv

Creating the Schema

Within the repository, the schemas for the each of the tables in each of the layers is written out in the file SchemaCreation.py. An example of one of these schemas, is written below:

from pyspark.sql.types import *

trips_bronze_schema = StructType ([
    StructField("trip_id", StringType(), True),
    StructField("rideable_type", StringType(), True),
    StructField("started_at", StringType(), True),
    StructField("ended_at", StringType(), True),
    StructField("start_station_id", StringType(), True),
    StructField("end_station_id", StringType(), True),
    StructField("rider_id", StringType(), True),
])

Destroy Schemas and Databases

This notebook deletes any databases already located in the DBFS. An example of this is shown below:

dbutils.fs.rm("/tmp/Steven/Bronze", True)

Rebuild the Schemas and Databases

This notebook creates empty dataframes in each layer, using the schema located in the SchemaCreations.py. An example of this is shown below:

from pyspark.sql.types import *

empty_rdd = sc.emptyRDD()

bronze_trips_df = spark.createDataFrame(empty_rdd, trips_bronze_schema)
bronze_trips_df.write.format("delta").mode("overwrite").save("/tmp/Steven/Bronze/trips")

Load Data in Bronze

The notebook called Bronze does a few tasks. Below is a summary:

Pulls the zips folder from the repository to the Github folder in DBFS
!wget "https://github.com/steviedas/StarSchemaProject/raw/main/zips/payments.zip" -P "/dbfs/tmp/Steven/Github/"
Unzips the zips file into the Landing folder in DBFS
import subprocess
import glob
zip_files = glob.glob("/dbfs/tmp/Steven/Github/*.zip")
for zip_file in zip_files:
   extract_to_dir = "/dbfs/tmp/Steven/Landing"
   subprocess.call(["unzip", "-d", extract_to_dir, zip_file])
Writes all the CSV's to Bronze as Delta Format
bronze_trips_df = spark.read.format('csv').load("/tmp/Steven/Landing/trips.csv", schema = trips_bronze_schema)
bronze_trips_df.write.format("delta").mode("overwrite").save("/tmp/Steven/Bronze/trips")

Load Data in Silver

The Silver notebook loads the created delta files from tmp/Steven/Bronze and writes them to the empty delta files located in tmp/Steven/Silver

Transform and load data in Gold

The Gold notebook creates all the facts and dimensions tables as shown in Star Schema design. A list of the tables it creates is shown below:

Fact Tables Dimension Tables
fact_payments dim_bikes
fact_trips dim_dates
dim_riders
dim_stations
dim_times

Business Outcomes

The Business Outcomes notebook answers all the following Business Outcomes: Analyse how much time is spent per ride:

  • Based on date and time factors such as day of week and time of day
  • Based on which station is the starting and/or ending station
  • Based on age of the rider at time of the day
  • Based on whether the rider is a member or a casual rider

Analyse how much money is spent

  • Per month, quarter, year
  • Per member, based on the age of the rider at account start

EXTRA - Analyse how much money is spent per member

  • Based on how many rides the rider averages per month
  • Based on how many minutes the rider spends on a bike per month

Automated Tests

The AutomatedTests notebook loads the BusinessOutcomes notebook and queries the resultant dataframes. There are several assert statements to check that the business outcomes can be answered using these queries. An example assert is shown below:

# ASSERT 1
assert one_a_week_grouped_df.count() == 7, "This dataframe has an incorrect number of rows"

Setting up the Workflow

Within Databricks, to get this project to create the whole dataset automatically, follow the steps listed below:

  1. Clone the repository to your own github account (or ask to be added as a contributor).
  2. Add the repository to Databricks account (on Databricks Repos > Add Repo)
  3. Create a Workflow in Databricks (click on Workflows in the left menu bar and click create job). Give each task a name that is suitable and fill out the following:
    1. Set Type as Notebook
    2. Set Source as Git provider - click the Add a git reference and provide it with the Git repository URL and set a branch
    3. Set the Path, this should be in this format final_notebook/<NotebookName>
    4. Save the task
  4. Repeat step 3 for all the relevant notebooks. Add the notebooks located within the repository to the workflow in this order:
    1. DestroySchemasDatabases.py
    2. RebuildSchemasDatabases.py
    3. Bronze.py
    4. Silver.py
    5. Gold.py
  5. Create/Start/Attach a cluster and run the workflow. After this is done running, you should see within DBFS a file directory as shown in the DBFS File Structure picture.

Pictures

Pictures of the various file structures created and displays of the several dimension and fact tables, are shown below:

DBFS File Structure

Size Limit comment in pull request about bundle size changes

Bikes Dimension Table

Size Limit comment in pull request about bundle size changes

Dates Dimension Table

Size Limit comment in pull request about bundle size changes

Riders Dimension Table

Size Limit comment in pull request about bundle size changes

Stations Dimension Table

Size Limit comment in pull request about bundle size changes

Times Dimension Table

Size Limit comment in pull request about bundle size changes

Fact Payments Table

Size Limit comment in pull request about bundle size changes

Fact Trips Table

Size Limit comment in pull request about bundle size changes

Successfully Running Workflow

Size Limit comment in pull request about bundle size changes

Size Limit comment in pull request about bundle size changes