The objective of this project is to build a delta lake for US flights data, together with an ETL pipeline that moves the data through each table in the delta lake architecture, from raw (Bronze) to cleaned (Silver) to aggregated (Gold) tables, ending in a star schema model ready for analysts to create dashboards.
We will also simulate the process of receiving the raw data via a streaming app such as Apache Kafka.
The project follows these steps:
- Step 1: Scope the Project and Gather Data
- Step 2: Explore and Assess the Data
- Step 3: Define the Data Model
- Step 4: Run ETL to Model the Data
- Step 5: Complete Project Write Up
In this project our goal is to build a delta lake for the US flights dataset to:
- Store the raw data.
- Build a star schema data model for analysis.

For this purpose we will use Delta Lake along with PySpark, for the following reasons:
- Delta Lake is an open-source storage layer that brings reliability to data lakes.
- Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
- Delta Lake runs on top of an existing data lake and is fully compatible with Apache Spark APIs.
The Delta architecture design pattern consists of landing data in successively cleaner Delta Lake tables from raw (Bronze) to clean (Silver) to aggregate (Gold). This is also referred to as a “multi-hop” architecture. You move data through this pipeline as needed.
The end outputs are actionable insights, clean data for machine learning applications, dashboards, and reports of business metrics.
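In practice, landing data in Delta format is a one-line change from plain Parquet. As a minimal sketch (the paths below are illustrative, not the project's actual layout):

```python
from pyspark.sql import SparkSession

# Standard configs documented by Delta Lake for enabling it on a Spark
# session (requires the delta-spark package).
spark = (
    SparkSession.builder.appName("flights-delta-lake")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Land a DataFrame as a Delta table (illustrative path)...
df = spark.read.csv("data/raw/flights/", header=True, inferSchema=True)
df.write.format("delta").mode("overwrite").save("delta/bronze/flights")

# ...and read it back with the same Spark APIs used for any other format.
bronze = spark.read.format("delta").load("delta/bronze/flights")
```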
- We will simulate the process of streaming the flights dataset.
- Add metadata and move it to bronze table.
- Transform and augment the bronze table then move it to silver table.
- Create date gold table.
- Create daily summary gold table.
- Collect and transform lookup tables into silver tables.

You can see the whole flow in the following flowchart.
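In code, the first two hops of this pipeline might look like the following minimal sketch. Spark Structured Streaming can treat a folder of CSV drops as an incoming stream, which is one way to simulate Kafka-like arrival of raw files; the landing folder, checkpoint path and metadata column names are assumptions for illustration, and `flights_schema` stands in for a schema built from the data dictionary below.

```python
from pyspark.sql import functions as F

# A streaming file source requires an explicit schema; `flights_schema`
# is assumed to be built from the flights data dictionary.
raw_stream = (
    spark.readStream.schema(flights_schema)
    .option("header", True)
    .csv("data/raw/flights/")  # illustrative landing folder
)

# Add ingestion metadata, then append each micro-batch to the bronze table.
bronze_query = (
    raw_stream
    .withColumn("ingestion_time", F.current_timestamp())
    .withColumn("source_file", F.input_file_name())
    .writeStream.format("delta")
    .option("checkpointLocation", "delta/bronze/flights/_checkpoint")
    .outputMode("append")
    .start("delta/bronze/flights")
)
```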
Dataset | Source | Description |
---|---|---|
Flights Dataset from Nov. 1987 to Apr. 2008 | Data Expo 2009. The dataset is provided for download as CSV files per year, or as a zipped file for the whole dataset | Monthly data reported by US certified air carriers that account for at least one percent of domestic scheduled passenger revenues--includes scheduled and actual arrival and departure times for flights |
Flights Dataset from May 2008 to Apr. 2021 | Bureau of Transportation Statistics. You can select the columns you need and choose the year and month to download them as CSV | Monthly data reported by US certified air carriers that account for at least one percent of domestic scheduled passenger revenues--includes scheduled and actual arrival and departure times for flights. |
L_AIRPORT | link | Airport lookup table |
L_CANCELLATION | link | Cancellation code lookup table |
L_UNIQUE_CARRIERS | link | Unique carrier lookup table |
L_PLANE | link | Plane lookup table |
- Flights Dataset
col_name | data_type | Description |
---|---|---|
p_Year | int | Year |
Month | int | Month |
DayofMonth | int | Day of Month |
DayOfWeek | int | Day of Week |
DepTime | double | Actual Departure Time (local time: hhmm) |
CRSDepTime | int | CRS Departure Time (local time: hhmm) |
ArrTime | double | Actual Arrival Time (local time: hhmm) |
CRSArrTime | int | CRS Arrival Time (local time: hhmm) |
UniqueCarrier | string | Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years |
FlightNum | int | Flight Number |
TailNum | string | Tail Number |
ActualElapsedTime | double | Elapsed Time of Flight, in Minutes |
CRSElapsedTime | int | CRS Elapsed Time of Flight, in Minutes |
AirTime | string | Flight Time, in Minutes |
ArrDelay | double | Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers |
DepDelay | double | Difference in minutes between scheduled and actual departure time. Early departures show negative numbers |
Origin | string | Origin Airport |
Dest | string | Destination Airport |
Distance | double | Distance between airports (miles) |
TaxiIn | string | Taxi In Time, in Minutes |
TaxiOut | string | Taxi Out Time, in Minutes |
Cancelled | int | Cancelled Flight Indicator (1=Yes) |
CancellationCode | string | Specifies The Reason For Cancellation |
Diverted | int | Diverted Flight Indicator (1=Yes) |
CarrierDelay | string | Carrier Delay, in Minutes |
WeatherDelay | string | Weather Delay, in Minutes |
NASDelay | string | National Air System (NAS) Delay, in Minutes |
SecurityDelay | string | Security Delay, in Minutes |
LateAircraftDelay | string | Late Aircraft Delay, in Minutes |
- L_AIRPORT
col_name | data_type | Description |
---|---|---|
Code | string | Airport Code |
Description | string | Airport description in the following format (city, country: name) |
- L_CANCELLATION
col_name | data_type | Description |
---|---|---|
Code | string | Cancellation code (A to D) |
Description | string | Cancellation reason description |
- L_UNIQUE_CARRIERS
col_name | data_type | Description |
---|---|---|
Code | string | Unique carrier code |
Description | string | Carrier description |
- L_PLANE
col_name | data_type | Description |
---|---|---|
tailnum | string | Tail Number |
type | string | Type of the plane |
manufacturer | string | Manufacturer |
issue_date | date | Date of manufacturing |
model | string | Model of the plane |
status | string | Status of the plane (valid or not) |
aircraft_type | string | Aircraft type |
engine_type | string | Engine Type |
year | int | Year of manufacturing |
- Flights Dataset
  - Column names are not the same for the two sources they came from.
  - Year, Month and DayofMonth columns need to be combined into a date column.
  - DepTime, CRSDepTime, ArrTime and CRSArrTime hold numeric values instead of timestamps.
- L_AIRPORT
  - Description column needs to be split into airport_name, city and country columns.
- L_PLANE
  - There are null values in all columns except for tailnum.
- Flights Dataset
  - Uniform column names.
  - Simulate streaming app.
  - Add metadata.
  - Move to bronze table.
  - Add date column.
  - Drop metadata columns.
  - Drop Month, DayofMonth and DayOfWeek columns.
  - Transform numeric time to HH:mm format.
  - Move to silver table.
- L_AIRPORT
  - Split Description into airport_name, city and country columns.
  - Move to silver table.
- L_PLANE
  - Drop rows with null values.
  - Move to silver table.
- L_UNIQUE_CARRIERS
  - No transformation needed; move to silver table.
- L_CANCELLATION
  - No transformation needed; move to silver table.
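A minimal sketch of these silver-hop transformations in PySpark (paths and metadata column names follow the earlier sketches; nulls from cancelled flights are left as-is here and would need explicit handling in the real pipeline):

```python
from pyspark.sql import functions as F

bronze = spark.read.format("delta").load("delta/bronze/flights")

def hhmm_to_time(col):
    """Numeric hhmm (e.g. 730.0) -> 'HH:mm' string (e.g. '07:30')."""
    padded = F.lpad(F.col(col).cast("int").cast("string"), 4, "0")
    return F.concat_ws(":", F.substring(padded, 1, 2), F.substring(padded, 3, 2))

silver = (
    bronze
    # build the flight date, then drop the parts and metadata we no longer need
    .withColumn("FlightDate", F.expr("make_date(p_Year, Month, DayofMonth)"))
    .withColumn("DepTime", hhmm_to_time("DepTime"))
    .withColumn("CRSDepTime", hhmm_to_time("CRSDepTime"))
    .withColumn("ArrTime", hhmm_to_time("ArrTime"))
    .withColumn("CRSArrTime", hhmm_to_time("CRSArrTime"))
    .drop("Month", "DayofMonth", "DayOfWeek", "ingestion_time", "source_file")
)
silver.write.format("delta").mode("overwrite").save("delta/silver/flights")

# L_AIRPORT: "city, country: name" -> airport_name / city / country
airport = spark.read.csv("data/raw/L_AIRPORT.csv", header=True)
place = F.split("Description", ":").getItem(0)
airport_silver = (
    airport
    .withColumn("airport_name", F.trim(F.split("Description", ":").getItem(1)))
    .withColumn("city", F.trim(F.split(place, ",").getItem(0)))
    .withColumn("country", F.trim(F.split(place, ",").getItem(1)))
    .drop("Description")
)
airport_silver.write.format("delta").mode("overwrite").save("delta/silver/l_airport")

# L_PLANE: drop rows containing nulls before landing in silver
plane = spark.read.csv("data/raw/L_PLANE.csv", header=True, inferSchema=True)
plane.dropna().write.format("delta").mode("overwrite").save("delta/silver/l_plane")
```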
Users can connect to the data model from Power BI using the M function provided by Microsoft MVP Gerhard Brueckl here, as you can see in the following screenshot.
With it they can answer any question they have about the data, for example:
- What is the main reason for delays?
- What is the trend in the number of delayed flights by year?
- When is the best time of day/day of week/time of year to fly to minimise delays?
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- How well does weather predict plane delays?
...and many other questions.
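As one example, the first question can be read straight off the silver flights table by totalling the minutes attributed to each delay cause (a sketch, not the project's exact notebook code):

```python
from pyspark.sql import functions as F

flights = spark.read.format("delta").load("delta/silver/flights")

# Total minutes per delay cause; the largest total is the dominant reason.
delay_cols = ["CarrierDelay", "WeatherDelay", "NASDelay",
              "SecurityDelay", "LateAircraftDelay"]
flights.select(
    *[F.sum(F.col(c).cast("double")).alias(c) for c in delay_cols]
).show()
```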
Refer to the flights flowchart above. As we have already built the streaming pipeline from raw to silver, let's retrieve the rest of the data.
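A sketch of that batch retrieval for the second flights source (the BTS extracts): the rename map below is a hypothetical stand-in for the full column mapping, which is needed because the two sources use different column names.

```python
# Hypothetical mapping from BTS extract names to the uniform names used above.
rename_map = {"FL_DATE": "FlightDate", "OP_UNIQUE_CARRIER": "UniqueCarrier"}

bts = spark.read.csv("data/raw/bts/", header=True, inferSchema=True)
for old, new in rename_map.items():
    bts = bts.withColumnRenamed(old, new)

# Align the remaining columns with the silver schema before appending.
bts.write.format("delta").mode("append").save("delta/silver/flights")
```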
- Compare null values for flights raw, bronze and silver
- Compare top 10 UniqueCarrier, Origin and Dest values across the flights raw, bronze and silver tables
- Compare null values for lookup tables before and after transformations
- Compare null values for daily summary gold table
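These checks are easy to express as a small PySpark helper; a sketch (paths as in the earlier sketches):

```python
from pyspark.sql import functions as F

def null_counts(df):
    """One-row DataFrame with the number of nulls in each column."""
    return df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    )

bronze = spark.read.format("delta").load("delta/bronze/flights")
silver = spark.read.format("delta").load("delta/silver/flights")

null_counts(bronze).show()
null_counts(silver).show()

# Top-10 comparison: the distributions should match across the hops.
for df in (bronze, silver):
    df.groupBy("UniqueCarrier").count().orderBy(F.desc("count")).show(10)
```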
- Clearly state the rationale for the choice of tools and technologies for the project.
  - As mentioned in the project scope, Delta Lake is a new technology that brings together the best of data lakes and data warehouses.
- Propose how often the data should be updated and why.
  - The data should be updated monthly, since that is how often the sources provide it.
- Write a description of how you would approach the problem differently under the following scenarios:
  - The data was increased by 100x.
    - We can deploy our model to the cloud and scale up.
  - The data populates a dashboard that must be updated on a daily basis by 7am every day.
    - We can schedule the stream to run every day at 4am.
  - The database needed to be accessed by 100+ people.
    - Delta Lake supports many concurrent readers, so 100+ users can query the tables without issues.
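For the 7am-dashboard scenario, a minimal sketch of that 4am schedule as a crontab entry (the script name is illustrative; an orchestrator such as Airflow would be a more robust choice):

```
# run the daily ETL job at 04:00 so the dashboard is fresh by 7am
0 4 * * * spark-submit etl_flights_daily.py
```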