We have a dataset consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed(bzip2) and 12 gigabytes when uncompressed(csv). The goal of this project is to summarize the data by time periods, carriers,airports and understand different factors causing the airline delays.
This dataset can be find here Airline on-time performance.
- 1987-2009 (.bz2 format)
- airports.csv
- carriers.csv
- plane-data.csv
- Are some time periods more prone to delays than others.
- What are the best hours to travel.
- Are some airports more prone to delays than others.
- Are some carriers more prone to delays than others.
- Do old planes suffer more delays.
- Good understanding of Hadoop Environment.
- Hive with Partitioning and Bucketing.
- Tez/mapreduce framework.
- Dimensional modeling for OLAP.
- Splittable and non-splittable compression formats.
- Different file formats(Avro, ORC, Parquet).
- Data Driven Development for project building.
Star Schema is considered for data modeling because we are focused on analysing the data ie.. more read operations.
data_fact_table - records all the commercial flights arrival and departure details with delays.
airports_dim_table - contains information about the airports.
carriers_dim_table - contains information about the carriers.
plane_dim_table - contains information about the planes.
We have the historical data, for batch processing let us ingest the dataset into our on-prem Hadoop HDFS.
-
Connecting to hortonworks Hadoop docker using SSH
ssh [email protected] -p 2222
-
copy the entire data using SCP to docker from host machine.
scp -P 2222 /input_data [email protected]:/input_data
-
copy data from local file system to HDFS.
hdfs dfs -copyFromLocal /input_data/* /input_data
- Created staged table to understand the data types and transformations required to batch processing.
- Created static partitioning on year for fast retrieval for text format.
- Created final dimensional tables and fact table with appropriate data types and necessary transformation using hive UDF functions.
- Created Dynamic partitioning on year for fast retrieval for ORC format.
- This process has columnar format storage for best compression storage.
- Adhoc queries for getting the answers for the business logic.
Note If lets say we get the data every year or month then we can ingest data by creating a staged table
and insert the delta data into final table using insert into..
- We see that April, May and September has minimum delays than other months and are best time to fly whereas June and December has maximum delays.
- We see that morning flights between 4-5 AM are best to fly whereas 6-8 PM are worst to fly.
- We see that Chicago, Atlanta,Dallas being the most delayed planes when compared to others.
- We see that Hawaiian ,Aloha arilines are better to fly.
- We dont see year as factor for being prone to more delays.
Note we can perform many more adhoc query analysis based on the requirement.
- Run the
main.py
to get everything setup.- use tableau to visualize the requirement and necessary analysis using views.
- use HQL for necessary analysis.