Flight_Delay_Prediction

A Big Data assignment built on Spark: the Airbus data is fetched and used to train a linear regression model that predicts flight delays.

The full report can be found here: REPORT (Report.pdf in the repository)

Getting Started

Dependencies

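Here are all the dependencies needed for the project.

The following script, written for a notebook (the leading `!` runs a shell command), downloads PySpark and Java 8. Remember the path of your installation_folder, because SPARK_HOME must point to it.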

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://mirrors.viethosting.com/apache/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/installation_folder/spark-2.4.7-bin-hadoop2.7"
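Before launching Spark it can help to confirm that both environment variables point to real directories. The snippet below is a minimal sketch of such a check. The helper name `check_spark_env` is hypothetical and is not part of this repository.

```python
import os

def check_spark_env(env=os.environ):
    """Return a list of problems found with JAVA_HOME / SPARK_HOME (empty if none).

    Hypothetical helper, not part of the project: it only checks that each
    variable is set and that the directory it names exists.
    """
    problems = []
    for var in ("JAVA_HOME", "SPARK_HOME"):
        path = env.get(var)
        if not path:
            problems.append(f"{var} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{var} points to a missing directory: {path}")
    return problems

# Print a warning for every problem detected in the current environment.
for problem in check_spark_env():
    print("WARNING:", problem)
```

An empty result means both variables are set and their directories exist. It does not prove that the Spark installation itself is complete.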

Clone this repo:

git clone https://github.com/LorenzoFramba/Flight_Delay_Prediction.git
cd Flight_Delay_Prediction

Install dependencies: finish by running setup.py, which installs any library that is still missing.

python3 setup.py install

To Start the program

Select the --path at which the Airbus dataset is saved. If --path is not specified, the program assumes the dataset is in the same folder as the project itself. Make sure the file is named 'year.csv', where year is a 4-digit number from 1987 to 2008 (for example, '2008.csv').

python3 main.py --dataset 'year.csv'
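The naming rule above can be checked before starting a run. The following is a sketch only. The function `is_valid_dataset_name` is a hypothetical helper, and the actual validation in main.py may differ.

```python
import re
from pathlib import Path

# Year range stated in the README.
FIRST_YEAR, LAST_YEAR = 1987, 2008

def is_valid_dataset_name(filename):
    """True if the file name looks like '<year>.csv' with year in 1987..2008.

    Hypothetical helper: only the final path component is inspected.
    """
    match = re.fullmatch(r"(\d{4})\.csv", Path(filename).name)
    return bool(match) and FIRST_YEAR <= int(match.group(1)) <= LAST_YEAR

print(is_valid_dataset_name("2008.csv"))  # a year inside the range
print(is_valid_dataset_name("1986.csv"))  # a year outside the range
```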

You can also choose the train/test split with --split_size_train (default is 75 / 25) and the dataset sample size used for training and testing with --dataset_size.

You can also choose the ML model type with --model: 'linear_regression', 'gradient_boosted_tree_regression', 'decision_tree_regression' or 'random_forest' (default: linear_regression).

The 'all' option trains and tests all the models, compares their respective R2 scores and selects the best-performing one.
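To make the comparison concrete, here is a minimal pure-Python sketch of computing R2 and picking the model with the highest score. It is not the project's code, which presumably relies on Spark's evaluators. The function names and the toy numbers are invented for illustration and are not results from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot would be zero).
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def best_model(scores):
    """Return the name of the model with the highest R2."""
    return max(scores, key=scores.get)

# Toy target values and toy predictions, invented for this example only.
y_true = [1.0, 2.0, 3.0]
predictions = {
    "linear_regression": [1.0, 2.0, 3.0],  # matches the targets exactly
    "random_forest": [2.0, 2.0, 2.0],      # always predicts the mean
}
scores = {name: r_squared(y_true, pred) for name, pred in predictions.items()}
print(best_model(scores))
```

A model that always predicts the mean of the targets gets an R2 of 0, and a perfect fit gets 1, which is why a higher R2 is treated as better.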

python3 main.py --dataset 'year.csv' --model 'linear_regression' --split_size_train 75 --dataset_size 100000
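For readers who want to see how such options could be wired together, the sketch below rebuilds the flags from the command above with argparse. It is an illustration under assumptions, not the parser in main.py. The defaults for --dataset and --dataset_size and the helper `split_weights` are hypothetical.

```python
import argparse

# Model names listed in the README, plus the 'all' option.
MODELS = [
    "linear_regression",
    "gradient_boosted_tree_regression",
    "decision_tree_regression",
    "random_forest",
    "all",
]

def build_parser():
    """Build a parser mirroring the documented flags (sketch, not main.py)."""
    parser = argparse.ArgumentParser(description="Flight delay prediction (sketch)")
    parser.add_argument("--dataset", default=None)  # assumed: no default file name
    parser.add_argument("--model", choices=MODELS, default="linear_regression")
    parser.add_argument("--split_size_train", type=int, default=75)
    parser.add_argument("--dataset_size", type=int, default=None)  # assumed: None = full data
    return parser

def split_weights(split_size_train):
    """Turn a train percentage such as 75 into [train, test] fractions."""
    if not 0 < split_size_train < 100:
        raise ValueError("split_size_train must be between 1 and 99")
    return [split_size_train / 100, (100 - split_size_train) / 100]

args = build_parser().parse_args(
    ["--dataset", "2008.csv", "--model", "linear_regression", "--dataset_size", "100000"]
)
print(args.model, split_weights(args.split_size_train))
```

Fractions in this form are what Spark's `DataFrame.randomSplit` accepts as weights, so a list like the one returned here could be passed to it directly.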

Variable Selection

The variables are selected by analyzing patterns and the correlation matrix (pass --view True to display it). We selected the following sets of variables:

"X1": ['DepDelay', 'TaxiOut']
"X2": ['DepDelay', 'TaxiOut', 'HotDepTime']
"X3": ['DepDelay', 'TaxiOut', 'HotDayOfWeek', 'Speed']
"X4": ['DepDelay', 'TaxiOut', 'HotDayOfWeek', 'Speed', 'HotMonth']
"X5": ['DepDelay', 'TaxiOut', 'Speed', 'HotDepTime', 'HotCRSCatDepTime', 'HotCRSCatArrTime']
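The correlation matrix mentioned above can be illustrated with a small pure-Python sketch of the Pearson coefficient. This is not the project's implementation. The helper names and the toy column values are invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every pair of column names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy values, invented for illustration (not taken from the dataset).
toy = {
    "DepDelay": [0.0, 5.0, 10.0, 20.0],
    "TaxiOut": [12.0, 15.0, 11.0, 18.0],
}
for name, row in correlation_matrix(toy).items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values close to 1 or -1 indicate a strong linear relationship between two columns, and values near 0 indicate a weak one.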

By default, the model runs with the simplest variable set, X1. You can use X5, the best-performing set, by passing "best" to --variables. You can also pass "all" to try every set.

python3 main.py --dataset 'year.csv' --model 'all' --split_size_train 75 --variables 'best' --view True
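The mapping from the --variables option to a variable set could look like the sketch below. The dictionary copies the sets listed above. The function `resolve_variables` is hypothetical, and accepting a set name such as "X3" directly is an assumption that the README does not confirm.

```python
# Variable sets as listed in this README.
VARIABLE_SETS = {
    "X1": ["DepDelay", "TaxiOut"],
    "X2": ["DepDelay", "TaxiOut", "HotDepTime"],
    "X3": ["DepDelay", "TaxiOut", "HotDayOfWeek", "Speed"],
    "X4": ["DepDelay", "TaxiOut", "HotDayOfWeek", "Speed", "HotMonth"],
    "X5": ["DepDelay", "TaxiOut", "Speed", "HotDepTime",
           "HotCRSCatDepTime", "HotCRSCatArrTime"],
}

def resolve_variables(option="X1"):
    """Return the variable sets to run for a given --variables value.

    Hypothetical helper: 'best' maps to X5, 'all' to every set, and the
    default is X1. Accepting a set name directly is an assumption.
    """
    if option == "best":
        return {"X5": VARIABLE_SETS["X5"]}
    if option == "all":
        return dict(VARIABLE_SETS)
    if option in VARIABLE_SETS:
        return {option: VARIABLE_SETS[option]}
    raise ValueError(f"unknown --variables value: {option!r}")

print(list(resolve_variables("best")))
print(list(resolve_variables("all")))
```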

If you have any doubts, run:

python3 main.py --help
