This repository contains the code and data for the Performance Review Commission Data Challenge. More information about the challenge can be found here.
To install the required Python (this repository was made using Python 3.11) packages, run the following command:
pip install -r requirements.txt
The challenge set's combined parquet file is too large to be stored in this repository. Please run the following command to combine the daily parquet files into a single file:
cd data
python combine_parquet.py
cd ..
You can then run the training script to train the model and generate the submission file:
python train_model.py
or run Bayesian optimisation:
python bayesian_search.py
Due to incombatibilities between SHAP and HistGradientBoostingRegressor, the SHAP model has to be trained separately:
python train_shap.py
Afterward, you can run the SHAP analysis in the notebook:
jupyter notebook feature_importances.ipynb
You can find all used data in the data
folder. The data is already preprocessed and ready to be used for training and evaluation.
Used features and sources can be found in the README_datapreparation.md.
The data is computed for the challenge and final submission sets.
For each set, we can compute a feature matrix X and for the challenge set, we can compute the target vector y (the given takeoff weights).
After all data sources have been prepared the feature matrix X and target vector y can be obtained by calling get_data()
in create_Xy.py
.
In this project, a HistGradientBoostingRegressor was chosen as it performs exceptionally well on structured, tabular data.
We performed hyperparameter optimisation using Bayesian optimisation with the scikit-optimize library.
We can now take a look at the model parameters and evaluate the inference of each individual parameter to the model output.
The underlying model is a HistGradientBoostingRegressor, which can be analyzed using the shap
library.
SHAP (SHapley Additive exPlanations) values can be used to explain the model's predictions and understand the importance of each feature in the prediction.
You can find detailed values and plots in the feature_importances.ipynb
notebook.
The features with the highest SHAP values are mtom, aircraft_type, oew
which are closely related to the aircraft type.
Therefore, we surmise the model performance could be improved by adding aircraft variants to the dataset, making these values more specific.
We also provide some mean values of the feature parameters of the challenge set for the different aircraft types in the data/mean_data.csv
file.
Further important features are the mean climb rate value during the climb phase having an inverse effect on the takeoff weight, e.g. indicating an empty aircraft having better climb performance. Interestingly, there also is an influence of the airline possibly indicating different operating specifics. However, the high SHAP values could likely also be related to the aircraft type, which will correlate with the airline.
The code in this repository, precisely the train_model.py
script, will reproduce v16 of the submission, giving a root mean square error of 2,355.61kg.
Felix Soest @fsoest
Paul Hollmann @paulhollmann