A basic multivariate analysis of simulated physics data using the machine learning toolkit scikit-learn. The entire analysis example is presented in the form of Jupyter Notebooks.
This project is aimed mainly at people who are new to machine learning, scikit-learn or Python, and provides them with code implementations of basic analysis steps and plots. It does not claim to yield optimal, or even good, analysis results; its purpose is to illustrate some fundamental concepts of applied machine learning.
The provided example makes use of the following Python libraries/frameworks:
The following concepts are covered and implemented:
- Data visualization
  - Variable histograms
  - Scatter matrix
  - Correlation matrix
  - RadViz
- Data preprocessing
  - Standard scaling
  - Data split into training, validation and test samples
- Model definition and training with decision trees / random forest
  - Overfitting the data and how to prevent it
  - Feature importances
- Model evaluation using training and validation sample
  - MVA output distribution
  - Cut efficiencies plot / MVA cut optimization
  - ROC curve
  - Precision-recall curve
  - Confusion matrix
  - Classification report
- Model application to the test sample / classifier performance assessment
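As a rough orientation, the steps listed above could be chained together as in the following minimal sketch. It is not taken from the notebooks: the synthetic data from `make_classification` merely stands in for the simulated physics sample, and the split fractions, the hyperparameters and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic two-class data stands in for the simulated physics sample.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5, random_state=42
)

# Data split into training, validation and test samples (roughly 60/20/20).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

# Standard scaling: fit on the training sample only, then apply to all samples.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

# Model definition and training; limiting the tree depth is one way to curb overfitting.
clf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
clf.fit(X_train_s, y_train)

# Feature importances of the trained forest (they sum to 1).
importances = clf.feature_importances_

# Model evaluation on the validation sample.
val_scores = clf.predict_proba(X_val_s)[:, 1]  # MVA output for the positive class
val_auc = roc_auc_score(y_val, val_scores)
val_cm = confusion_matrix(y_val, clf.predict(X_val_s))

# Final performance assessment on the untouched test sample.
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test_s)[:, 1])

print(f"validation ROC AUC: {val_auc:.3f}")
print(f"test ROC AUC:       {test_auc:.3f}")
print("validation confusion matrix:")
print(val_cm)
```

Fitting the scaler on the training sample alone keeps information from the validation and test samples out of the preprocessing step, so the test sample remains an unbiased check of the classifier.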
Notebooks:
example analysis - random forest.ipynb
: Example analysis of simulated physics data using a random forest classifier.
overtraining demo - random forest.ipynb
: Using simulated physics data and a random forest classifier to demonstrate the effects and consequences of overtraining.
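The effect of overtraining can be illustrated with a short, hedged sketch that compares the classifier's performance on the training sample with its performance on an independent validation sample. This is not the notebook's code: the noisy synthetic data, the helper `train_val_auc` and the chosen hyperparameters are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: noisy synthetic data (20% of labels randomized) stands in for the physics sample.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)


def train_val_auc(**forest_kwargs):
    """Hypothetical helper: train a forest, return (training AUC, validation AUC)."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0, **forest_kwargs)
    clf.fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
    auc_val = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    return auc_train, auc_val


# Fully grown trees can memorize the training sample, noise included.
deep_train, deep_val = train_val_auc(max_depth=None, min_samples_leaf=1)

# Shallow trees with larger leaves are constrained and generalize more evenly.
shallow_train, shallow_val = train_val_auc(max_depth=3, min_samples_leaf=20)

print(f"unconstrained: train AUC {deep_train:.3f}, validation AUC {deep_val:.3f}")
print(f"constrained:   train AUC {shallow_train:.3f}, validation AUC {shallow_val:.3f}")
```

A large gap between training and validation performance is the typical signature of overtraining; constraining the trees usually narrows that gap, though the best settings depend on the data.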
This project is far from perfect, not least because the author is relatively new to Python and scikit-learn himself.
If you have any suggestions or improvements, feel free to let me know or contribute in any way you like.
This project is licensed under the MIT License - see the LICENSE file for details.
- Many thanks to S. Lehner for valuable feedback on first drafts of this project.
- Following PurpleBooth's README style
- All badges made by shields.io