This repository helps to solve two problems in local model development for data scientists:
- A lot of time is often spend in setting up the coding framework to run your machine learning model (e.g. splitting train/test data, training/evaluating model etc). This repo sets an object-orientated, easy-to-understand code base which can be adjusted accordingly to your specific problem.
- Versioning your model and tracking your model performance with different inputs can be challenging. This repo uses dvc (to version the data) and mlflow (to track model performance).
To try out the package, follow the steps below:
- Clone this repository to local machine
- cd in folder root
chmod +x setup.sh
to make bash file executable./setup.sh
to run executable - this installs a virtualenv and downloads relevant data
Then add your data!
DVC stands for 'data version control' - it is the git for data. To start using dvc, follow the below steps (for advanced usage, see the website):
- cd to folder root
source venv/bin/activate
- activating virtual environmentdvc init
- creates a '.dvc' directory to initial the dvc packagedvc add data/train.csv
- creates a .dvc file (a placeholder for the original data), a small text file in a human-readable formatgit add data/train.csv.dvc & git commit -m "Add raw data"
- this adds the data to git
Whenever the data changes, simply run the dvc add [file]
command and commit to save into version control.
MLFlow helps to track and log machine learning experiments. To start using mlflow, follow the below steps:
- Adjust the code as necessary to fit your data/problem (e.g. you might need to do some data transformation / want to use a different ML model / use a different metric)
mlflow ui
into the root directory of the project- In
main.py
, select the evaluation method (run_single
orrun_cv
) - When ready to record a run, set
mlflow_record=True
and enter an experiment name (each experiment can have multiple runs associated with it) - Navigate to the UI (localhost:5000) and see the run recorded
Here are some top tips for tracking your model using MLflow, based on this blog post:
- If you are debugging, then set the variable
ml_flow_record
to false. This ensures that you only record the runs which are significant, aiding model analysis - Before you record a run in MLFlow, make sure you commit the code (via Git) and the data (via dvc). This ensures you have an accurate record in MLFlow on what code and data produced the model.
- When experimenting with new data / new approach, create a fresh branch in git. This uses the power of this framework to make it very easy to compare / switch back to existing setup
- Use experiments in MLFlow to group together runs
This repo contains the following files:
src/read_data.py
- files which reads in the data. The code currently works for data in a format detailed indata/data_structure.txt
src/transform.py
- a file for all your necessary data transformations to get the data into a form that can be consumed by ML model. This might involve filling in missing data, encoding categorical variables etcsrc/ml.py
- a file that contains the machine learning code, including classes which train the model, evaluate the model and predict on the test datamain.py
- file from which to run the source code
Please do contribute to improve the repository. If you have an issue with the current code/documentation, do open an issue here