This repository demonstrates how to use Data Version Control (DVC) in combination with Makefiles to streamline and manage a machine learning project. DVC is a powerful tool that helps track changes in data files, plots, machine learning models, and metrics, ensuring reproducibility and shareability of your data science workflows.
The project aims to analyze a Kaggle dataset containing insurance cost information and build a predictive model to understand the factors influencing insurance charges. The dataset can be accessed from the following link: Insurance Cost Dataset
To tackle this problem, we will follow the abridged standard data science workflow (a sketch of how these steps can be wired into DVC stages follows the list):
* Data Gathering: Import the dataset from Kaggle using the Kaggle API.
* Data Cleaning: Load the data, handle missing values, remove duplicates, and perform necessary data transformations.
* Exploratory Data Analysis (EDA): Conduct exploratory data analysis by creating visualizations like correlation matrices, bar charts, and bubble charts to gain insights into the data.
* Feature Engineering: Encode categorical variables into numerical form to prepare the data for machine learning models.
* Model Building and Testing: Train and evaluate various machine learning models, such as linear regression, polynomial regression, and decision tree regression, to identify the most important features influencing insurance costs.
* Model Deployment and Monitoring: Deploy the best-performing model and monitor its performance over time.
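As a rough illustration of how a step like data cleaning can be wired into the DVC pipeline, the sketch below registers a hypothetical stage with `dvc stage add`. The stage, dependency, and output names here are placeholders; the actual definitions live in dvc.yaml.
# hypothetical example: register the cleaning step as a DVC stage
# (file names are placeholders; see dvc.yaml for the real stages)
dvc stage add -n clean_data \
    -d cleandata.py -d data/insurance.csv \
    -o data/clean.csv \
    python cleandata.py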
The machine learning models, particularly the decision tree regressor, are expected to reveal the main factors driving insurance charges through their learned feature importances.
The repository has the following structure:
.
├── Makefile            # Commands to manage the project lifecycle
├── activate_venv.sh    # Script to activate the virtual environment (optional)
├── cleandata.py        # Script to load, clean, and preprocess the data
├── dvc.lock            # File generated by DVC to lock the project state
├── dvc.yaml            # DVC configuration file for workflow steps
├── eda.py              # Script for exploratory data analysis
├── evaluate.py         # Script to evaluate machine learning models
├── import_data.sh      # Script to import data from Kaggle
├── params.yaml         # File to store and manage hyperparameters
├── requirements.txt    # Python package requirements
├── send_sms.py         # Script to send a text message with Africa's Talking API
└── split_data.py       # Script to split data into training and testing sets
To set up the project, follow these steps:
Ensure you have a virtual environment running. You can create and activate a new virtual environment using the following commands:
# Create a virtual environment
python3 -m venv .venv
# Activate the virtual environment
source .venv/bin/activate # On Windows, use `.venv\Scripts\activate`
You need a Kaggle account to use the Kaggle API. Handle the resulting kaggle.json credentials file with care: don't add it to the repository. You can enforce this by adding it to the .gitignore and .dockerignore files.
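The Kaggle CLI looks for the credentials file at ~/.kaggle/kaggle.json and expects restrictive permissions. A typical setup looks like this (assuming the file was downloaded to ~/Downloads):
# move the credentials where the Kaggle CLI expects them
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
# keep the credentials out of version control and Docker images
echo "kaggle.json" >> .gitignore
echo "kaggle.json" >> .dockerignore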
Install the required Python packages by running:
make install
This command will install all the dependencies listed in requirements.txt in your virtual environment.
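If you prefer not to use make, the target presumably boils down to a plain pip install; check the Makefile for the exact recipe:
# manual equivalent of `make install` (assumption: see the Makefile)
pip install -r requirements.txt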
Note: The project was originally developed using Python 3.12.
Some steps, notably the SMS notification in send_sms.py, will fail if the required environment variables are not set. Make sure to export the following:
# Africa's Talking API credentials
export AT_API_KEY="your_api_key"
export AT_USERNAME="your_username"
export PHONE_NUMBER="your_phone_number"
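You can quickly verify that all three variables are set before running the pipeline:
# sanity check: all three variables should be printed
env | grep -E 'AT_API_KEY|AT_USERNAME|PHONE_NUMBER'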
After setting up the virtual environment and installing the dependencies, you can run the project using the following commands:
# Run the entire project pipeline without saving outputs to the DVC cache
dvc repro --no-commit
# Run a specific step (e.g., import data)
dvc repro import_data
# Run the entire project pipeline and commit the changes
dvc repro
# To visualize the pipeline
dvc dag
# save the DAG to a Markdown file (note: >> appends to dag.md)
dvc dag --md >> dag.md
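To preview which stages would be executed without actually running anything, DVC also offers a dry run:
# print the commands that would run, without executing them
dvc repro --dry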
Refer to the Makefile for individual commands and steps.
Moreover, you can use the following DVC commands for comparing metrics, parameters, and experiments:
# Compare metrics between runs (requires metrics stored in JSON format)
dvc metrics diff
# Compare hyperparameters between runs
dvc params diff
# Compare different machine learning experiments
dvc exp show
# Queue an experiment with modified parameters (run it later with --run-all)
dvc exp run -S 'evaluate_model.min_samples_leaf=25' -S 'evaluate_model.max_leaf_nodes=2' -S 'split_data.strategy=kfold' -S 'split_data.test_size=0.2' --queue
# e.g., queue one run per min_samples_leaf value in range(20, 25) for a grid search (see the loop sketch below)
# run all the experiments in the queue
dvc exp run --run-all --jobs 2
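The range(20, 25) grid search mentioned above can be queued with a small shell loop. This is a sketch, assuming evaluate_model.min_samples_leaf is defined in params.yaml:
# queue one experiment per min_samples_leaf value, then run them all
for leaf in 20 21 22 23 24; do
    dvc exp run --queue -S "evaluate_model.min_samples_leaf=$leaf"
done
dvc exp run --run-all --jobs 2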
# apply results of the best experiment
# you can get the name of the experiment by running dvc exp show
dvc exp apply <exp>
# Or in a different branch
dvc exp branch <exp>
# commit the changes to dvc and git
dvc commit
git commit -m "commit message"
# check the status of the dvc pipeline
dvc status
DVC metrics, plots, and parameters: https://dvc.org/doc/start/data-pipelines/metrics-parameters-plots
DVC experiments: https://dvc.org/doc/start/experiments
- Add logging instead of using print statements.
- Create a workflow where the generated files are stored in Google Drive or object storage (e.g., via a DVC remote).
- Edit evaluate.py so that the linear regression and decision tree regressor hyperparameters can be set via argparse.
- Try other models, such as Random Forest and XGBoost, and add them to the workflow.