AISG Makerspace: DVC for CD4ML - Part 2

This repository contains all the code and scripts for the post named in the title. This README lists the commands for each workflow, to be executed in sequence.

Create a repository on GitHub (assuming the name dvc-for-cd4ml) and clone it on your local machine:

$ git clone<YOUR_GITHUB_USERNAME_HERE>/dvc-for-cd4ml.git
$ cd dvc-for-cd4ml
# NOTE: Run these commands from the repository's root location
$ curl --create-dirs -o .github/workflows/build-environment.yaml
$ curl --create-dirs -o .github/workflows/retag-docker-image.yaml
$ curl --create-dirs -o scripts/
$ git add .github scripts
$ git commit -m "First commit; add GA workflows and script for building environment."
$ git push origin master

Add the following credentials to your repository's GitHub Secrets:

  • DOCKERHUB_USER: Docker Hub account username.
  • DOCKERHUB_PW: Docker Hub password.
  • Create a new branch build-env, add files into it, and push it to the remote repository:
$ git checkout -b build-env
$ git push -u origin build-env
$ curl --create-dirs -o docker/dvc-for-cd4ml.Dockerfile
$ curl --create-dirs -o dvc-for-cd4ml-conda.yaml
$ git add docker dvc-for-cd4ml-conda.yaml
$ git commit -m "Add Dockerfile and Conda YAML file for building the custom dev environment."
$ git push origin build-env
  • Take note of the commit SHA from the workflow executed above and trigger the Change Latest Docker Image workflow manually.

  • Open a pull request to merge build-env into the master branch, then merge it.

  • Switch back to the master branch and pull the changes locally:

$ git checkout master
$ git pull origin master
  • Install the Conda environment locally:
$ conda env create -f dvc-for-cd4ml-conda.yaml -n dvc-for-cd4ml
$ conda activate dvc-for-cd4ml
  • Create a new branch test-code-check-exp and push it to remote:
$ git checkout -b test-code-check-exp
$ git push -u origin test-code-check-exp
  • Initialise DVC and configure its remote storage:
$ dvc init
# `dvc init` automatically stages the new `.dvc` folder in Git
$ git commit -m "DVC init."
$ dvc remote add -d azremote azure://dvc-remote/dvc-for-cd4ml
$ dvc remote modify --local azremote connection_string '<PASTE YOUR CONNECTION STRING HERE>'
$ git add .dvc/config
$ git commit -m "Add remote storage for DVC."
  • Download raw data:
$ wget --directory-prefix=./data/raw/
$ unzip data/raw/ -d data/raw
$ rm data/raw/
  • Add raw data to DVC and push to remote storage:
$ dvc add data/raw
$ git add data/raw.dvc data/.gitignore
$ git commit -m "Add raw data for DVC to track."
$ dvc push data/raw.dvc
  • Add scripts and parameter file for processing raw data:
$ curl --create-dirs -o src/
$ curl --create-dirs -o src/
$ curl --create-dirs -o params.yaml
$ git add src/ src/ params.yaml
$ git commit -m "Add Python scripts for data preparation and config (params) file."
  • Activate Conda environment and run the data processing pipeline through DVC:
# Make sure that the Conda environment has been activated
$ conda activate dvc-for-cd4ml
# NOTE: `dvc run` was removed in DVC 3.0; with newer releases, use `dvc stage add` with the same options, followed by `dvc repro`
$ dvc run -n data_prep \
    -d src/ -d data/raw \
    -o data/processed \
    python src/
  • Add and commit the newly created DVC artefacts:
$ git add data/.gitignore dvc.yaml dvc.lock
$ git commit -m "Add and execute data preparation pipeline to/through DVC."
  • Push the processed data to the DVC remote storage:
$ dvc push data_prep
  • Add and commit files for running unit tests and linter:
$ curl --create-dirs -o tests/
$ curl --create-dirs -o tests/
$ curl --create-dirs -o .pylintrc
$ git add tests/ tests/ .pylintrc
$ git commit -m "Add unit tests and linter configuration."
  • Create the workflow file for running tests and checks on GitHub Actions:
$ curl --create-dirs -o .github/workflows/test-analyse-code.yaml
  • In the workflow file, replace the username prefix ryzalk with your own Docker Hub username, then add and commit the file:
$ git add .github/workflows/test-analyse-code.yaml
$ git commit -m "Add GitHub Action workflow for running unit tests and linter."
$ git push origin test-code-check-exp
  • Log in to Weights & Biases on your local machine:
$ wandb login <YOUR_WANDB_API_KEY>
  • Create, add and commit the training script:
$ curl --create-dirs -o src/
$ git add src/
$ git commit -m "Add model training script."
  • Run the model experiment through DVC:
$ dvc run -n train_model \
    -d data/processed -d src/ \
    -o models/text-classification-model -o \
    -p train.epochs,,train.metric,train.pretrained_embedding \
    python src/
  • Add and commit the newly created artefacts to the repository, tag the commit, and then push to DVC remote storage:
$ git add models/.gitignore .gitignore dvc.yaml dvc.lock
$ git commit -m "Add a DVC pipeline stage for training the binary sentiment classification model."
$ git tag -a "model-v1.0" -m "First version of text classification model."
$ dvc push train_model
  • Create the file for the GitHub Action workflow:
$ curl --create-dirs -o .github/workflows/comment-pull-req.yaml
  • In the workflow file, replace the username prefix ryzalk with your own Docker Hub username, then add and commit the file:
$ git add .github/workflows/comment-pull-req.yaml
$ git commit -m "Add GitHub action workflow for displaying model training experiment results. Trigger: comment-model-exp"
  • Add the Azure connection string as a GitHub secret; name it AZ_CONN_STRING.

  • Push the test-code-check-exp branch to the remote repository:

$ git push origin test-code-check-exp
  • Navigate to another folder (or machine), clone the repository and reproduce the model training pipeline:
$ git clone<YOUR_GITHUB_USERNAME_HERE>/dvc-for-cd4ml.git
$ cd dvc-for-cd4ml
$ git checkout -b improve-model
$ git push -u origin improve-model
$ dvc remote modify --local azremote connection_string '<PASTE YOUR CONNECTION STRING HERE>'
$ dvc pull data_prep train_model
# Make sure that the relevant conda environment has been activated
$ conda activate dvc-for-cd4ml
$ dvc repro --downstream train_model
Stage 'train_model' didn't change, skipping
Data and pipelines are up to date.
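
The "didn't change, skipping" message above comes from DVC comparing the current state of a stage's dependencies with what is recorded in dvc.lock, where file dependencies are identified by content hashes. The sketch below is a simplified illustration of that idea in Python, not DVC's actual implementation; the function names and the example file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Mapping, Sequence


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_changed(deps: Sequence[Path], recorded: Mapping[str, str]) -> bool:
    """Return True if any dependency's current hash differs from the recorded one."""
    return any(file_md5(dep) != recorded.get(str(dep)) for dep in deps)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dep = Path(tmp) / "dependency.txt"  # hypothetical stage dependency
        dep.write_text("first version")
        recorded = {str(dep): file_md5(dep)}  # stand-in for the hashes kept in dvc.lock

        print(stage_changed([dep], recorded))  # False: nothing to rerun

        dep.write_text("second version")
        print(stage_changed([dep], recorded))  # True: the stage would be rerun
```

When no dependency hash differs from the recorded one, the stage can be skipped, which is the behaviour shown in the output above.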
  • Change the value of the train.epochs parameter in params.yaml to 5 and reproduce the pipeline:
$ dvc repro --downstream train_model
Running stage 'train_model' with command:
	python src/
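
Parameters passed with -p are recorded by value in dvc.lock, so editing train.epochs is enough to invalidate the train_model stage. A minimal sketch of such a comparison follows, assuming params.yaml has already been parsed into nested dictionaries; the helper names and all parameter values other than the new epochs value of 5 are hypothetical.

```python
from functools import reduce
from typing import Any, Iterable, List, Mapping


def get_param(params: Mapping[str, Any], dotted_key: str) -> Any:
    """Look up a nested value using a dotted key such as 'train.epochs'."""
    return reduce(lambda node, key: node[key], dotted_key.split("."), params)


def changed_params(
    current: Mapping[str, Any],
    recorded: Mapping[str, Any],
    tracked: Iterable[str],
) -> List[str]:
    """Return the tracked keys whose current value differs from the recorded one."""
    return [
        key
        for key in tracked
        if get_param(current, key) != get_param(recorded, key)
    ]


# Hypothetical values: only the new epochs value (5) comes from this README.
recorded = {"train": {"epochs": 3, "metric": "accuracy"}}
current = {"train": {"epochs": 5, "metric": "accuracy"}}
tracked = ["train.epochs", "train.metric"]

print(changed_params(current, recorded, tracked))  # ['train.epochs']
```

A non-empty result corresponds to the "Running stage 'train_model'" output above, while an empty one corresponds to the stage being skipped.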
  • Add, commit, tag, and push the newly changed/created artefacts to the remote Git repository and DVC remote storage:
$ git add dvc.lock params.yaml
$ git commit -m "Train second version of sentiment classification model (5 epochs). Trigger: comment-model-exp"
$ git tag -a "model-v2.0" -m "Second version of text classification model."
$ git push origin improve-model
$ dvc push train_model
  • Create a pull request from the improve-model branch into master and merge it. Then pull the changes into your local master branch:
$ git checkout master
$ git pull origin master
  • Retrieve a previous version of the predictive model:
$ git tag
$ git checkout model-v1.0 -- dvc.lock
$ dvc checkout train_model
M       models/text-classification-model/
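
The two commands above work because dvc.lock records a content hash for each stage output, and dvc checkout restores whatever content matches the hashes in the dvc.lock currently in the workspace. The toy content-addressed cache below sketches that mechanism; it is an illustration with hypothetical names and file contents, and it does not mirror DVC's real cache layout.

```python
import hashlib
import tempfile
from pathlib import Path


class ContentCache:
    """Toy content-addressed cache: each file is stored under its MD5 digest."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def add(self, source: Path) -> str:
        """Copy a file into the cache and return its digest."""
        data = source.read_bytes()
        digest = hashlib.md5(data).hexdigest()
        (self.root / digest).write_bytes(data)
        return digest

    def checkout(self, digest: str, target: Path) -> None:
        """Restore the cached content with this digest to the target path."""
        target.write_bytes((self.root / digest).read_bytes())


def demo() -> str:
    """Produce two model versions, then restore the first one by its digest."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        cache = ContentCache(workdir / "cache")
        model = workdir / "model.bin"  # hypothetical model file

        model.write_bytes(b"weights after first run")
        lock_v1 = cache.add(model)  # digest a lock file would hold at model-v1.0

        model.write_bytes(b"weights after second run")
        lock_v2 = cache.add(model)  # digest a lock file would hold at model-v2.0
        assert lock_v1 != lock_v2   # different content, different digest

        # Equivalent in spirit to checking out the old lock file, then restoring.
        cache.checkout(lock_v1, model)
        return model.read_text()


print(demo())  # weights after first run
```

Because both versions stay in the cache, switching between them only requires knowing the digest, which is what the tagged dvc.lock provides.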
  • Create the files for packaging and serving the model:
$ curl --create-dirs -o src/
$ curl --create-dirs -o docker/dfc-model-flask-server.Dockerfile
$ curl --create-dirs -o .dockerignore
  • Build the Docker image on your local machine:
$ docker build . -t dfc-model-flask-server:model-v1.0 -f ./docker/dfc-model-flask-server.Dockerfile
  • Run the container:
$ docker run -d -p 80:80 --name dfc-serving dfc-model-flask-server:model-v1.0
  • Send a POST request to get a prediction:
$ curl -X POST "localhost/predict" -H "Content-Type: application/json" -d '{"text": "This movie was unpleasant, like the year 2020."}'
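
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above (same endpoint, header, and JSON body); the function names are hypothetical, and since the response format depends on the serving script, the body is returned as raw text.

```python
import json
import urllib.request


def build_predict_request(
    text: str, base_url: str = "http://localhost"
) -> urllib.request.Request:
    """Build the POST request for the /predict endpoint without sending it."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = "http://localhost", timeout: float = 10.0) -> str:
    """Send the request and return the raw response body as text."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the dfc-serving container running, `print(predict("This movie was unpleasant, like the year 2020."))` prints whatever the server returns.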
  • Stop the container and push the files above to the remote repository:
$ docker stop dfc-serving
# Restore the master version of dvc.lock, discarding the change made by the model-v1.0 checkout
$ git checkout master -- dvc.lock
$ git add .dockerignore docker/dfc-model-flask-server.Dockerfile src/
$ git commit -m "Add Python script and Dockerfile for model serving."
$ git push origin master