MLOps project for the January course
This project uses the dataset from the Kaggle Toxic Comment Classification Challenge: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
The project aims to develop a classifier for identifying toxic comments. The classifier's primary function is to analyze individual comments and estimate the likelihood of each comment falling into one of seven categories: six specific toxicity classes (toxic, severe toxic, obscene, threat, insult, and identity hate), plus a seventh class for general classification.
To achieve this, we use the PyTorch-Transformers library (formerly known as pytorch-pretrained-bert) from HuggingFace to load the pretrained model and tokenizer. We use the standard "bert-base-uncased" checkpoint. We also use Streamlit to create a simple web app around our solution.
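As a rough sketch of what loading the model and tokenizer looks like (shown with the current transformers package, whose class names match PyTorch-Transformers; the exact calls in this repo may differ):

```python
# Hedged sketch: load the pretrained tokenizer and a 7-label classification
# head on top of "bert-base-uncased". The problem_type argument makes the
# model treat the task as multi-label (independent sigmoid per class).
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=7,  # six toxicity labels + one general class
    problem_type="multi_label_classification",
)

inputs = tokenizer("You are a wonderful person!", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, 7)
```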
Our data source is the Kaggle Toxic Comment Classification dataset, which comprises comments sourced from Wikipedia talk pages. Each comment is tagged with whichever of the six toxicity labels apply (a comment can carry several labels, or none at all). The dataset's structure and labels allow for a comprehensive training regime, giving our classifier the diverse and complex examples it needs. The dataset can be accessed through the Kaggle link above.
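For a quick look at the raw data, something like the following works (a sketch; the path assumes the repo layout shown below, and the column names follow the competition's train.csv schema):

```python
# Quick look at the raw competition data; the path follows the repo layout
# below and the label columns follow train.csv's schema.
import pandas as pd

df = pd.read_csv("data/raw/train.csv")
label_cols = ["toxic", "severe_toxic", "obscene",
              "threat", "insult", "identity_hate"]
print(df[["comment_text"] + label_cols].head())
print(df[label_cols].mean())  # per-label frequency: the classes are highly imbalanced
```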
For the modeling aspect, we are leveraging a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. The choice of BERT is strategic; given its extensive training on a large corpus, it is highly capable of understanding nuanced language patterns. Our approach involves further training this base model on the specific dataset to fine-tune it for our classification task. This method is expected to harness BERT's advanced language processing capabilities, making it adept at recognizing and classifying varying degrees of toxicity in comments.
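Concretely, multi-label fine-tuning reduces to a binary cross-entropy loss over the per-label logits. A minimal sketch of one training step (hyperparameters and the batch layout are illustrative, not the values used in this repo):

```python
# Hedged sketch of one multi-label fine-tuning step. BCEWithLogitsLoss
# applies an independent sigmoid per label, which is what multi-label
# toxicity classification needs; hyperparameters and the batch layout
# are illustrative, not the repo's actual values.
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=7)
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(batch: dict) -> float:
    logits = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits
    loss = criterion(logits, batch["labels"].float())  # labels: multi-hot (B, 7)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```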
The directory structure of the project looks like this:
├── Makefile <- Makefile with convenience commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── processed <- The final, canonical data sets for modeling.
│ │ ├── train.pt
│ │ ├── val.pt
│ │ └── test.pt
│ └── raw <- The original, immutable data dump. (we only use train.csv)
│ ├── train.csv
│ ├── sample_submission.csv
│ ├── test_labels.csv
│ └── test.csv
├── dockerfiles <- Dockerfiles to build images
│ ├── flaskapi.dockerfile
│ ├── inference_streamlit.dockerfile
│ ├── predict_model.dockerfile
│ └── train_model.dockerfile
│
├── docs <- Documentation folder
│ │
│ ├── index.md <- Homepage for your documentation
│ │
│ ├── mkdocs.yml <- Configuration file for mkdocs
│ │
│ └── source/ <- Source directory for documentation files
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks.
│ └── bert-with-fastai-example.ipynb
│
├── pyproject.toml <- Project configuration file
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ ├── README.md
│ │
│ ├── report.py
│ │
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment
│
├── requirements_dev.txt <- The requirements file for the development environment
│
├── requirements_inference.txt <- The requirements file for running inference
│
├── requirements_test.txt <- The requirements file for running the tests
│
├── tests <- Test files
│
├── toxic_comments <- Source code for use in this project.
│ │
│ ├── __init__.py <- Makes folder a Python module
│ │
│ │ ├── api <- Script to run the Streamlit app
│ │ └── streamlit_input_inference.py
│ │
│ ├── data <- Scripts to download or generate data
│ │ ├── __init__.py
│ │ └── make_dataset.py
│ │
│ │ ├── models <- Model implementation and configuration files
│ │ ├── __init__.py
│ │ └── model.py
│ │
│ ├── visualization <- Scripts to create exploratory and results oriented visualizations
│ │ ├── __init__.py
│ │ └── visualize.py
│ ├── train_model.py <- script for training the model
│ └── predict_model.py <- script for predicting from a model
│
└── LICENSE <- Open-source license if one is chosen
Created using mlops_template, a cookiecutter template for getting started with Machine Learning Operations (MLOps).
To get started, clone the repository and pull the data with DVC:

git clone https://github.com/adoprox/Group60_mlops.git
dvc pull

This creates a new directory with all the files needed for the model to work.
If you get this error:
```
ERROR: failed to pull data from the cloud - Checkout failed for following targets: models
Is your cache up to date?
```
then pull from the Google Cloud remote instead: dvc pull -r gcloud-storage
- To set a custom logging directory, set this environment variable before running the sweep: export WANDB_DIR=./outputs/wandb_logs/
- Create a new sweep using: wandb sweep ./toxic_comments/models/config/sweep.yaml
- Run the sweep with the command that is printed after you create the sweep.

Note: when you update sweep.yaml, you will need to create a new sweep to use the updated configuration file. If you reuse the old sweep, it will also reuse the old configuration file!
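For reference, here is a minimal sketch of how a training entry point reads the hyperparameters a sweep agent injects (the parameter names are placeholders, not the ones defined in sweep.yaml):

```python
# Hedged sketch: when launched by `wandb agent`, wandb.init() pulls the
# sampled hyperparameters into wandb.config. Parameter names are illustrative.
import wandb

def train() -> None:
    wandb.init()
    lr = wandb.config.lr                  # sampled by the sweep
    batch_size = wandb.config.batch_size  # sampled by the sweep
    print(f"training with lr={lr}, batch_size={batch_size}")
    wandb.log({"val_loss": 0.0})          # the metric the sweep optimizes

if __name__ == "__main__":
    train()
```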
- Training container:
docker build -f dockerfiles/train_model.dockerfile . -t trainer:latest
- Prediction container:
docker build -f dockerfiles/predict_model.dockerfile . -t predict:latest
- Inference container:
docker build -f dockerfiles/inference_streamlit.dockerfile . -t inference:latest
The prediction container is still a work in progress.
The Docker containers are set up without an entrypoint. The data root folder is set in the configuration; the default is ./data/processed.
For example, to run training:

docker run -v ./data:/data -v ./models:/models -e WANDB_API_KEY='<your-api-key>' trainer:latest python3 ./toxic_comments/train_model.py

IMPORTANT: to add GPU support to a container, add the flag --gpus all to the above command, like so:

docker run -v ./data:/data -v ./models:/models -e WANDB_API_KEY='<your-api-key>' --gpus all trainer:latest python3 ./toxic_comments/train_model.py

The same works with the prebuilt image from the container registry:

docker run -v ./data:/data -v ./models:/models -e WANDB_API_KEY='<your-api-key>' --gpus all gcr.io/propane-facet-410709/bert-toxic-trainer:latest python3 ./toxic_comments/train_model.py
The following section contains documentation and rules for how to interact with the cloud setup.
All operations should be done in region europe-west4 and zone europe-west4-a (if a fine-grained zone is needed).
Any training, testing, validation, or prediction data should be added to the bucket group_60_data. Any trained models should be added to the bucket group_60_models.
The following command can be used to create a new inference service based on the latest version of the streamlit inference container:
gcloud run deploy inference-streamlit --image gcr.io/propane-facet-410709/inference-streamlit:latest --platform managed --region europe-west4 --allow-unauthenticated --port 8501
Additionally, a new instance will be deployed via a trigger whenever a push to main happens.
- Create an instance with a GPU and choose one of the Deep Learning images. When starting the instance, make sure the NVIDIA drivers and CUDA are installed correctly, and make sure the VM has access to all APIs.
- Clone the repository
- Run dvc pull, supply credentials
- Train model
- Run dvc add models/
- Run dvc push -r gcloud-drive
Alternatively, the model can also be trained within a container. For that:
- Create an instance with a GPU and choose one of the Deep Learning images. When starting the instance, make sure the NVIDIA drivers and CUDA are installed correctly, and make sure the VM has access to all APIs.
- Pull the container with:
docker pull gcr.io/propane-facet-410709/bert-toxic-trainer:latest
- Install gcloud, gsutil, etc.
- Copy the training data from cloud storage:
gsutil rsync -r gs://group_60_data/data ./data
This command copies the data stored in the bucket into the local data directory (assuming the current directory is the project root).
- Run the container with the command shown above.
- wandb should automatically upload the model checkpoints, but they can also be uploaded manually using:
gsutil rsync -r ./local/path/to/models gs://group_60_models
The prediction script can classify a single comment or a list of comments given as input:
- List of comments: python toxic_comments/predict_model.py +file=<file_name>.csv
- Single comment: python toxic_comments/predict_model.py +text="comment to classify"
You can also specify which model checkpoint to use by adding the parameter: ++predict.checkpoint_path=<path_to_model>
N.B. '=' is a special character; if it is present in the path, it needs to be escaped by preceding it with a backslash (\).
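For context, a minimal sketch of a Hydra entry point consistent with the overrides above (the defaults and the CSV column name are assumptions, not the repo's actual values):

```python
# Hedged sketch of a Hydra entry point matching the overrides above.
# Key names (text, file, predict.checkpoint_path) follow the README;
# everything else (defaults, CSV layout) is an assumption.
import hydra
import pandas as pd
from omegaconf import DictConfig

@hydra.main(config_path=None, version_base=None)
def main(cfg: DictConfig) -> None:
    if "text" in cfg:
        comments = [cfg.text]                                       # +text="..."
    elif "file" in cfg:
        comments = pd.read_csv(cfg.file)["comment_text"].tolist()  # +file=...
    else:
        raise ValueError("Pass +text=... or +file=...")
    checkpoint = cfg.get("predict", {}).get("checkpoint_path", "models/latest.ckpt")
    print(f"Classifying {len(comments)} comment(s) using {checkpoint}")

if __name__ == "__main__":
    main()
```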
To test our solution and classify a comment you can access the Streamlit web app at: https://inference-streamlit-kjftsv3ocq-ez.a.run.app/
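For reference, a minimal sketch of such a Streamlit front end (the real app in toxic_comments/api/streamlit_input_inference.py may load a fine-tuned checkpoint and differ in detail):

```python
# Hedged sketch of a Streamlit inference app; the deployed app may load a
# fine-tuned checkpoint instead of the base model, and the seventh "general"
# label name is an assumption based on the project description above.
import streamlit as st
import torch
from transformers import BertTokenizer, BertForSequenceClassification

LABELS = ["toxic", "severe_toxic", "obscene", "threat",
          "insult", "identity_hate", "general"]

@st.cache_resource  # load the model once per server process
def load_model():
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(LABELS))
    model.eval()
    return tokenizer, model

st.title("Toxic comment classifier")
comment = st.text_area("Enter a comment to classify")
if st.button("Classify") and comment:
    tokenizer, model = load_model()
    inputs = tokenizer(comment, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    st.write({label: round(float(p), 3) for label, p in zip(LABELS, probs)})
```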