This example uses ORTModule to fine-tune several popular HuggingFace models.
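ORTModule wraps an existing `torch.nn.Module` so that the forward and backward passes run through ONNX Runtime while the rest of the PyTorch training loop stays the same. A minimal sketch of the idea (the model, optimizer, and import path below are illustrative; the actual recipe drives training through `hf-ort.py` and the HuggingFace Trainer):

```python
# Minimal sketch: wrap a HuggingFace model with ORTModule so ONNX Runtime
# executes forward/backward. Model and optimizer choices are illustrative only;
# the recipe itself drives training through hf-ort.py and the HF Trainer.
import torch
from transformers import AutoModelForSequenceClassification
from onnxruntime.training.ortmodule import ORTModule  # requires onnxruntime-training

model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased")
model = ORTModule(model)   # the only change vs. plain PyTorch fine-tuning
model.to("cuda")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
# ... standard training loop: loss = model(**batch).loss; loss.backward(); optimizer.step()
```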
- Clone this repo and initialize the git submodules:
```bash
git clone https://github.com/microsoft/onnxruntime-training-examples.git
cd onnxruntime-training-examples
git submodule update --init --recursive
git submodule foreach git pull origin main
```
- Make sure Python 3.8+ is installed
We recommend using conda to manage the Python environment. If you do not have conda installed, you can follow the instructions to install conda here. Once conda is installed, create a new Python environment with
```bash
conda create --name myenv python=3.8
```
- Install azureml-core
Activate the conda environment just created:
```bash
conda activate myenv
```
Install the azureml dependency for script submission:
```bash
pip install azureml-core
```
- An AzureML subscription is required to run this example. Either a config.json file (How to get a config.json file from the Azure Portal) or the subscription_id, resource_group, and workspace_name information needs to be passed in through parameters.
- The subscription should have a GPU cluster. This example was tested with a GPU cluster of SKU `Standard_ND40rs_v2`. See this document for creating a GPU cluster.
Download the config.json file from 2.1 to the `huggingface/script` directory, or append the run script below with your AzureML workspace information, such as `--workspace_name <your_workspace_name> --resource_group <resource_group> --subscription_id <your_subscription_id>`.
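Either way, the recipe ends up with an AzureML workspace handle. Conceptually this corresponds to the standard azureml-core calls below (a sketch, not necessarily the exact code in `hf-ort.py`):

```python
# Sketch: obtaining an AzureML Workspace handle from either source.
# Not necessarily the exact logic used by hf-ort.py.
from azureml.core import Workspace

# Option 1: config.json downloaded from the Azure Portal and placed next to the script.
ws = Workspace.from_config(path="config.json")

# Option 2: explicit values passed on the command line.
ws = Workspace(subscription_id="<your_subscription_id>",
               resource_group="<resource_group>",
               workspace_name="<your_workspace_name>")
```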
Here's an example of running bert-large with ORTModule. `hf-ort.py` builds a docker image based on the Dockerfile and submits the run script to AzureML according to the model and run configuration. The default docker image uses CUDA 11.1.
```bash
cd huggingface/script
python hf-ort.py --gpu_cluster_name <gpu_cluster_name> --hf_model bert-large --run_config ort
```
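The submission itself follows the usual azureml-core pattern; roughly something like the sketch below, where the experiment name, training command, and environment name are illustrative placeholders rather than what `hf-ort.py` actually uses:

```python
# Rough sketch of an AzureML job submission with azureml-core. The command,
# experiment name, and environment name are illustrative placeholders.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import PyTorchConfiguration

ws = Workspace.from_config()
src = ScriptRunConfig(
    source_directory=".",                          # directory with the training scripts
    command=["python", "finetune.py", "--fp16"],   # placeholder training command
    compute_target="<gpu_cluster_name>",
    environment=Environment.get(ws, "<env_built_from_dockerfile>"),
    distributed_job_config=PyTorchConfiguration(process_count=8, node_count=1),
)
run = Experiment(ws, "hf-ort-bert-large").submit(src)
run.wait_for_completion(show_output=True)
```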
To run different models with different configurations, check the tables below.
This table summarizes whether model changes are required.
Model | Performance Comparison | Model Change |
---|---|---|
bart-large | See BART | No model change required |
bert-large | See BERT | No model change required |
deberta-v2-xxlarge | See DeBERTa | See this commit |
distilbert-base | See DistilBERT | No model change required |
gpt2 | See GPT2 | No model change required |
roberta-large | See RoBERTa | See this commit |
t5-large | See T5 | See this PR |
Here are the different configs and their descriptions that the recipe script takes through the `--run_config` parameter.
Config | Description |
---|---|
pt-fp16 | PyTorch mixed precision |
ort | ORTModule mixed precision |
ds_s1 | PyTorch + DeepSpeed stage 1 |
ds_s1_ort | ORTModule + DeepSpeed stage 1 |
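For context, a DeepSpeed ZeRO stage 1 run is driven by a small JSON config along the lines of the sketch below (a generic stage 1 config; the recipe ships its own config file, which may differ):

```python
# Generic sketch of a DeepSpeed ZeRO stage 1 config, i.e. what the ds_s1 /
# ds_s1_ort run configs enable. The recipe's actual config file may differ.
import json

ds_config = {
    "zero_optimization": {"stage": 1},    # partition optimizer states across GPUs
    "fp16": {"enabled": True},            # mixed precision, matching the fp16 runs above
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_config_zero_1.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```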
Other parameters (please also see the parameters in `script/hf-ort.py`):
Name | Description |
---|---|
--model_batchsize | Model batch size per GPU |
--max_steps | Max number of steps the model will run |
--process_count | Total number of GPUs (not GPUs per node). Adjust this if the target cluster does not have 8 GPUs |
--node_count | Node count |
--skip_docker_build | Skip docker build (use the last built docker image saved in the AzureML environment) |
--use_cu102 | Use the CUDA 10.2 Dockerfile |
--local_run | Run the model locally; AzureML-related parameters will be ignored |
- Benchmark methodology: We report samples/sec on `ND40rs_v2` VMs (V100 32G x 8), CUDA 11, with the stable release `onnxruntime_training-1.8.0%2Bcu111-cp36-cp36m-manylinux2014_x86_64.whl`. A CUDA 10.2 option is also available through the `--use_cu102` flag. Please check the dependency details in the Dockerfile. We look at the metric `stable_train_samples_per_second` in the log, which discards the first step that includes setup time. Also note that since ORTModule takes some time for its initial setup, a smaller `--max_steps` value may lead to a longer total run time for ORTModule compared to PyTorch. However, if you want to see the fine-tuning finish faster, adjust `--max_steps` to a smaller value. Lastly, we do not recommend running this recipe on NC series VMs, which use an old architecture (K80).
- Cost and VM availability: The fine-tuning job runs for ~1 hr for the default 8000 steps on `ND40rs_v2` VMs, which costs $22.03/hr per run. Additional costs are Azure Container Registry costs for docker image storage, as well as Azure Storage costs for run history storage. Please note that `ND40rs_v2` is not publicly available by default. To get it, after the subscription is created, users need to create a support ticket here, after which the ND series will be available.
- On the first run, this script takes ~20 mins to submit the fine-tuning job because it builds a new docker image from the Dockerfile. The step that builds the docker image, `hf_ort_env.register(ws).build(ws).wait_for_completion()`, can be skipped by passing `--skip_docker_build` if it's not the first run (see the sketch below).
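That docker build step uses azureml-core's Environment API; in outline it looks something like this (the environment name and Dockerfile path are illustrative, not necessarily what `hf-ort.py` uses):

```python
# Sketch of the docker image build step referenced above, using azureml-core's
# Environment API. The environment name and Dockerfile path are illustrative.
from azureml.core import Environment, Workspace

ws = Workspace.from_config()
hf_ort_env = Environment.from_dockerfile("hf-ort-env", dockerfile="../docker/Dockerfile")

# Register the environment and build its image in the workspace's container
# registry; this is the ~20 min step that --skip_docker_build avoids later on.
hf_ort_env.register(ws).build(ws).wait_for_completion()
```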
- A machine with GPUs that you can access. This recipe was tested on a machine with 8 x 32G V100 GPUs.
- Know how many GPUs there are; this number needs to be passed to the `--process_count` parameter (see the snippet below).
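If you're unsure how many GPUs the machine has, `nvidia-smi` or a one-liner like the following gives the number to pass to `--process_count`:

```python
# Print the number of locally visible GPUs to pass to --process_count.
import torch
print(torch.cuda.device_count())
```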
Build the docker image:
```bash
cd huggingface/docker
sudo docker build -t hf-recipe-local-docker -f Dockerfile .
```
Run the built docker image.
- Replace `<onnxruntime-training-examples_path>` with your local full path to `onnxruntime-training-examples`. Usually it's located at `~/onnxruntime-training-examples/`.
- `-v /dev/shm:/dev/shm` mounts `/dev/shm` to `/dev/shm` inside the container. Similarly, `-v <onnxruntime-training-examples_path>:/onnxruntime-training-examples` mounts `<onnxruntime-training-examples_path>` to `/onnxruntime-training-examples/` inside the container.
```bash
sudo docker run -it -v /dev/shm:/dev/shm -v <onnxruntime-training-examples_path>:/onnxruntime-training-examples --gpus all hf-recipe-local-docker
```
Run the `hf-ort.py` script.
- Remember to pass the number of GPUs available locally to the `--process_count` parameter.
- Depending on the memory available to the local GPUs, you might need to override the default batch size by passing in `--model_batchsize`.
- `--local_run` runs the script locally.
```bash
cd /onnxruntime-training-examples/huggingface/script/
python hf-ort.py --hf_model {hf_model} --run_config {run_config} --process_count <process_count> --local_run
```
If there's an Azure authentication issue, install the Azure CLI here and run `az login --use-device-code`.
The issue is most likely caused by hitting a hardware limitation on the target; this can be mitigated by using the following switches:
- `--model_batchsize`: change to a smaller batch size
- `--process_count`: change the number of GPUs to activate
```bash
python hf-ort.py --hf_model bart-large --run_config pt-fp16 --process_count 1 --local_run --model_batchsize 1 --max_steps 20
```
RoBERTa & DeBERTa currently decommissioned from the hf-ort.py script because of unresolved issues.
RoBERTa currently requires ORT >= 1.12.0 according to this issue (#11268) which was resolved in ORT 1.12.0. However, running with ORT 1.12.0 with the PTCA Docker container and on the specified machine for benchmarking causes this issue (#12312).
DeBERTa has the following unresolved issues when using Optimum's ORTTrainer: #15 and #305