This topic describes best practices for accelerating training with HuggingFace Transformers. TorchAcc supports accelerating native HuggingFace Transformers; for users of the HF Trainer interface, it is straightforward to speed up Transformers model training with TorchAcc.
The following uses the run_clm.py example script from the Transformers library to train Llama3-8B, demonstrating how to accelerate Transformers training with TorchAcc. For comparison, we also train the same model with the run_clm.py script using native PyTorch and DeepSpeed.
Refer to the "install" section to obtain the latest image:
sudo docker run --gpus all --net host --ipc host --shm-size 10G -it --rm --cap-add=SYS_PTRACE dsw-registry.cn-hangzhou.cr.aliyuncs.com/pai/acc:r2.3.0-cuda12.1.0-py3.10-nightly bash
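After starting the container, you can verify that all GPUs are visible (a quick sanity check):
nvidia-smi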
Since we are running the run_clm.py script that ships with Transformers, the Transformers library needs to be installed from source:
# Uninstall Transformers already installed in the image
pip uninstall transformers -y
# Clone and install Transformers
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
# Install related dependencies
pip install evaluate scikit-learn
# If DeepSpeed training is needed
pip install deepspeed
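After installation, you can verify that the editable install is the one being imported (a quick check; the exact version depends on the cloned commit):
python -c "import transformers; print(transformers.__version__, transformers.__file__)"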
If your network cannot access HuggingFace, refer to the following instructions to download the model and dataset manually. Otherwise, you can skip the Model Preparation and Dataset Preparation sections.
You can download the Meta-Llama-3-8B model from the model repositories of HuggingFace or ModelScope. Here's an example using ModelScope, with the model repository link: https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B.
You can clone the model configuration and weights to your local machine using git clone and store them in a specified directory (such as Meta-Llama-3-8B):
apt-get update && apt-get install -y git-lfs
# Initialize Git LFS so the large weight files are fetched on clone
git lfs install
git clone https://www.modelscope.cn/LLM-Research/Meta-Llama-3-8B.git
We use the wikitext dataset to train the model. For detailed information about the dataset, visit: https://www.modelscope.cn/datasets/modelscope/wikitext/
You can download the dataset and place it in a specified directory.
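For example, you can clone the dataset repository from ModelScope (a sketch, assuming ModelScope's usual /datasets/<org>/<name>.git clone URL pattern and that git-lfs is installed as above):
git clone https://www.modelscope.cn/datasets/modelscope/wikitext.git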
In the run_clm.py file, find the AutoModelForCausalLM.from_pretrained() and AutoModelForCausalLM.from_config() calls and add attn_implementation="flash_attention_2" to enable FlashAttention computation:
diff --git a/examples/pytorch/language-modeling/run_clm.py b/examples/pytorch/language-modeling/run_clm.py
index c0db57037..dc8e3040a 100755
--- a/examples/pytorch/language-modeling/run_clm.py
+++ b/examples/pytorch/language-modeling/run_clm.py
@@ -434,9 +436,10 @@ def main():
             trust_remote_code=model_args.trust_remote_code,
             torch_dtype=torch_dtype,
             low_cpu_mem_usage=model_args.low_cpu_mem_usage,
+            attn_implementation='flash_attention_2'
         )
     else:
-        model = AutoModelForCausalLM.from_config(config, trust_remote_code=model_args.trust_remote_code)
+        model = AutoModelForCausalLM.from_config(config, trust_remote_code=model_args.trust_remote_code, attn_implementation='flash_attention_2')
         n_params = sum({p.data_ptr(): p.numel() for p in model.parameters()}.values())
         logger.info(f"Training new model from scratch - Total size={n_params/2**20:.2f}M params")
You can follow these steps to run native PyTorch FSDP training of Transformers Llama3-8B.
When running FSDP training tasks with run_clm.py, you need to provide an FSDP configuration file (llama3_fsdp_native.json in the script below):
Note: Due to a bug in Transformers, activation checkpointing cannot be enabled for this task.
{
"fsdp_transformer_layer_cls_to_wrap": [
"LlamaDecoderLayer"
],
"activation_checkpointing": false
}
We use the run_clm.py file from the Transformers library directly, specifying parameters on the command line without writing additional code. run_clm.py encapsulates the logic of the Transformers Trainer and exposes a variety of parameters; for the full list, run python run_clm.py --help.
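For example, to show just the FSDP-related options (assuming you run from the Transformers repository root, as the launch scripts below do):
python ./examples/pytorch/language-modeling/run_clm.py --help | grep -i fsdp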
Use torchrun to start the training. The specific command is as follows:
set -ex
echo "Running a native torch job ..."
export USE_TORCH_XLA=0
[ -z "$RANK" ] && RANK=0
[ -z "$WORLD_SIZE" ] && WORLD_SIZE=1
[ -z "$MASTER_ADDR" ] && MASTER_ADDR=127.0.0.1
[ -z "$MASTER_PORT" ] && MASTER_PORT=9010
BS=1
SEQLEN=4096
NPROC_PER_NODE=8
PRECISION="bf16=true"
FSDP_CONFIG="llama3_fsdp_native.json"
JOB_NAME="LLAMA3_FSDP_NATIVE_GPU${NPROC_PER_NODE}_BS${BS}_SEQLEN${SEQLEN}_BF16_FA"
torchrun --nproc_per_node $NPROC_PER_NODE \
--nnodes $WORLD_SIZE \
--node_rank $RANK \
--master_port $MASTER_PORT \
--master_addr $MASTER_ADDR \
       ./examples/pytorch/language-modeling/run_clm.py \
--num_train_epochs 2 \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1 \
--use_fast_tokenizer false \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--do_train \
--output_dir /tmp/test-clm \
--overwrite_output_dir \
--config_name ./Meta-Llama-3-8B/ \
--tokenizer_name ./Meta-Llama-3-8B/ \
--trust_remote_code true \
--cache_dir ./cache \
--block_size $SEQLEN \
--optim adamw_torch \
--save_strategy no \
--logging_strategy steps \
--gradient_checkpointing no \
--logging_steps 100 \
--$PRECISION \
--fsdp "auto_wrap" \
--fsdp_config $FSDP_CONFIG 2>&1 | tee ./$JOB_NAME.log
Next, we run the same task with DeepSpeed. Specific configuration details can be found in the official documentation: https://www.deepspeed.ai/docs/config-json/. To align with the native PyTorch and TorchAcc FSDP training tasks, we configure the DeepSpeed file as follows:
- The ZeRO stage 3 training strategy is equivalent to FSDP.
- train_batch_size = train_micro_batch_size_per_gpu × the number of training GPUs, and should match the batch size in the training script (here 1 × 8 = 8).
- The Transformers library does not recognize activation checkpointing configured in the DeepSpeed config, so no activation checkpointing configuration is needed.
{
"train_batch_size": 8,
"train_micro_batch_size_per_gpu": 1,
"optimizer": {
"type": "AdamW"
},
"zero_optimization": {
"stage": 3
},
"bf16": {
"enabled": true
}
}
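DeepSpeed checks the batch-size relation above at startup and errors out if the values disagree. A minimal sanity check (a sketch, assuming the file name llama3_fsdp_ds.json and no gradient accumulation):
python - <<'EOF'
import json

# world size = NPROC_PER_NODE * nnodes in the launch script below
world_size = 8
cfg = json.load(open("llama3_fsdp_ds.json"))
expected = cfg["train_micro_batch_size_per_gpu"] * world_size
assert cfg["train_batch_size"] == expected, (cfg["train_batch_size"], expected)
print("Batch size configuration is consistent:", expected)
EOF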
set -ex
echo "Running a deepspeed job ..."
export USE_TORCH_XLA=0
[ -z "$RANK" ] && RANK=0
[ -z "$WORLD_SIZE" ] && WORLD_SIZE=1
[ -z "$MASTER_ADDR" ] && MASTER_ADDR=127.0.0.1
[ -z "$MASTER_PORT" ] && MASTER_PORT=9010
BS=1
SEQLEN=4096
NPROC_PER_NODE=8
PRECISION="bf16=true"
FSDP_CONFIG="llama3_fsdp_ds.json"
JOB_NAME="LLAMA3_FSDP_DEEPSPEED_GPU${NPROC_PER_NODE}_BS${BS}_SEQLEN${SEQLEN}_BF16_FA"
torchrun --nproc_per_node $NPROC_PER_NODE \
--nnodes $WORLD_SIZE \
--node_rank $RANK \
--master_port $MASTER_PORT \
--master_addr $MASTER_ADDR \
./examples/pytorch/language-modeling/run_clm.py \
--num_train_epochs 2 \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1 \
--use_fast_tokenizer false \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--do_train \
--output_dir /tmp/test-clm \
--overwrite_output_dir \
--config_name ./Meta-Llama-3-8B/ \
--tokenizer_name ./Meta-Llama-3-8B/ \
--trust_remote_code true \
--cache_dir ./cache \
--block_size $SEQLEN \
--optim adamw_torch \
--save_strategy no \
--logging_strategy steps \
--gradient_checkpointing no \
--logging_steps 100 \
--$PRECISION \
       --deepspeed $DS_CONFIG 2>&1 | tee ./$JOB_NAME.log
If you want to accelerate the training of Transformers Llama3-8B with TorchAcc, you need to make the following changes:
Open the examples/pytorch/language-modeling/run_clm.py file and insert the following code at the very top:
# Patch the HuggingFace Llama implementation so that TorchAcc can accelerate it
import torchacc
torchacc.utils.patch.patch_llama(True)
Then provide the FSDP configuration file (llama3_fsdp_acc.json in the script below). You can toggle the xla_fsdp_grad_ckpt parameter to enable or disable gradient checkpointing:
{
"fsdp_transformer_layer_cls_to_wrap": [
"LlamaDecoderLayer"
],
"xla": true,
"xla_fsdp_settings": {
"compute_dtype": "bfloat16",
"buffer_dtype": "bfloat16",
"opt_flatten_overlap": true,
"pin_layout_in_collective_ops": false,
"flatten_parameters": true
},
"xla_fsdp_grad_ckpt": false
}
set -ex
echo "Running a torch job with torchacc ..."
export PJRT_ALLOCATOR_FRACTION=0.97
export PJRT_DEVICE=CUDA
#export XLA_PERSISTENT_CACHE_PATH=./compiled_cache # uncomment this line to cache the compile results and speed up initialization.
[ -z "$RANK" ] && RANK=0
[ -z "$WORLD_SIZE" ] && WORLD_SIZE=1
[ -z "$MASTER_ADDR" ] && MASTER_ADDR=127.0.0.1
[ -z "$MASTER_PORT" ] && MASTER_PORT=9010
BS=3
SEQLEN=4096
NPROC_PER_NODE=8
PRECISION="bf16=true"
FSDP_CONFIG="llama3_fsdp_acc.json"
JOB_NAME="LLAMA3_FSDP_TORCHACC_GPU${NPROC_PER_NODE}_BS${BS}_SEQLEN${SEQLEN}_BF16_FA"
torchrun --nproc_per_node $NPROC_PER_NODE \
--nnodes $WORLD_SIZE \
--node_rank $RANK \
--master_port $MASTER_PORT \
--master_addr $MASTER_ADDR \
./examples/pytorch/language-modeling/run_clm.py \
--num_train_epochs 2 \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1 \
--use_fast_tokenizer false \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--do_train \
--output_dir /tmp/test-clm \
--overwrite_output_dir \
--config_name ./Meta-Llama-3-8B/ \
--tokenizer_name ./Meta-Llama-3-8B/ \
--trust_remote_code true \
--cache_dir ./cache \
--block_size $SEQLEN \
--optim adamw_torch \
--save_strategy no \
--logging_strategy steps \
--gradient_checkpointing no \
--logging_steps 100 \
--$PRECISION \
--fsdp "auto_wrap" \
--fsdp_config $FSDP_CONFIG 2>&1 | tee ./$JOB_NAME.log
The following compares the configurations tested for native PyTorch, DeepSpeed, and TorchAcc, in order to select the optimal performance configuration for each framework.
Experimental parameters:
- flash_attn==2.5.6
- Sequence length: 4096
- Compute resources: 8 × 80 GB A100
- transformers commit hash: f91c16d270e5e3ff32fdb32ccf286d05c03dfa66
Measured throughput:
| Global Batch Size | PyTorch | DeepSpeed | TorchAcc |
| --- | --- | --- | --- |
| 8 | 2945.0 tokens/s/GPU | 3123.2 tokens/s/GPU | 3276.8 tokens/s/GPU |
| 16 | OOM | OOM | 3737.6 tokens/s/GPU |
| 24 | OOM | OOM | 4044.8 tokens/s/GPU |
- Optimal PyTorch configuration: global batch size 8 + FlashAttention, no gradient checkpointing. Throughput: 2945.0 tokens/s/GPU.
- Optimal DeepSpeed configuration: global batch size 8 + FlashAttention, no gradient checkpointing. Throughput: 3123.2 tokens/s/GPU, a 6% improvement over PyTorch's optimum.
- Optimal TorchAcc configuration: global batch size 24 + FlashAttention, no gradient checkpointing. Throughput: 4044.8 tokens/s/GPU, a 37% improvement over PyTorch's optimum and a 30% improvement over DeepSpeed's optimum.