DeepSpeed support

zhezhaoa edited this page Sep 26, 2022 · 2 revisions

TencentPretrain integrates DeepSpeed and supports pre-training, fine-tuning, and inference of gigantic models.

Pre-training

To use DeepSpeed, specify --deepspeed and the path to the DeepSpeed configuration file (--deepspeed_config). This section uses the gigantic models from Megatron-LM as examples to demonstrate how to use DeepSpeed in TencentPretrain. Note that Megatron BERT and Megatron GPT-2 use pre-layernorm.
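The file passed to --deepspeed_config is a JSON document of DeepSpeed options. The sketch below writes a small example of such a file with Python's standard library. The keys are standard DeepSpeed configuration options, but the values and the file name my_deepspeed_config.json are illustrative assumptions. They are not the contents of models/deepspeed_config.json, and how they interact with --batch_size is not covered here.

```python
import json
import tempfile
from pathlib import Path

# Illustrative values only (assumptions); NOT the contents of
# models/deepspeed_config.json shipped with TencentPretrain.
config = {
    "train_micro_batch_size_per_gpu": 1,  # per-GPU batch size seen by DeepSpeed
    "gradient_accumulation_steps": 1,
    "steps_per_print": 100,
    "fp16": {"enabled": True},            # mixed-precision training
    "zero_optimization": {"stage": 2},    # ZeRO optimization stage
}


def write_config(cfg, path):
    """Serialize a DeepSpeed option dict to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(cfg, indent=2) + "\n", encoding="utf-8")
    return path


# Hypothetical file name, written to a temporary directory for the demo.
config_path = write_config(config, Path(tempfile.mkdtemp()) / "my_deepspeed_config.json")
print(config_path.read_text(encoding="utf-8"))
```

A file produced this way could then be passed to pretrain.py through --deepspeed_config in place of models/deepspeed_config.json.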

Megatron BERT:

An example of using DeepSpeed to pre-train Megatron BERT on a single machine with 8 GPUs:

python3 preprocess.py --corpus_path corpora/CLUECorpusSmall_5000_lines.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --dynamic_masking \
                      --data_processor mlm

deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
                      --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                      --config_path models/megatron/bert_3.9B_config.json \
                      --output_model_path models/output_model \
                      --world_size 8 --batch_size 16 \
                      --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 100 --deep_init

An example of loading a PyTorch model and continuing training from it:

deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
                      --pretrained_model_path models/input_model.bin \
                      --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                      --config_path models/megatron/bert_3.9B_config.json \
                      --output_model_path models/output_model \
                      --world_size 8 --batch_size 16 \
                      --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 100 --deep_init

An example of pre-training on two machines, each with 8 GPUs (16 GPUs in total). A hostfile.txt is required, with one line per machine in the format ip slots=the number of GPUs. For example:

1.1.1.1 slots=8
2.2.2.2 slots=8

When training on multiple machines, the scripts only need to be run on the master node.
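The hostfile format shown above (ip slots=the number of GPUs) can be generated and checked with a few lines of Python. This is a minimal sketch: the helper names format_hostfile and total_slots are hypothetical and are not part of TencentPretrain or DeepSpeed. The summed slot count matches the --world_size 16 used in the two-machine examples.

```python
def format_hostfile(hosts):
    """Render (address, gpu_count) pairs as 'address slots=gpu_count' lines."""
    return "".join(f"{address} slots={slots}\n" for address, slots in hosts)


def total_slots(hostfile_text):
    """Sum the slots= values over all non-empty hostfile lines."""
    total = 0
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line:
            continue
        _address, slots_field = line.split()
        key, value = slots_field.split("=")
        if key != "slots":
            raise ValueError(f"unexpected field: {slots_field!r}")
        total += int(value)
    return total


hostfile_text = format_hostfile([("1.1.1.1", 8), ("2.2.2.2", 8)])
print(hostfile_text, end="")
print(total_slots(hostfile_text))  # 16
```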

python3 preprocess.py --corpus_path corpora/CLUECorpusSmall_5000_lines.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --dynamic_masking \
                      --data_processor mlm

deepspeed --hostfile=hostfile.txt pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
                                              --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                                              --config_path models/megatron/bert_3.9B_config.json \
                                              --output_model_path models/output_model \
                                              --world_size 16 --batch_size 16 \
                                              --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 100 --deep_init

Megatron GPT-2:

An example of using DeepSpeed to pre-train Megatron GPT-2 on a single machine with 8 GPUs:

python3 preprocess.py --corpus_path corpora/CLUECorpusSmall_5000_lines.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor lm

deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
                      --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                      --config_path models/megatron/gpt2_8.3B_config.json \
                      --output_model_path models/output_model \
                      --world_size 8 --batch_size 4 \
                      --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 100 --deep_init

An example of pre-training on two machines, each with 8 GPUs (16 GPUs in total):

python3 preprocess.py --corpus_path corpora/CLUECorpusSmall_5000_lines.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor lm

deepspeed --hostfile=hostfile.txt pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
                                              --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                                              --config_path models/megatron/gpt2_8.3B_config.json \
                                              --output_model_path models/output_model \
                                              --world_size 16 --batch_size 4 \
                                              --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 100 --deep_init

DeepSpeed model conversion

After pre-training, the pre-trained model and the conversion script zero_to_fp32.py are saved in the models/output_model folder. zero_to_fp32.py converts the pre-trained model from DeepSpeed format to PyTorch format.

usage: zero_to_fp32.py [-h] checkpoint_dir output_file

positional arguments:
  checkpoint_dir path to the deepspeed checkpoint folder, e.g., path/checkpoint-1/global_step1
  output_file path to the pytorch fp32 state_dict output file (e.g.,path/checkpoint-1/pytorch_model.bin)

optional arguments:
  -h, --help show this help message and exit

An example of converting Megatron BERT:

python3 models/output_model/zero_to_fp32.py models/output_model/10000 models/output_model/megatron_bert.bin-10000
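The conversion command follows a fixed pattern: the script, the checkpoint folder, and the output file all live under the output model folder. The sketch below assembles that command from its parts. The helper name conversion_command is hypothetical, and it assumes the checkpoint folder is named after the checkpoint tag (10000 in the command above).

```python
import shlex


def conversion_command(output_dir, tag, output_file):
    """Build the zero_to_fp32.py argument list for one saved checkpoint.

    Assumes the layout used in the example above: the script and the
    checkpoint folder (named after the tag) both sit inside output_dir.
    """
    return [
        "python3",
        f"{output_dir}/zero_to_fp32.py",
        f"{output_dir}/{tag}",
        f"{output_dir}/{output_file}",
    ]


cmd = conversion_command("models/output_model", 10000, "megatron_bert.bin-10000")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to run the conversion from a script.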

Fine-tuning

finetune/run_classifier_deepspeed.py is used to fine-tune gigantic models with DeepSpeed. run_classifier_deepspeed.py and the regular classification script run_classifier.py have the following differences:

  • In run_classifier_deepspeed.py, --world_size specifies the number of GPUs.
  • In run_classifier_deepspeed.py, the actual batch size is --batch_size times --world_size. In run_classifier.py, the actual batch size is --batch_size.
  • run_classifier_deepspeed.py saves the fine-tuned model after every epoch and places the checkpoints in the path specified by --output_model_path.
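The batch-size rule in the list above is a simple product, which is worth checking when moving between cluster sizes. A minimal sketch, with effective_batch_size as a hypothetical helper name: the two fine-tuning examples on this page (--batch_size 8 with --world_size 8, and --batch_size 4 with --world_size 16) give the same actual batch size.

```python
def effective_batch_size(batch_size, world_size):
    """Actual batch size in run_classifier_deepspeed.py: batch_size * world_size."""
    if batch_size < 1 or world_size < 1:
        raise ValueError("batch_size and world_size must be positive")
    return batch_size * world_size


print(effective_batch_size(8, 8))   # single machine, 8 GPUs -> 64
print(effective_batch_size(4, 16))  # two machines, 16 GPUs  -> 64
```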

An example of using DeepSpeed to fine-tune Megatron BERT on a single machine with 8 GPUs:

deepspeed finetune/run_classifier_deepspeed.py --pretrained_model_path models/output_model/megatron_bert.bin-10000 \
                                               --deepspeed_config models/deepspeed_config.json \
                                               --vocab_path models/google_zh_vocab.txt \
                                               --config_path models/megatron/bert_3.9B_config.json \
                                               --output_model_path models/classifier_model \
                                               --train_path  datasets/chnsenticorp/train.tsv \
                                               --dev_path datasets/chnsenticorp/dev.tsv \
                                               --test_path datasets/chnsenticorp/test.tsv \
                                               --epochs_num 3 --batch_size 8  --world_size 8

An example of fine-tuning on two machines, each with 8 GPUs (16 GPUs in total):

deepspeed --hostfile=hostfile.txt finetune/run_classifier_deepspeed.py --pretrained_model_path models/output_model/megatron_bert.bin-10000 \
                                                                       --deepspeed_config models/deepspeed_config.json \
                                                                       --vocab_path models/google_zh_vocab.txt \
                                                                       --config_path models/megatron/bert_3.9B_config.json \
                                                                       --output_model_path models/classifier_model \
                                                                       --train_path  datasets/chnsenticorp/train.tsv \
                                                                       --dev_path datasets/chnsenticorp/dev.tsv \
                                                                       --test_path datasets/chnsenticorp/test.tsv \
                                                                       --epochs_num 3 --batch_size 4  --world_size 16

Then we convert the fine-tuned model from DeepSpeed format to PyTorch format:

python3 models/classifier_model/zero_to_fp32.py models/classifier_model/3 models/classifier_model/megatron_bert_classifier.bin

Inference

run_classifier_deepspeed_infer.py runs inference on gigantic models with DeepSpeed. --mp_size specifies the number of GPUs used for model parallelism.

An example of using DeepSpeed for Megatron BERT inference on a single machine with 8 GPUs:

deepspeed finetune/run_classifier_deepspeed_infer.py --load_model_path models/classifier_model/megatron_bert_classifier.bin \
                                                     --vocab_path models/google_zh_vocab.txt \
                                                     --config_path models/megatron/bert_3.9B_config.json \
                                                     --test_path datasets/chnsenticorp/test_nolabel.tsv \
                                                     --prediction_path prediction.txt --labels_num 2 \
                                                     --mp_size 8

An example of inference on two machines, each with 8 GPUs (16 GPUs in total):

deepspeed --hostfile=hostfile.txt finetune/run_classifier_deepspeed_infer.py --load_model_path models/classifier_model/megatron_bert_classifier.bin \
                                                                             --vocab_path models/google_zh_vocab.txt \
                                                                             --config_path models/megatron/bert_3.9B_config.json \
                                                                             --test_path datasets/chnsenticorp/test_nolabel.tsv \
                                                                             --prediction_path prediction.txt  --labels_num 2 \
                                                                             --mp_size 16

Text generation

generate_lm_deepspeed.py generates text with gigantic language models. The model continues the text from a given beginning. --mp_size specifies the number of GPUs used for model parallelism.

An example of using DeepSpeed to generate text with Megatron GPT-2 on a single machine with 8 GPUs:

deepspeed scripts/generate_lm_deepspeed.py --load_model_path models/megatron_gpt2.bin \
                                           --vocab_path models/google_zh_vocab.txt \
                                           --config_path models/megatron/gpt2_8.3B_config.json \
                                           --test_path beginning.txt  --prediction_path generated_sentence.txt \
                                           --mp_size 8

An example of generating text on two machines, each with 8 GPUs (16 GPUs in total):

deepspeed --hostfile=hostfile.txt scripts/generate_lm_deepspeed.py --load_model_path models/megatron_gpt2.bin \
                                                                   --vocab_path models/google_zh_vocab.txt \
                                                                   --config_path models/megatron/gpt2_8.3B_config.json \
                                                                   --test_path beginning.txt  --prediction_path generated_sentence.txt \
                                                                   --mp_size 16

generate_seq2seq_deepspeed.py generates text with gigantic seq2seq models. An example of generating text on a single machine with 8 GPUs:

deepspeed scripts/generate_seq2seq_deepspeed.py --load_model_path models/input_model.bin \
                                                --vocab_path models/google_zh_vocab.txt \
                                                --config_path models/encoder_decoder_config.json \
                                                --test_path input.txt --prediction_path output.txt \
                                                --mp_size 8

An example of generating text on two machines, each with 8 GPUs (16 GPUs in total):

deepspeed --hostfile=hostfile.txt scripts/generate_seq2seq_deepspeed.py --load_model_path models/input_model.bin \
                                                                        --vocab_path models/google_zh_vocab.txt \
                                                                        --config_path models/encoder_decoder_config.json \
                                                                        --test_path input.txt  --prediction_path output.txt \
                                                                        --mp_size 16