Intel Gaudi's Megatron DeepSpeed Large Language Models for training

imangohari1/Megatron-DeepSpeed

 
 


LLM for PyTorch

This directory provides scripts to train the GPT-based LLaMA and Mixtral models in the Megatron-DeepSpeed repository on the Intel® Gaudi® 2 AI accelerator. Before you get started, make sure to review the Supported Configuration.

Model Overview

This implementation is based on https://github.com/microsoft/Megatron-DeepSpeed at commit 3c5f47563f697702c1e305fa01b7563f54b747fc. Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for training large transformer language models such as LLaMA at scale. The codebase can efficiently train very large language models (hundreds of billions of parameters) with both model and data parallelism.

How to use

Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.

Setup

Please follow the instructions provided in the Intel Gaudi Installation Guide to set up the environment including the $PYTHON environment variable. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform guide. The guides will walk you through the process of setting up your system to run the model on Gaudi 2.

Install Intel Gaudi DeepSpeed

Please follow the instructions provided in the DeepSpeed Installation Guide to install deepspeed.

Clone Intel Gaudi Megatron-DeepSpeed

In the docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. You can run the hl-smi utility to determine the Intel Gaudi software version.

git clone -b [Intel Gaudi software version] https://github.com/HabanaAI/Megatron-DeepSpeed
export MEGATRON_DEEPSPEED_ROOT=/path/to/Megatron-DeepSpeed
export PYTHONPATH=$MEGATRON_DEEPSPEED_ROOT:$PYTHONPATH

Install Megatron-DeepSpeed Requirements

  • In the docker container, go to the Megatron-DeepSpeed directory:

    cd $MEGATRON_DEEPSPEED_ROOT
  • Install the required packages using pip:

    pip install -r megatron/core/requirements.txt
  • To run training on more than 128 cards, apply the following configuration changes:

    echo '*    soft nofile  unlimited' >> /etc/security/limits.conf
    echo '*    hard nofile  unlimited' >> /etc/security/limits.conf
    echo 'root soft nofile  unlimited' >> /etc/security/limits.conf
    echo 'root hard nofile  unlimited' >> /etc/security/limits.conf
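
Entries in limits.conf are normally applied when a new session starts, so it can be worth confirming what the training process will actually see. The following is a minimal sketch using Python's standard `resource` module (Unix only); the function name `nofile_limits` is illustrative and is not part of this repository.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def show(value):
        # RLIM_INFINITY is how the OS reports an unlimited value.
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return show(soft), show(hard)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"nofile soft={soft} hard={hard}")
```

Run it with $PYTHON from the same shell that will launch training, since limits are inherited from the parent process.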

Dataset Preparation

Follow the instructions in https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar to download the full oscar-en dataset. Note that the dataset takes around 550 GB of disk space. This dataset is used for training LLaMA and LLaMA 2.

Dataset Preparation Examples

The steps below show how to prepare your dataset. They are based on the instructions in https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar. This example uses the zh (Chinese) subset.

Step 0:

git clone https://github.com/bigscience-workshop/bigscience.git
cd bigscience/data/oscar
# In oscar-to-jsonl.py, edit the language_subsets list: uncomment unshuffled_deduplicated_zh and comment out unshuffled_deduplicated_en
vi oscar-to-jsonl.py

Step 1:

# add -s to process only a subset of the data
$PYTHON oscar-to-jsonl.py

Step 2:

mkdir -p zh
mv oscar*.jsonl zh
cd zh
cat oscar-[0-4].jsonl > oscar-zh.jsonl
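
Before spending time on tokenization, it may help to sanity-check the concatenated file. The sketch below counts lines that parse as JSON objects with a non-empty field. It assumes the documents are stored under a `text` key, which is not stated in this README, so adjust `key` if your files differ. The function name is hypothetical.

```python
import json


def count_jsonl_records(path, key="text"):
    """Count lines that parse as JSON objects containing a non-empty `key`.

    Returns (valid, invalid) line counts; blank lines are skipped.
    """
    valid = invalid = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                invalid += 1
                continue
            # "text" as the default key is an assumption, see the note above.
            if isinstance(record, dict) and record.get(key):
                valid += 1
            else:
                invalid += 1
    return valid, invalid


if __name__ == "__main__":
    print(count_jsonl_records("oscar-zh.jsonl"))
```

A non-zero invalid count suggests that one of the input shards was truncated or that the files were concatenated incorrectly.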

Step 3:

Use one of the three methods below to tokenize the dataset. You can use any number of workers based on the CPU cores.

  • Tokenize the dataset using GPT2BPETokenizer:
# download gpt2 vocab and merge files
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

# use the tokenized files generated from this command to train
$PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/preprocess_data.py --input oscar-zh.jsonl --output-prefix tokenized --tokenizer-type GPT2BPETokenizer --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt --append-eod --workers 64
  • Tokenize the dataset using GPTSentencePieceTokenizer:
# download the tokenizer.model that matches the model you intend to train
# use the tokenized files generated from this command to train
$PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/preprocess_data.py --input oscar-zh.jsonl --output-prefix tokenized --tokenizer-type GPTSentencePieceTokenizer --tokenizer-model /path/to/tokenizer.model --append-eod --workers 28
  • Tokenize the dataset using HFTokenizer:
# the tokenizer path can be a local directory; to run custom code from it, pass the --trust-remote-code option
#  or
# the tokenizer path can be the name of a Hugging Face repository (model card)
# if the Hugging Face repository is gated, log in with a token from huggingface.co/settings/tokens using the command below
# huggingface-cli login
# --seq-length must be passed explicitly; take the value from model_max_length in tokenizer_config.json (in the Hugging Face repository or the local directory)

# use the tokenized files generated from this command to train
$PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/preprocess_data.py --input oscar-zh.jsonl --output-prefix tokenized --tokenizer-type HFTokenizer --tokenizer-model /path/to/tokenizer --append-eod --workers 4 --seq-length 1000000000000000019884624838656
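
The comments above say that the --seq-length value comes from model_max_length in tokenizer_config.json. For a local tokenizer directory, that field can be read with a few lines of standard-library Python. This is a minimal sketch, and the function name is illustrative. As a side note, the value used in the example command equals int(1e30), which Hugging Face tokenizers commonly report when no maximum length is configured.

```python
import json
from pathlib import Path


def read_model_max_length(tokenizer_dir):
    """Return model_max_length from tokenizer_config.json, or None if the field is absent."""
    config_path = Path(tokenizer_dir) / "tokenizer_config.json"
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    value = config.get("model_max_length")
    # The field may be stored as an integer or as a float such as 1e30.
    return None if value is None else int(value)


if __name__ == "__main__":
    print(read_model_max_length("/path/to/tokenizer"))
```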

Training Script Settings

  • Based on the tokenization method, update the tokenizer type:
    HL_TOKENIZER_TYPE=GPT2BPETokenizer
    
  • To run custom tokenizer code from local path using HFTokenizer method:
    HL_TRUST_REMOTE_CODE=1
    
  • Update data root dir with the path of your choice:
    HL_DATA_DIR_ROOT=/data/bigscience/oscar-en
    
  • Update the data file prefix (for the *.bin and *.idx files) based on the file names in the data root dir:
    HL_DATA_FILE_PREFIX=tokenized_text_document
    
  • Update the tokenizer.model file path if it is not in the data root dir. This is required for any SentencePiece-based tokenizer:
    HL_TOKENIZER_MODEL=path/to/tokenizer.model
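
A quick way to catch a wrong HL_DATA_DIR_ROOT or HL_DATA_FILE_PREFIX before launching a job is to check that the matching *.bin and *.idx files exist. The helper below is a hypothetical sketch, not part of the training scripts.

```python
from pathlib import Path


def missing_dataset_files(data_dir, file_prefix):
    """Return the expected .bin/.idx paths that do not exist under data_dir."""
    expected = [Path(data_dir) / f"{file_prefix}{suffix}" for suffix in (".bin", ".idx")]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Values taken from the settings shown above.
    missing = missing_dataset_files("/data/bigscience/oscar-en", "tokenized_text_document")
    print("missing files:", missing or "none")
```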
    

Note: For the training commands, make sure to change the IP addresses in hostsfile according to your setup. HL_RESULTS_DIR and HL_DATA_DIR_ROOT must be shared writable across all nodes and launchers when running training on more than 8 cards. The same applies to HL_CHECKPOINTS_DIR, HL_TENSORBOARD_DIR and HL_KILL_SWITCH if specified. If HL_DATA_DIR_ROOT is not writable, then HL_DATA_CACHE_DIR must be set to a writable location and must be shared and accessible across all nodes and launchers when running training on more than 8 cards.
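
The writability requirement in the note can be spot-checked on each node with a short script. The sketch below only tests the local node and does not verify that a path is actually shared between nodes. The function name and the default list of variables are illustrative choices.

```python
import os


def unwritable_dirs(env, names=("HL_RESULTS_DIR", "HL_DATA_DIR_ROOT")):
    """Return {variable: path} for set variables whose directory is missing or not writable."""
    problems = {}
    for name in names:
        path = env.get(name)
        if not path:
            continue  # unset variables are left to the script defaults
        if not (os.path.isdir(path) and os.access(path, os.W_OK)):
            problems[name] = path
    return problems


if __name__ == "__main__":
    print(unwritable_dirs(os.environ) or "all set directories are writable")
```

Pass additional names such as HL_CHECKPOINTS_DIR or HL_DATA_CACHE_DIR through `names` when those variables are used.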

LLaMA Training and Examples

Multi-Card Training Examples

  • Run LLaMA 2 13B on 8 HPUs with BF16 precision:

    HL_NUM_NODES=1 HL_PP=2 HL_TP=2 HL_DP=2 scripts/run_llama.sh
    
  • Run LLaMA 2 13B on 64 HPUs with BF16 precision:

    HL_HOSTSFILE=scripts/hostsfile HL_NUM_NODES=8 HL_PP=2 HL_TP=2 HL_DP=16 scripts/run_llama.sh
    
  • Run LLaMA 2 70B on 32 HPUs with BF16 precision:

    HL_HOSTSFILE=scripts/hostsfile HL_LLAMA_MODEL_SIZE=70 HL_NUM_NODES=4 HL_PP=4 HL_TP=8 HL_DP=1 scripts/run_llama.sh
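
In each example above, HL_PP × HL_TP × HL_DP equals the number of HPUs, with 8 HPUs per node. When adapting the commands to another cluster size, the same arithmetic can be checked up front. This is a minimal sketch of that check. The function is hypothetical, and the 8-devices-per-node default is inferred from the examples rather than read from the scripts.

```python
def check_layout(num_nodes, pp, tp, dp, devices_per_node=8):
    """Return the world size if PP * TP * DP matches the available devices."""
    available = num_nodes * devices_per_node
    required = pp * tp * dp
    if required != available:
        raise ValueError(
            f"PP*TP*DP = {required}, but {num_nodes} node(s) provide {available} devices"
        )
    return required


if __name__ == "__main__":
    print(check_layout(num_nodes=1, pp=2, tp=2, dp=2))   # LLaMA 2 13B, 8 HPUs
    print(check_layout(num_nodes=8, pp=2, tp=2, dp=16))  # LLaMA 2 13B, 64 HPUs
    print(check_layout(num_nodes=4, pp=4, tp=8, dp=1))   # LLaMA 2 70B, 32 HPUs
```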
    

LLaMA 2 training supports FP8 precision, which improves model performance. To enable FP8, set HL_USE_TRANSFORMER_ENGINE=1. Several FP8 parameters adjust model performance, accuracy, and memory utilization. It is not recommended to change the following default parameters, as they are set optimally:

  • HL_FP8_FORMAT=hybrid
  • HL_FP8_MARGIN=0
  • HL_FP8_AMAX_RECOMPUTE_ALGO=max
  • HL_FP8_AMAX_REDUCE=1
  • HL_FP8_MEASURE_INTERVAL=GBS/micro_batch_size/DP
  • HL_FP8_AMAX_HISTORY_LEN=GBS/micro_batch_size/DP
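
The last two defaults are given as the expression GBS/micro_batch_size/DP, which is the number of micro-batches each data-parallel rank processes per global batch. A small sketch of that arithmetic follows. The function name and the example values are hypothetical, and requiring an exact integer result is an assumption.

```python
def fp8_default_interval(gbs, micro_batch_size, dp):
    """Evaluate GBS / micro_batch_size / DP, requiring an exact integer result."""
    per_step = micro_batch_size * dp
    if gbs % per_step != 0:
        raise ValueError(f"GBS={gbs} is not divisible by micro_batch_size*DP={per_step}")
    return gbs // per_step


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the formula.
    print(fp8_default_interval(gbs=256, micro_batch_size=1, dp=8))
```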

The following parameters can be added to improve model performance while using FP8. Try adding them if you have enough memory:

  • HL_USE_CACHE_FP8_WEIGHT_FWD=1
  • HL_USE_CACHE_FP8_WEIGHT=1

  • Run LLaMA 2 70B on 32 HPUs with FP8 precision:

    HL_HOSTSFILE=scripts/hostsfile HL_LLAMA_MODEL_SIZE=70 HL_NUM_NODES=4 HL_PP=4 HL_TP=8 HL_DP=1 HL_CKP_ACT=0 HL_SEQ_LEN=4096 HL_MICRO_BATCH=1 HL_USE_TRANSFORMER_ENGINE=1 HL_USE_CACHE_FP8_WEIGHT_FWD=1 scripts/run_llama.sh
    
  • Run LLaMA 2 13B on 16 HPUs with FP8 precision:

    HL_HOSTSFILE=scripts/hostsfile HL_NUM_NODES=2 HL_PP=2 HL_TP=2 HL_DP=4 HL_CKP_ACT=2 HL_SEQ_LEN=4096 HL_ZERO_STAGE=1 HL_USE_FAST_SOFTMAX=1 HL_MICRO_BATCH=2 HL_GRAD_ACCUM_DTYPE=bf16 HL_USE_TRANSFORMER_ENGINE=1 HL_USE_CACHE_FP8_WEIGHT_FWD=1 HL_USE_CACHE_FP8_WEIGHT=1 scripts/run_llama.sh
    
  • Run LLaMA 2 7B on 8 HPUs with FP8 precision:

    HL_LLAMA_MODEL_SIZE=7 HL_NUM_NODES=1 HL_PP=1 HL_TP=1 HL_DP=8 HL_CKP_ACT=2 HL_SEQ_LEN=4096 HL_ZERO_STAGE=1 HL_USE_FAST_SOFTMAX=1 HL_MICRO_BATCH=1 HL_GRAD_ACCUM_DTYPE=bf16 HL_USE_TRANSFORMER_ENGINE=1 HL_USE_CACHE_FP8_WEIGHT_FWD=1 HL_USE_CACHE_FP8_WEIGHT=1 scripts/run_llama.sh
    
    

Mixtral Training and Examples

Multi-Card Training Examples

Configure the following for the Mixtral examples below:

  • Set the correct path for HL_DATA_DIR_ROOT.
  • Set the correct values for HL_TOKENIZER_TYPE and HL_DATA_FILE_PREFIX.
  • Add HL_DATA_CACHE_DIR and/or HL_TOKENIZER_MODEL if necessary.

Refer to Training Script Settings for details.

  • Run Mixtral 4x7b on 8 HPUs, Eager mode with torch.compile enabled, with BF16 precision, sequence length 256:

    PT_HPU_LAZY_MODE=0 \
    PT_ENABLE_INT64_SUPPORT=1 \
    PT_HPU_FORCE_EAGER_FALLBACK_OPS=rand \
    HL_MIXTRAL_MODEL='small' \
    HL_USE_TORCH_COMPILE=true \
    HL_DP=8 \
    HL_TP=1 \
    HL_MOE_EP=4 \
    HL_MOE_ENABLE_EXPERT_TP=1 \
    HL_LR_WARMUP_ITERS=1 \
    HL_GBS=128 \
    HL_MICRO_BATCH=16 \
    HL_ZERO_STAGE=1 \
    HL_CKP_ACT=1 \
    $MEGATRON_DEEPSPEED_ROOT/scripts/run_mixtral.sh
    
  • Run Mixtral 8x7b on 32 HPUs, Eager mode with torch.compile enabled, with BF16 precision, sequence length 8k:

    # Additional flags need to be passed in $HOME/.deepspeed_env
    echo 'PT_HPU_LAZY_MODE=0' > $HOME/.deepspeed_env
    echo 'PT_ENABLE_INT64_SUPPORT=1' >> $HOME/.deepspeed_env
    echo 'PT_HPU_FORCE_EAGER_FALLBACK_OPS=rand' >> $HOME/.deepspeed_env
    
    HL_HOSTSFILE=$MEGATRON_DEEPSPEED_ROOT/scripts/hostsfile \
    HL_USE_TORCH_COMPILE=true \
    HL_NUM_NODES=4 \
    HL_DP=32 \
    HL_TP=1 \
    HL_MOE_EP=8 \
    HL_MOE_ENABLE_EXPERT_TP=1 \
    HL_LR_WARMUP_ITERS=4 \
    HL_GBS=32 \
    HL_ZERO_STAGE=1 \
    HL_CKP_ACT=1 \
    HL_SEQ_LEN=8192 \
    $MEGATRON_DEEPSPEED_ROOT/scripts/run_mixtral.sh
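
In both Mixtral examples, HL_DP is a multiple of HL_MOE_EP (8 and 4, then 32 and 8). The sketch below checks that relationship for other settings. It rests on two assumptions that this README does not confirm: that expert-parallel groups are formed from data-parallel ranks, and that "4x7b" and "8x7b" denote 4 and 8 experts. The function name is hypothetical.

```python
def experts_layout(dp, moe_ep, num_experts):
    """Return (expert replicas across DP ranks, experts per EP rank).

    Assumes DP must be a multiple of EP and the expert count a multiple of EP.
    """
    if dp % moe_ep != 0:
        raise ValueError(f"DP={dp} is not a multiple of EP={moe_ep}")
    if num_experts % moe_ep != 0:
        raise ValueError(f"{num_experts} experts cannot be split evenly over EP={moe_ep}")
    return dp // moe_ep, num_experts // moe_ep


if __name__ == "__main__":
    print(experts_layout(dp=8, moe_ep=4, num_experts=4))   # 8 HPU example
    print(experts_layout(dp=32, moe_ep=8, num_experts=8))  # 32 HPU example
```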
    

Supported Configuration

Validated on | Intel Gaudi Software Version | PyTorch Version | Mode
Gaudi 2 | 1.16.1 | 2.2.2 | Training

Changelog

1.16.0

  • Added Mixtral model with Eager and torch.compile modes support. Lazy mode is not supported.
  • Rebased Megatron-DeepSpeed repository from PR#307 to PR#372.
  • Set the LLaMA 2 model as the default.
  • Added support for Zeroshot_gpt tasks using DeepSpeed 3D parallelism.
  • Added support for ALiBi positional embeddings in core attention only.
  • Added support for fast softmax. Currently disabled by default.
  • Added support for accumulation of gradients in BF16. Currently disabled by default.

1.15.0

  • Initial release.

Script Modifications

Major changes made to the original model from the microsoft/Megatron-DeepSpeed repository:

  • Changed README file content.
  • Changed the TFLOPs calculation.
  • Added HPU FP8 support.
  • Added Flash attention support via FusedSDPA for the HPU accelerator.
  • Added checkpoint verification.
  • Added kill-switch mechanism to gracefully stop training.

Known Issues

  • Only scripts and configurations mentioned in this README are supported and verified.
  • Checkpoint activation in full recompute mode is not supported together with FP8 mode.
