Launch a Llama2 pretraining job using neuronx-nemo-megatron

This tutorial explains how to run Llama2 pretraining jobs on AWS EC2 trn1.32xl instances using neuronx-nemo-megatron and AWS ParallelCluster.

neuronx-nemo-megatron (also known as "AWS Neuron Reference for NeMo Megatron") includes modified versions of the open-source packages NeMo and Apex that have been adapted for use with AWS Neuron and AWS EC2 Trn1 instances. neuronx-nemo-megatron allows for pretraining models with hundreds of billions of parameters across thousands of Trainium accelerators, and enables advanced training capabilities such as 3D parallelism, sequence parallelism, and activation checkpointing.

Prerequisites

Before proceeding with this tutorial, please follow these instructions to create a ParallelCluster consisting of 1 or more trn1.32xl or trn1n.32xl nodes. ParallelCluster automates the creation of trn1 clusters, and provides the SLURM job management system for scheduling and managing distributed training jobs. Please note that the home directory on your ParallelCluster head node will be shared with all of the worker nodes via NFS.
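
If you have not yet created a cluster, the ParallelCluster 3.x CLI is the usual entry point. The cluster name and configuration file below are placeholders; the linked instructions describe how to build a configuration containing trn1.32xl compute nodes.

# Sketch only -- cluster name and configuration file are placeholders
pcluster create-cluster \
    --cluster-name my-trn1-cluster \
    --cluster-configuration trn1-cluster-config.yaml

# Check provisioning status
pcluster describe-cluster --cluster-name my-trn1-cluster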

Install neuronx-nemo-megatron

With your trn1 ParallelCluster in place, begin by logging into the head node of your cluster using SSH. To provide access to TensorBoard (required in a later step), make sure that you enable port forwarding for TCP port 6006 when you log in, e.g.:

ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006
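
As an optional convenience (not part of the original steps), you can persist the key, user, and port forward in your SSH client configuration so that a plain ssh invocation sets up the tunnel. The host alias and file paths below are placeholders.

# ~/.ssh/config (placeholder values)
Host trn1-headnode
    HostName HEAD_NODE_IP_ADDRESS
    User ubuntu
    IdentityFile ~/.ssh/YOUR_KEY.pem
    LocalForward 6006 127.0.0.1:6006

# Then connect with:
ssh trn1-headnode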

Once logged into the head node, activate the provided PyTorch Neuron virtual environment that was created when you set up your ParallelCluster. Note: if your PyTorch Neuron environment is based on a Neuron release older than 2.11, please refer to the Neuron documentation for instructions on updating to Neuron 2.11 or later.

cd ~
source ./aws_neuron_venv_pytorch/bin/activate

Next, clone the neuronx-nemo-megatron repo to the head node:

cd ~
git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git
cd neuronx-nemo-megatron

Install the wheel Python package and run the build script to create the neuronx-nemo-megatron wheels:

pip3 install wheel
./build.sh

Install the neuronx-nemo-megatron packages and dependencies in your virtual environment:

pip3 install ./build/*.whl
pip3 install -r requirements.txt protobuf==3.20.3
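
As an optional sanity check (an addition to the original steps), you can confirm that the Neuron-adapted NeMo and Apex packages import cleanly inside the virtual environment; if this fails, revisit the installation steps above.

# Optional: verify the freshly installed packages are importable
python3 -c "import nemo, apex; print('NeMo', nemo.__version__)"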

Build the Megatron helper module

cd ~
python3 -c "from nemo.collections.nlp.data.language_modeling.megatron.dataset_utils import compile_helper; \
compile_helper()"

The above utility builds the dataset helper used by nemo.collections.nlp.data.language_modeling.megatron.dataset_utils. The expected output is shown below (the ERROR line can be safely ignored, since the head node has no Neuron devices):

2023-Aug-17 22:53:01.0674 47940:47940 ERROR  TDRV:tdrv_get_dev_info                       No neuron device available
[NeMo W 2023-08-17 22:53:03 optimizers:67] Could not import distributed_fused_adam optimizer from Apex
[NeMo W 2023-08-17 22:53:04 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.

Download the Llama2 dataset and tokenizer

This tutorial uses the book subset of the RedPajama dataset. The dataset can be downloaded to your cluster by running the following command on the head node:

wget https://data.together.xyz/redpajama-data-1T/v1.0.0/book/book.jsonl

Note: the dataset is approximately 50 GB and will take roughly 3-4 hours to download.

The above command downloads the raw dataset of around 50 GB, which needs to be tokenized using the Llama2 tokenizer. To obtain the tokenizer, you need to request access from Hugging Face and Meta via the link below:

Request the tokenizer and model weights from Hugging Face

Note: use of this model is governed by the Meta license. To download the model weights and tokenizer, please visit the above website and accept Meta's license before requesting access.

Once you have the tokenizer and the dataset, you can tokenize the dataset with the following command:

python nemo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=DATA_FOLDER/DATA.jsonl \
    --json-keys=text \
    --tokenizer-library=huggingface \
    --tokenizer-type=TOKENIZER_FOLDER/llama7b-hf \
    --dataset-impl=mmap \
    --output-prefix=DATA_FOLDER/DATA_tokenized \
    --append-eod \
    --need-pad-id \
    --workers=32

After tokenizing the dataset, note the paths to the tokenizer and to the tokenized dataset; both will be used for pretraining.
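
With the arguments above, the preprocessing script typically writes an indexed binary dataset as a pair of files named after --output-prefix and --json-keys; a quick listing (paths are placeholders) should show something like the following:

ls DATA_FOLDER/
# DATA_tokenized_text_document.bin   <- token ids
# DATA_tokenized_text_document.idx   <- document index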

Llama2 training configurations

We tested with the following model sizes: 7B, 13B, and 70B. The data parallel degree in each case follows from the node count and the other parallelism degrees, as shown in the worked calculation after the configurations below.

Llama2 7B

  • Model configuration

    • Attention heads: 32
    • Layers: 32
    • Sequence length: 4096
    • Hidden size: 4096
    • Hidden FFN size: 11008
    • Microbatch size: 1
    • Global batch size: 256
  • Distributed training configuration

    • Number of nodes: 4
    • Tensor parallel degree: 8
    • Pipeline parallel degree: 1
    • Data parallel degree: 16

Llama2 13B

  • Model configuration

    • Attention heads: 40
    • Layers: 40
    • Sequence length: 4096
    • Hidden size: 5120
    • Hidden FFN size: 13824
    • Microbatch size: 1
    • Global batch size: 1024
  • Distributed training configuration

    • Number of nodes: 4
    • Tensor parallel degree: 8
    • Pipeline parallel degree: 4
    • Data parallel degree: 4

Llama2 70B

  • Model configuration

    • Attention heads: 64
    • Layers: 80
    • Sequence length: 4096
    • Hidden size: 8192
    • Hidden FFN size: 28672
    • Microbatch size: 1
    • Global batch size: 512
  • Distributed training configuration

    • Number of nodes: 8
    • Tensor parallel degree: 8
    • Pipeline parallel degree: 16
    • Data parallel degree: 2
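
The data parallel degree above is not an independent choice: with 32 NeuronCores per trn1.32xl node, it follows from the node count and the tensor and pipeline parallel degrees.

# data_parallel = (nodes x 32 NeuronCores per node) / (tensor_parallel x pipeline_parallel)
# Llama2 7B : (4 x 32) / (8 x 1)  = 16
# Llama2 13B: (4 x 32) / (8 x 4)  = 4
# Llama2 70B: (8 x 32) / (8 x 16) = 2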

Pre-compile the model

By default, PyTorch Neuron uses a just in time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed).

An alternative to the JIT flow is to use the included neuron_parallel_compile command to perform ahead of time (AOT) compilation. In the AOT compilation flow, the compute graphs are first identified and extracted during a short simulated training run, and the extracted graphs are then compiled and cached using parallel compilation, which is considerably faster than the JIT flow.
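
The compile.slurm script used in the next steps invokes the AOT flow for you. For reference, outside of this tutorial's wrappers, the general pattern is simply to prefix the usual training command with neuron_parallel_compile, which runs the short graph-extraction pass and then compiles the extracted graphs in parallel.

# General pattern (illustrative; compile.slurm handles this in the tutorial)
neuron_parallel_compile <your usual training command>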

Before starting the compilation, update the dataset and tokenizer paths in the test_llama.sh script (for Llama2 7B and 13B pretraining) or test_llama_gqa.sh (for Llama2 70B pretraining), as shown below:

cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling

# For llama 7b and 13b
vi test_llama.sh

# For llama 70b 
vi test_llama_gqa.sh

# Update the below lines
# For tokenizer
model.tokenizer.type='PATH_TO_LLAMA_TOKENIZER/' \

# For Dataset
model.data.data_prefix=[1.0,PATH_TO_TOKENIZED_DATASET/books/book.jsonl-processed_text_document] \
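
After editing, an optional quick grep (an addition to the original steps) confirms that both overrides now point at real paths rather than placeholders:

# Use test_llama_gqa.sh instead for the 70B configuration
grep -n "tokenizer.type\|data_prefix" test_llama.sh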

Run the following commands to launch an AOT pre-compilation job on your ParallelCluster:

cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
sbatch --nodes 4 compile.slurm ./llama_7b.sh

For compiling Llama2 13B, run the following commands:

cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
sbatch --nodes 4 compile.slurm ./llama_13b.sh

For compiling Llama2 70B, run the following commands:

cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
sbatch --nodes 32 compile.slurm ./llama_70b.sh

Note: for the 70B model, --nodes 32 is used instead of 4.

Once you have launched the precompilation job, run the squeue command to view the SLURM job queue on your cluster. If you have not recently run a job on your cluster, it may take 4-5 minutes for the requested trn1.32xlarge nodes to be launched and initialized. Once the job is running, squeue should show output similar to the following:

    JOBID  PARTITION  NAME           USER    ST  TIME  NODES NODELIST(REASON)
    10     compute1   compile.slurm  ubuntu  R   5:11  4     compute1-dy-queue1-i1-[1-4]

You can view the output of the precompilation job by examining the file named slurm-compile.slurm-ZZ.out, where ZZ represents the JOBID of your job in the squeue output above, e.g.:

tail -f slurm-compile.slurm-10.out

Once the precompilation job is complete, you should see a message similar to the following in the logs:

2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total graphs: 22
2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total successful compilations: 22
2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0

At this point, you can press CTRL-C to exit the tail command.
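
The compiled graphs are stored in the Neuron persistent compiler cache, typically under /var/tmp/neuron-compile-cache unless your scripts point the cache elsewhere, and subsequent training runs pick them up automatically. You can optionally inspect the cache:

ls /var/tmp/neuron-compile-cache   # location may differ if the cache URL is overridden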

Launch a pretraining job

The Llama2 pretraining job can be launched in the same manner as the precompilation job described above. In this case, we change the SLURM script from compile.slurm to run.slurm, but the other parameters remain the same:

cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
sbatch --nodes 4 run.slurm ./llama_7b.sh

For Llama2 13B, run the following commands:

cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
sbatch --nodes 4 run.slurm ./llama_13b.sh

For Llama2 70B, run the following commands:

cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
sbatch --nodes 32 run.slurm ./llama_70b.sh

Note: for the 70B model, --nodes 32 is used instead of 4.

As outlined above, you can again use the squeue command to view the job queue. Once you see that your pretraining job is running, you can view the output of the training job by examining the file named slurm-run.slurm-ZZ.out where ZZ represents the JOBID of your job:

tail -f slurm-run.slurm-11.out

Once the model is loaded onto the Trainium accelerators and training has commenced, you will begin to see output indicating the job progress:

Epoch 0:  22%|██▏       | 4499/20101 [22:26:14<77:48:37, 17.95s/it, loss=2.43, v_num=5563, reduced_train_loss=2.470, gradient_norm=0.121, parameter_norm=1864.0, global_step=4512.0, consumed_samples=1.16e+6, iteration_time=16.40]
Epoch 0:  22%|██▏       | 4500/20101 [22:26:32<77:48:18, 17.95s/it, loss=2.43, v_num=5563, reduced_train_loss=2.470, gradient_norm=0.121, parameter_norm=1864.0, global_step=4512.0, consumed_samples=1.16e+6, iteration_time=16.40]
Epoch 0:  22%|██▏       | 4500/20101 [22:26:32<77:48:18, 17.95s/it, loss=2.44, v_num=5563, reduced_train_loss=2.450, gradient_norm=0.120, parameter_norm=1864.0, global_step=4512.0, consumed_samples=1.16e+6, iteration_time=16.50]

Monitor training

TensorBoard

In addition to the text-based job monitoring described in the previous section, you can also use standard tools such as TensorBoard to monitor training job progress. To view an ongoing training job in TensorBoard, you first need to identify the experiment directory associated with your ongoing job. This will typically be the most recently created directory under ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama. Once you have identified the directory, cd into it, and then launch TensorBoard:

cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama
ls -alt|head
# Identify the correct experiment directory in the
# output of the ls command, ex: 2023-06-10_00-22-42
cd YOUR_EXPERIMENT_DIR  # <- replace this with your experiment directory
tensorboard --logdir ./
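
If you prefer not to identify the experiment directory by hand, a small shell convenience (an assumption about the directory layout, not part of the original steps) is to point TensorBoard at the most recently modified experiment directory:

cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama
tensorboard --logdir "$(ls -td -- */ | head -1)"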

With TensorBoard running, you can then view the TensorBoard dashboard by browsing to http://localhost:6006 on your local machine. If you cannot access TensorBoard at this address, please make sure that you have port-forwarded TCP port 6006 when SSH'ing into the head node, e.g.: ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006

neuron-top / neuron-monitor / neuron-ls

The neuron-top tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization, and loaded graphs on a per-node basis. To use neuron-top during an ongoing training job, first SSH into one of your compute nodes from the head node, and then run neuron-top:

ssh compute1-dy-queue1-i1-1  # to determine which compute nodes are in use, run the squeue command
neuron-top

Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as neuron-monitor and neuron-ls to capture performance/utilization statistics and to understand NeuronCore allocation.
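
For example (from an active compute node), neuron-ls enumerates the Neuron devices on that node, and neuron-monitor emits periodic utilization metrics as JSON that can be redirected to a file for later analysis. The output filename below is just an example.

neuron-ls                               # list Neuron devices / NeuronCores on this node
neuron-monitor > neuron-metrics.json    # stream JSON metrics; press CTRL-C to stop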

Key Features

  • Llama2 pretraining also works with the ZeRO optimizer, although it is not enabled by default. To reduce memory pressure, you can enable it by adding the hyperparameter below to your run script:
cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/

# For llama 7b and 13b 
vi test_llama.sh

# For llama 70b
vi test_llama_gqa.sh

# Add the below line in the run script :
model.wrap_with_zero=True \

Known issues/limitations

  • The initial release of neuronx-nemo-megatron supports Llama2 pretraining only. Model evaluation will be available in a future release.
  • neuronx-nemo-megatron currently requires pytorch-lightning v1.8.6
  • Llama2-70B: tested and validated on 8 nodes; scaling beyond 8 nodes may encounter memory issues.

Troubleshooting guide

See Troubleshooting Guide for AWS ParallelCluster for more details and fixes to common issues.