Packages for neuron
aws-neuronx-collectives 2.19.7.0-530fb3064 amd64 neuron_ccom built using CMake
aws-neuronx-dkms 2.15.9.0 amd64 aws-neuronx driver in DKMS format.
aws-neuronx-oci-hook 2.2.45.0 amd64 neuron_oci_hook built using CMake
aws-neuronx-runtime-lib 2.19.5.0-97e2d271b amd64 neuron_runtime built using CMake
aws-neuronx-tools 2.16.1.0 amd64 Neuron profile and debug tools
Setup
Cluster type
OpenMPI
Head Node
The head node is of type trn1.2xlarge and was created using this CFN template, with EFS and FSx file systems enabled
Cluster Nodes
The cluster nodes were of type trn1.32xlarge and were created using this CFN template.
Open MPI Launch script
This is the launch script after running neuron_parallel_compile:
#!/usr/bin/env bash
set -o pipefail
set -e
ulimit -n 65535
export FI_EFA_USE_DEVICE_RDMA=1
export FI_PROVIDER=efa
export FI_EFA_FORK_SAFE=1
if [ -v SLURM_NNODES ]
then
    # SLURM runs
    sudo sysctl -w net.ipv4.ip_local_reserved_ports=41000
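    # Resolve each SLURM hostname to an IP address; the first host in the list becomes MASTER_ADDR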
IPS=""
for h in $(scontrol show hostname); do
IPS="$IPS $(nslookup $h | awk '/^Address: / { print $2 }')";
done
HOSTS=(${IPS//\ / })
NODEID=$SLURM_NODEID
NTASKS=$SLURM_NTASKS
export MASTER_ADDR=${HOSTS[0]}
export NEMO_EXPM_VERSION=$SLURM_JOB_ID
export EXPLICIT_LOGDIR=null
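    # Default SLURM_RESTART_COUNT to 0 if it is not already set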
    : ${SLURM_RESTART_COUNT:=0}
    LOG_PATH=logs/$SLURM_JOB_ID/$SLURM_RESTART_COUNT/$NODEID/
    mkdir -p $LOG_PATH
    export NEURON_COMPILE_CACHE_URL="$HOME/neuron_cache" # Place cache on shared storage to reduce redundant compilations
    # Make sure to install latest runtime
    ./setup.sh 2>&1 | tee $LOG_PATH/setup.log
elif [ -v OMPI_COMM_WORLD_RANK ]
then
    # MPI
    [[ -z $MASTER_ADDR ]] && echo "MASTER_ADDR is not set" && exit 1
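    # Query instance metadata (IMDSv2) for the primary MAC address and map it to the network interface name CCOM should use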
    TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
    PRIMARY_MAC=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/latest/meta-data/mac)
    export CCOM_SOCKET_IFNAME=$(ip -o link show | grep -F "link/ether $PRIMARY_MAC" | awk -F'[ :]+' '{print $2}')
    NODEID=$OMPI_COMM_WORLD_RANK
    NTASKS=$OMPI_COMM_WORLD_SIZE
    export EXPLICIT_LOGDIR=$LOGS_DIR
    LOG_PATH=$LOGS_DIR/$NODEID/
    mkdir -p $LOG_PATH
    export NEURON_COMPILE_CACHE_URL=$CACHE_DIR/$NODEID # Place cache on shared storage to reduce redundant compilations
else
    # Single-node, non-SLURM, non-MPI runs
    HOSTS=(localhost)
    NODEID=0
    NTASKS=1
    export MASTER_ADDR=${HOSTS[0]}
    export NEMO_EXPM_VERSION=$(date "+%Y-%m-%d_%H-%M-%S")
    export EXPLICIT_LOGDIR=null
    LOG_PATH=./nemo_experiments/logs
    mkdir -p $LOG_PATH
fi
export HYDRA_FULL_ERROR=1
export PROCESSES_PER_NODE=32
export MASTER_PORT=41000
export NEURON_RT_EXEC_TIMEOUT=10
export DISTRIBUTED_ARGS="--nproc_per_node $PROCESSES_PER_NODE --nnodes $NTASKS --node_rank $NODEID --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
echo $DISTRIBUTED_ARGS
export BUCKET_CAP_MB=1024
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=5
export NEURON_TRANSFER_WITH_STATIC_RING_OPS=""
export MALLOC_ARENA_MAX=128
export TF_NUM_INTEROP_THREADS=1024
export XLA_THREAD_POOL_SIZE=4
export XLA_IO_THREAD_POOL_SIZE=4
export NEURON_RT_STOCHASTIC_ROUNDING_EN=1
#training_precision is one of 'bf16SR', 'megatron_amp_O2', 'fp32_OptStates'
#training_precision = "bf16SR", uses BF16 + Stochastic Rounding
#training_precision = "megatron_amp_O2", master weights and optimizer states are stored in fp32, model weights in bf16
#training_precision = "fp32_OptStates", optimizer states are stored in fp32, model weights in bf16
training_precision="bf16SR"
if [[ $training_precision == "bf16SR" ]]; then
    echo using BF16 SR
    export XLA_USE_BF16=1
    export NEURON_CC_FLAGS="--model-type transformer --distribution-strategy=nemo --enable-mixed-precision-accumulation"
    export OPTIM_NAME=adamw
    export megatron_amp_O2=false
elif [[ $training_precision == "megatron_amp_O2" ]]; then
    echo using megatron_amp_O2
    export XLA_DOWNCAST_BF16=1
    export NEURON_CC_FLAGS="--model-type transformer --distribution-strategy=nemo --enable-mixed-precision-accumulation"
    export OPTIM_NAME=adamw
    export megatron_amp_O2=true
elif [[ $training_precision == "fp32_OptStates" ]]; then
    echo using FP32 Optimizer States
    export XLA_DOWNCAST_BF16=1
    export NEURON_CC_FLAGS="--model-type transformer --distribution-strategy=nemo --enable-mixed-precision-accumulation"
    export OPTIM_NAME=adamw_fp32OptState
    export megatron_amp_O2=false
else
    echo Incorrect Training Precision Provided
fi
export CREATE_TB_LOGGER=True
export CHECKPOINT_CALLBACK=True
if [ "$COMPILE" = "1" ]; then
echo "compiling only run"
MAYBE_COMPILE="neuron_parallel_compile"
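    # During the compile-only pass, run just a few iterations and skip the TensorBoard logger and checkpoint callback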
    export TRAIN_ITERS=3
    CREATE_TB_LOGGER=False
    CHECKPOINT_CALLBACK=False
    export MASTER_PORT=41001
fi
Steps to reproduce
Connect to the head node using the DCV client. Verify that EFS is mounted under ~/efs and the FSx for Lustre file system is mounted under ~/fsx.
Set up an SSH key on the head node so that Open MPI can ssh to the cluster nodes: add the private key to ~/.ssh/id_rsa and set ~/.ssh/config as follows:
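A minimal ~/.ssh/config sketch (the Host pattern and options below are assumptions; adjust them to your VPC subnet and key file):
Host *
    IdentityFile ~/.ssh/id_rsa
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null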
Activate the virtual environment: source ~/aws_neuron_nemo_megatron/bin/activate
Create the shared directories on EFS:
sudo mkdir -p ~/efs/git; sudo chown -R ubuntu:ubuntu ~/efs/git
sudo mkdir -p ~/efs/examples_datasets/gpt2/; sudo chown -R ubuntu:ubuntu ~/efs/examples_datasets/gpt2/
Prepare the GPT2 data under ~/efs/examples_datasets/gpt2/
cd ~/efs/git; git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git
cd ~/efs/git/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
Create an Open MPI hostfile listing the four cluster nodes with slots=1 for each node. Set the path to the hostfile in the environment variable export HOSTFILE= and the IP address of one of the cluster nodes in the environment variable export MASTER_ADDR=, for example as sketched below.
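A sketch with placeholder values (the IP addresses and the hostfile path are assumptions; substitute the private IPs of the four trn1.32xlarge nodes and the real path):
10.0.1.10 slots=1
10.0.1.11 slots=1
10.0.1.12 slots=1
10.0.1.13 slots=1
export HOSTFILE=~/hostfile
export MASTER_ADDR=10.0.1.10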
After installing the new shell scripts, or replacing the existing ones noted above, run: ./pretrain_openmpi.sh gpt_23b.sh 1>/tmp/a.out 2>&1 &
This GPT-3 23B pretraining tutorial crashes after 12-18 hours of pre-training, specifically on the Ubuntu 22.04 stack; it works on the Ubuntu 20.04 stack.
Not all processes in the cluster crash, but one or more processes across the 4-node cluster crash with the same error.
Key error stack trace is as follows: