Commit f94680d

Merge pull request #34 from incredere/main
Add llama3-1-70b-2node-bf16-seq8192-gbs2048 recipes on A4
2 parents 4380734 + 63fb536

10 files changed: +859 −0 lines changed
Lines changed: 20 additions & 0 deletions
```yaml
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
```
Lines changed: 153 additions & 0 deletions
<!-- mdformat global-off -->
# Pretrain llama3-1-70b-seq8192-gbs2048-mbs1-gpus16 workloads on a4 GKE Node pools with Nvidia NeMo Framework

This recipe outlines the steps for running a llama3-1-70b-seq8192-gbs2048-mbs1-gpus16 pretraining
workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to
  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the
  [NeMo pretraining workload](https://github.com/NVIDIA/nemo).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster - Follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4)
  to create your a4 GKE cluster.

## Training dataset

This recipe uses a mock pretraining dataset provided by the NeMo framework.

## Docker container image

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.0`

## Run the recipe

From your client workstation, complete the following steps:
### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: the path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Verify the name of the local queue in your cluster.

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```
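For instance, a filled-in configuration might look like the following. Every value here is a hypothetical placeholder; substitute your own:

```shell
# Hypothetical values for illustration only -- replace with your own.
export PROJECT_ID=my-gcp-project
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=a4-cluster
export GCS_BUCKET=my-training-logs   # bucket name only, no gs:// prefix
export KUEUE_NAME=a4

# The bare bucket name composes into a full URI wherever one is needed:
echo "gs://${GCS_BUCKET}"
```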
### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder:

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b-seq8192-gbs2048-mbs1-gpus16/nemo-pretraining-gke/2_nodes
cd $RECIPE_ROOT
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
### Configure and submit a pretraining job

#### Using 2 nodes (16 GPUs) with bf16 precision

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4-llama3-1-70b
helm install $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME}
```

**Examples**

- To set the number of training steps to 100, run the following command from
  your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-a4-llama3-1-70b
  helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} \
    --set workload.arguments[0]="trainer.max_steps=100"
  ```
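Repeated `--set` flags can also be collected in a separate values file and passed with an additional `-f` flag, since Helm merges later values files over earlier ones. A hypothetical `my-overrides.yaml` (key names are copied from the `--set` flags above, not verified against the chart source):

```yaml
# my-overrides.yaml -- hypothetical; key names mirror the --set flags above.
queue: a4
workload:
  image: nvcr.io/nvidia/nemo:25.07
```

You would then run `helm install $WORKLOAD_NAME . -f values.yaml -f my-overrides.yaml ...` with the remaining flags. Keep list-valued settings such as `volumes.gcsMounts[0].bucketName` on `--set`: Helm replaces lists from later values files wholesale rather than merging individual entries.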
### Monitor the job

To check the status of the pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-a4-llama3-1-70b`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-a4-llama3-1-70b-workload-0-0-s9zrv`.
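As a sketch of that naming convention, the rank 0 pod can be filtered out of a pod listing by the `-workload-0-0-` infix. The pod names below are hypothetical; the trailing suffix is randomly generated:

```shell
# Hypothetical pod names for a 2-node job, as listed by kubectl.
pods="alice-a4-llama3-1-70b-workload-0-0-s9zrv
alice-a4-llama3-1-70b-workload-0-1-x7k2p"

# Keep only the rank 0 pod (node index 0 of replicated job 0).
rank0_pod=$(echo "$pods" | grep -- "-workload-0-0-")
echo "$rank0_pod"
```

On a live cluster, the same filter can feed `kubectl logs` directly, for example `kubectl logs $(kubectl get pods -o name | grep -- -workload-0-0-)`.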
### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4-llama3-1-70b
```
Lines changed: 105 additions & 0 deletions
```bash
usage()
{
cat << EOF
usage: bash ./launcher.sh [config-override [config-override ...]]
       config-override  (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000.
EOF
}

parse_args() {
  while [ "$1" != "" ]; do
    case $(grep -o "=" <<< "$1" | wc -l) in
      1 )
        config_overrides+=("$1")
        ;;
      * )
        echo "Invalid config override: $1"
        usage
        exit 1
    esac
    shift
  done
  config_overrides="${config_overrides[*]}"
}

config_overrides=()
parse_args "$@"

if [ -z "${config_overrides}" ]; then
  echo "No NeMo config overrides specified"
else
  echo "NeMo config overrides:"
  echo "  ${config_overrides}"
fi

export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
ldconfig $LD_LIBRARY_PATH
echo "Added $LD_LIBRARY_PATH to ldconfig:"
ldconfig -p | grep libcuda | sed 's/^/  /'
echo ""

if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
  explicit_log_dir=${EXPLICIT_LOG_DIR}
else
  explicit_log_dir=workload_logs
fi
echo "Logging to ${explicit_log_dir}"

if [[ -n "${TOKENIZER_PATH}" ]]; then
  echo "Getting tokenizer files"
  cp ${TOKENIZER_PATH}/* .
  echo ""
fi

echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"

# Update NeMo-Run so we can export the config.
pip install git+https://github.com/NVIDIA/NeMo-Run.git@6550ff68204e5095452098eed3765ed765de5d33
pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger

# Export the NeMo 2.0 config to YAML.
python ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
  trainer.num_nodes="$NNODES" \
  log.explicit_log_dir="${explicit_log_dir}" \
  trainer.max_steps=25 \
  trainer.num_nodes=2 \
  trainer.devices=8 \
  ${config_overrides} \
  --to-yaml exported_nemo_config.yaml

# Create the nsys directory.
mkdir -p ${explicit_log_dir}/nsys

OMP_NUM_THREADS=12 NSYS_CONFIG_DIRECTIVES="AgentLaunchTimeoutSec=240;AppLaunchTimeoutSec=240" TORCH_NCCL_ENABLE_MONITORING=0 \
/usr/local/bin/nsys profile -s none -t nvtx,cuda --capture-range=cudaProfilerApi --capture-range-end=stop \
  -o ${explicit_log_dir}/nsys/noderank-${JOB_COMPLETION_INDEX} \
  --session-new "nemo-rank${JOB_COMPLETION_INDEX}"-$RANDOM \
  --wait all \
  torchrun \
  --nproc-per-node="${GPUS_PER_NODE}" \
  --nnodes="${NNODES}" \
  --node_rank="${JOB_COMPLETION_INDEX}" \
  --rdzv_id="${JOB_IDENTIFIER}" \
  --master_addr="${MASTER_ADDR}" \
  --master_port="${MASTER_PORT}" \
  ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
  trainer.num_nodes="$NNODES" \
  log.explicit_log_dir="${explicit_log_dir}" \
  trainer.max_steps=25 \
  trainer.num_nodes=2 \
  trainer.devices=8 \
  ${config_overrides}

if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
  mkdir -p ${ARTIFACT_DIR}
  cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/
  cp ${NEMO_LAUNCH_SCRIPT} ${ARTIFACT_DIR}/run-cli.py
  cp dllogger.json ${ARTIFACT_DIR}/dllogger.json
  cp exported_nemo_config.yaml ${ARTIFACT_DIR}/nemo-configuration.yaml
  env > ${ARTIFACT_DIR}/environ.txt
  ls ${ARTIFACT_DIR}
fi
echo "Training completed"
echo "Pod on $(hostname --fqdn) is exiting"
```
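The override validation in `parse_args` boils down to counting `=` characters: an argument is accepted only if it contains exactly one. A minimal standalone sketch of that check:

```shell
# Accept an argument only if it contains exactly one '='
# (the same test launcher.sh applies to each config override).
is_valid_override() {
  [ "$(grep -o "=" <<< "$1" | wc -l)" -eq 1 ]
}

is_valid_override "trainer.max_steps=100" && echo "valid"
is_valid_override "trainer.max_steps" || echo "invalid"
```

Arguments with zero `=` signs (or more than one) are rejected, which is why each override must be passed as a single `key=value` token.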
