<!-- mdformat global-off -->
# Pretrain llama3-1-70b-seq8192-gbs2048-mbs1-gpus16 workloads on A4 GKE node pools with the NVIDIA NeMo Framework

This recipe outlines the steps for running a llama3-1-70b-seq8192-gbs2048-mbs1-gpus16 pretraining
workload on [A4 GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to
  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the
  [NeMo pretraining workload](https://github.com/NVIDIA/nemo).

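If you want to confirm that your cluster exposes the APIs this chart relies on, you can check for the JobSet and Kueue custom resources after fetching cluster credentials (see the "Get cluster credentials" step below). This is an optional sketch; the CRD names are the upstream JobSet and Kueue defaults and may differ if your cluster was set up differently.

```bash
# Optional: run after "Get cluster credentials" below.
# CRD names assume the default JobSet and Kueue installations.
kubectl get crd jobsets.jobset.x-k8s.io
kubectl get crd localqueues.kueue.x-k8s.io
# List the Kueue local queues, for example the default `a4` queue.
kubectl get localqueue
```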

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster - please follow the Cluster Toolkit
  [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4)
  to create your A4 GKE cluster.

## Training dataset

This recipe uses a mock pretraining dataset provided by the NeMo framework.

## Docker container images

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.0`

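The GKE nodes pull these images automatically when the job starts, so no local action is required. If you want to verify that you can reach both registries from your workstation (optional, and assuming Docker and the gcloud CLI are installed), a quick check might look like this:

```bash
# Optional: authenticate Docker for Artifact Registry, then pull both images.
gcloud auth configure-docker us-docker.pkg.dev
docker pull nvcr.io/nvidia/nemo:25.07
docker pull us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.0
```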

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster.

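For illustration, a filled-in configuration might look like the following. All values here are hypothetical placeholders; substitute your own.

```bash
# Hypothetical example values; replace with your own.
export PROJECT_ID=my-gcp-project
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=my-a4-cluster
export GCS_BUCKET=my-training-logs-bucket # no gs:// prefix
export KUEUE_NAME=a4
```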

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

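You can optionally confirm that the expected project is active:

```bash
# Prints the project ID currently configured for gcloud.
gcloud config get-value project
```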

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b-seq8192-gbs2048-mbs1-gpus16/nemo-pretraining-gke/2_nodes
cd $RECIPE_ROOT
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

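To confirm that `kubectl` now points at the right cluster, you can list the nodes. This is an optional check; the accelerator label used below is the standard GKE node label.

```bash
# List all cluster nodes.
kubectl get nodes
# Optionally show only the GPU nodes, using the standard GKE accelerator label.
kubectl get nodes -l cloud.google.com/gke-accelerator
```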

### Configure and submit a pretraining job

#### Using 2 nodes (16 GPUs) with bf16 precision

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4-llama3-1-70b
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```

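After submitting, you can optionally confirm that the release was created and that the JobSet and its pods were admitted. This is a rough sketch; the `jobset` resource type is available only if the JobSet CRD is installed in your cluster.

```bash
# Check the Helm release, the JobSet, and the workload pods.
helm list --filter $WORKLOAD_NAME
kubectl get jobset | grep $WORKLOAD_NAME
kubectl get pods | grep $WORKLOAD_NAME
```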

**Examples**

-   To set the number of training steps to 100, run the following command from
    your client:

    ```bash
    cd $RECIPE_ROOT
    export WORKLOAD_NAME=$USER-a4-llama3-1-70b
    helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} \
    --set workload.arguments[0]="trainer.max_steps=100"
    ```
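-   To preview the Kubernetes manifests the chart would create without
    submitting the job, you can render them locally with `helm template`. This
    is an optional sketch that mirrors the install command above; the output
    file name is arbitrary:

    ```bash
    cd $RECIPE_ROOT
    export WORKLOAD_NAME=$USER-a4-llama3-1-70b
    helm template $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} > rendered-manifests.yaml
    ```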

### Monitor the job

To check the status of the pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-a4-llama3-1-70b`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of a pod returned by the previous command.

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-a4-llama3-1-70b-workload-0-0-s9zrv`.

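As a convenience, the following optional sketch looks up the rank 0 pod and streams its logs. It assumes the workload name prefix used in the examples above (`$USER-a4-llama3-1-70b`); adjust the prefix if you chose a different name.

```bash
# Find the rank 0 pod (first replica, first worker) and follow its logs.
# The prefix assumes WORKLOAD_NAME=$USER-a4-llama3-1-70b as in the examples above.
RANK0_POD=$(kubectl get pods -o name | grep "$USER-a4-llama3-1-70b-workload-0-0" | head -n 1)
kubectl logs -f "${RANK0_POD}"
```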

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4-llama3-1-70b
```
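To confirm that the release and its pods were removed (pods can take a short time to terminate), you can optionally run:

```bash
# Both commands should report that nothing matching the workload remains.
helm list | grep $USER-a4-llama3-1-70b || echo "Helm release removed"
kubectl get pods | grep $USER-a4-llama3-1-70b || echo "No workload pods remaining"
```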