Added profiling docs for Polaris and Aurora #576

Open
wants to merge 3 commits into main
101 changes: 101 additions & 0 deletions docs/aurora/data-science/profiling_dl.md
@@ -0,0 +1,101 @@
# Profiling Deep Learning Applications

On Aurora we can use the `unitrace` profiler from Intel to profile Deep
Learning applications. Refer to the
[unitrace documentation page](https://github.com/intel/pti-gpu/tree/master/tools/unitrace)
for details.

## Example Usage

We can use `unitrace` to trace an application running on multiple ranks and
multiple nodes. A simple example is shown below, where we use a wrapper script
to trace rank 0 on each node of a 4-node job running a PyTorch application:

### A `unitrace` wrapper
```
#!/bin/bash
## This wrapper is used with unitrace to trace a job running on any number of nodes.
## In this example it is set up to trace rank 0 on each of the first 4 nodes, even when
## the job is running on more than 4 nodes.
FNAME_EXT=$(basename "$2")
FNAME="${FNAME_EXT%%.*}"

NNODES=`wc -l < $PBS_NODEFILE`

WORK_DIR=/path/to/the/Python/program
UNITRACE_DIR=/opt/aurora/24.180.1/support/tools/pti-gpu/063214e
UNITRACE_LIB=${UNITRACE_DIR}/lib64
UNITRACE_BIN=${UNITRACE_DIR}/bin
UNITRACE_EXE=${UNITRACE_BIN}/unitrace
DTAG=$(date +%F_%H%M%S)
UNITRACE_OUTDIR=${WORK_DIR}/logs/unitrace_profiles/name_of_choice_json_n${NNODES}_${DTAG}/${FNAME}_n${NNODES}_${DTAG}
mkdir -p ${UNITRACE_OUTDIR}
UNITRACE_OPTS=" --ccl-summary-report --chrome-mpi-logging --chrome-sycl-logging \
--chrome-device-logging \
--chrome-ccl-logging --chrome-call-logging --chrome-dnn-logging --device-timing --host-timing \
--output-dir-path ${UNITRACE_OUTDIR} --output ${UNITRACE_OUTDIR}/UNITRACE_${FNAME}_n${NNODES}_${DTAG}.txt "


export LD_LIBRARY_PATH=${UNITRACE_LIB}:${UNITRACE_BIN}:$LD_LIBRARY_PATH

# Use $PMIX_RANK for MPICH and $SLURM_PROCID with srun.
PROFRANK=0
RANKCUTOFF=48

if [[ $PALS_LOCAL_RANKID -eq $PROFRANK ]] && [[ $PMIX_RANK -lt $RANKCUTOFF ]]; then
echo "On rank $PMIX_RANK, collecting traces "
$UNITRACE_EXE $UNITRACE_OPTS "$@"
else
"$@"
fi

```
There are a few important things to notice in the wrapper.

- `UNITRACE_DIR`: This is the main `unitrace` directory, which may change after
an update to the programming environment.

- `UNITRACE_OPTS`: These are the options that `unitrace` uses to trace data at
different levels. The size of the output profiles varies with the number of
options; enabling more options usually leads to a larger profile (in terms of
storage in MB).

- `PROFRANK`: As implemented, this variable is set by the user to select the
local rank to trace. For example, this wrapper traces rank 0 on each node.

- `RANKCUTOFF`: This variable is Aurora specific. Since we can run as many as 12
ranks per node (without using CCS), the first 4 nodes of a job host 48 ranks.
This value is the upper cutoff on the global rank index, beyond which `unitrace`
does not trace any rank. A user can change this number, based on the maximum
number of ranks per node, to control how many ranks are traced; a short sketch
of adjusted values follows this list. `unitrace` produces one profile (a `json`
file, by default) per traced rank.

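As a concrete illustration, the sketch below (values chosen for illustration only)
adjusts the wrapper to trace rank 0 on each of the first 2 nodes only
(12 ranks per node, so global ranks 0-23) and trims `UNITRACE_OPTS` to a smaller
subset of the options already used above, which keeps the output profiles smaller:

```
# Hypothetical adjustment: trace local rank 0 on the first 2 nodes only
# (12 ranks per node -> global ranks 0-23).
PROFRANK=0
RANKCUTOFF=24
# A reduced option set (a subset of the options shown above) for smaller profiles.
UNITRACE_OPTS=" --ccl-summary-report --chrome-device-logging --device-timing --host-timing \
    --output-dir-path ${UNITRACE_OUTDIR} \
    --output ${UNITRACE_OUTDIR}/UNITRACE_${FNAME}_n${NNODES}_${DTAG}.txt "
```
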
### Deployment

The wrapper above can be deployed using a PBS job script in the following way:

```
#!/bin/bash -x
#PBS -l select=4
#PBS -l place=scatter
#PBS -l walltime=00:10:00
#PBS -q workq
#PBS -A Aurora_deployment

WORK_DIR=/path/to/the/Python/program
UNITRACE_WRAPPER=${WORK_DIR}/unitrace_wrapper.sh

# MPI and OpenMP settings
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS_PER_NODE=12

let NRANKS=${NNODES}*${NRANKS_PER_NODE}

module load frameworks/2024.2.1_u1

mpiexec --pmi=pmix -n ${NRANKS} -ppn ${NRANKS_PER_NODE} -l --line-buffer \
${UNITRACE_WRAPPER} python ${WORK_DIR}/application.py
```

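Assuming the job script above is saved as `submit_unitrace.sh` (the file name is
arbitrary), the job can be submitted from a login node with `qsub`:

```
# Submit the 4-node profiling job; the script name here is only an example.
qsub submit_unitrace.sh
# After the job completes, the traces appear under ${WORK_DIR}/logs/unitrace_profiles/.
```
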
250 changes: 250 additions & 0 deletions docs/polaris/data-science/profiling_dl.md
@@ -0,0 +1,250 @@
# Profiling Deep Learning Applications

We can use both the framework-native profiler (for example, the PyTorch profiler)
and the vendor-specific
[Nsys profiler](https://developer.nvidia.com/nsight-systems/get-started) to get
high-level profiling information and an execution timeline for an application.
For kernel-level information, we can use the
[Nsight Compute profiler](https://developer.nvidia.com/tools-overview/nsight-compute/get-started).
Refer to the respective documentation for more details:

[Nsight System User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html)

[Nsight Compute Documentation](https://docs.nvidia.com/nsight-compute/)

[Nsight Compute CLI](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html)

[PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)

## Example Usage

At a high level, the usage of the `nsys` or `ncu` profiler can be summarized by
the following commands:

```
nsys profile -o profile python application.py
```
If we want to launch with MPI, the profiler command wraps the application inside `mpiexec`:

```
mpiexec ... nsys profile ... python application.py ...
```
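A minimal `ncu` invocation follows the same pattern; the sketch below assumes the
`--target-processes all` flag (so that child processes spawned by `python` are
also profiled) and an output report named `profile`:

```
ncu -o profile --target-processes all python application.py
```
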
These commands show the basic command-line structure for deploying the
profilers. Below we discuss important use cases that are relevant to
large-scale distributed profiling.

### An `nsys` wrapper

We can use `nsys` to trace an application running on multiple ranks and
multiple nodes. A simple example is shown below, where we use a wrapper script
to trace rank 0 on each node of a 2-node job running a PyTorch application:

```
#!/bin/bash
## This wrapper is used with the nsys profiler to trace a job running on any number of nodes.
## In this example it is set up to trace rank 0 on each of the first 2 nodes, even when
## the job is running on more than 2 nodes.
FNAME_EXT=$(basename "$2")
FNAME="${FNAME_EXT%%.*}"

NNODES=`wc -l < $PBS_NODEFILE`

WORK_DIR=/path/to/the/Python/application
DTAG=$(date +%F_%H%M%S)
PROFILER_OUTDIR=${WORK_DIR}/profiles/choice_of_name_nsys_n${NNODES}_${DTAG}/${FNAME}_n${NNODES}_${DTAG}
RUN_ID=choice_of_name_nsys_n${NNODES}_${DTAG}

mkdir -p ${PROFILER_OUTDIR}
NSYS_OPTS=" -o ${PROFILER_OUTDIR}/${RUN_ID}_%q{PMI_RANK} --stats=true --show-output=true "

PROFRANK=0
RANKCUTOFF=8

if [[ $PALS_LOCAL_RANKID -eq $PROFRANK ]] && [[ $PMI_RANK -lt $RANKCUTOFF ]]; then
echo "On rank ${PMI_RANK}, collecting traces "
nsys profile $NSYS_OPTS "$@"
else
"$@"
fi
```
There are a few important things to notice in the wrapper.

- `NSYS_OPTS`: These are the options that `nsys` uses to trace data at
different levels. An exhaustive list of options can be found in the
[nsys user guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html).
Note that `%q{PMI_RANK}` is essential to get a per-rank profile; a sketch of an
adjusted option set follows this list.


- `PROFRANK`: As implemented, this variable is set by the user to select the
local rank to trace. For example, this wrapper traces rank 0 on each node.

- `RANKCUTOFF`: This variable is Polaris specific. Since we can run as many as 4
ranks per node (without using MPS), the first 2 nodes of a job host 8 ranks.
This value is the upper cutoff on the global rank index, beyond which `nsys`
does not trace any rank. A user can change this number, based on the maximum
number of ranks per node, to control how many ranks are traced. `nsys` produces
one profile (an `nsys-rep` file, by default) per traced rank.

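As an example of controlling how much data is collected, `NSYS_OPTS` could be
adjusted with an explicit `--trace` selection; this is only a sketch, and the
trace domains listed here can be trimmed or extended as needed:

```
# A hypothetical alternative option set: trace only selected domains
# (CUDA, NVTX ranges, OS runtime, cuDNN, cuBLAS, and MPI calls).
NSYS_OPTS=" -o ${PROFILER_OUTDIR}/${RUN_ID}_%q{PMI_RANK} \
    --trace=cuda,nvtx,osrt,cudnn,cublas,mpi \
    --stats=true --show-output=true "
```
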
To view the produced trace files, we need NVIDIA's Nsight Systems GUI on a
local machine:

[Getting Started, Download Nsys](https://developer.nvidia.com/nsight-systems/get-started)

#### Deployment

The wrapper above can be deployed using a PBS job script in the following way:

```
#!/bin/bash -l
#PBS -l select=2:system=polaris
#PBS -l place=scatter
#PBS -l walltime=0:05:00
#PBS -q debug-scaling
#PBS -l filesystems=home:eagle
#PBS -A YOUR_ALLOCATION


# What's the benchmark work directory?
WORK_DIR=/path/to/the/Python/program
TEMPORARY_DIR=/path/to/a/temporary/directory/for/nsys/to/use
NSYS_WRAPPER=${WORK_DIR}/nsys_wrapper.sh

# MPI and OpenMP settings
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS_PER_NODE=4

let NRANKS=${NNODES}*${NRANKS_PER_NODE}

module use /soft/modulefiles/
module load conda/2024-04-29
conda activate

mpiexec -n ${NRANKS} -ppn ${NRANKS_PER_NODE} --env TMPDIR=${TEMPORARY_DIR} -l --line-buffer \
${NSYS_WRAPPER} python ${WORK_DIR}/application.py
```

Note that `--env TMPDIR=${TEMPORARY_DIR}` is critical for `nsys` to function
properly.

### An `ncu` wrapper

We can get kernel-level information (for example, roofline analysis or Tensor
Core usage) using NVIDIA's Nsight Compute profiler. Below is a simple wrapper
script to show the usage.

```
#!/bin/bash
FNAME_EXT=$(basename "$2")
FNAME="${FNAME_EXT%%.*}"

NNODES=`wc -l < $PBS_NODEFILE`

WORK_DIR=/path/to/the/Python/program
DTAG=$(date +%F_%H%M%S)
PROFILER_OUTDIR=${WORK_DIR}/profiles/choice_of_name_ncu_n${NNODES}_${DTAG}/${FNAME}_n${NNODES}_${DTAG}
RUN_ID=choice_of_name_ncu_n${NNODES}_${DTAG}

mkdir -p ${PROFILER_OUTDIR}
#KERNEL_NAME=ampere_sgemm_128x128_tn
KERNEL_NAME=ampere_bf16_s16816gemm_bf16_128x256_ldg8_f2f_stages_64x3_tn
#NCU_OPTS_DETAILED=" --set detailed -k ${KERNEL_NAME} -o ${PROFILER_OUTDIR}/${RUN_ID}_%q{PMI_RANK} "
NCU_OPTS_ROOFLINE=" --set roofline -k ${KERNEL_NAME} -o ${PROFILER_OUTDIR}/${RUN_ID}_%q{PMI_RANK} "
#NCU_OPTS_FULL=" --set full -k ${KERNEL_NAME} -o ${PROFILER_OUTDIR}/${RUN_ID}_%q{PMI_RANK} "

PROFRANK=0
RANKCUTOFF=8

if [[ $PALS_LOCAL_RANKID -eq $PROFRANK ]] && [[ $PMI_RANK -lt $RANKCUTOFF ]]; then
echo "On rank ${PMI_RANK}, collecting traces "
ncu $NCU_OPTS_ROOFLINE "$@"
else
"$@"
fi
```

This wrapper can be deployed in the same way as the `nsys` example above. In the
`ncu` wrapper we explicitly set the name of the kernel that we want to analyze
(a GEMM kernel in this case).
An exhaustive list of options controlling the amount of data collected can be
found in the
[command line section](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#command-line-options)
of the documentation. Here we only show the standard option sets (`detailed`,
`roofline`, and `full`); any of the three may be chosen. Note that each option
set leads to a different profiler run time, which is important when setting the
requested wall time for your batch job.

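If the exact kernel name is not known in advance, the `-k` option can (to the
best of our knowledge) take a `regex:` prefix to match kernels by pattern, and
`--launch-count` limits how many launches are profiled; a hedged sketch of such
a variant:

```
# Hypothetical variant: match any kernel whose name contains "gemm"
# and profile only the first matching launch per traced rank.
NCU_OPTS_ROOFLINE=" --set roofline -k regex:gemm --launch-count 1 \
    -o ${PROFILER_OUTDIR}/${RUN_ID}_%q{PMI_RANK} "
```
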
`ncu` generates an `ncu-rep` file for each traced rank, and we need NVIDIA's
Nsight Compute GUI on the local machine to view them:

[Download Nsight Compute](https://developer.nvidia.com/tools-overview/nsight-compute/get-started)

The next step is to load the `nsys-rep` files into the Nsight Systems GUI and
the `ncu-rep` files into the Nsight Compute GUI.

### For a single rank run

#### `nsys` profiles
In the single-rank case, we go to the top left, select `File` --> `Open`, and
choose the file we want to look at. For this particular example, we focus on
the GPU activities, shown in the second column from the left, labeled
`CUDA HW ...`. Expanding the `CUDA HW ...` tab reveals an `NCCL` tab, which
shows the communication library calls.

#### `ncu` profiles
The primary qualitative distinction between the `nsys-rep` files and the
`ncu-rep` files is that the `nsys-rep` file presents data for the overall
execution of the application, whereas the `ncu-rep` file presents data for the
execution of one particular kernel. Our setup here traces only one kernel;
multiple kernels can be traced at a time, but that can become a time-consuming
process.

We use the `--stats=true --show-output=true` options (see `nsys_wrapper.sh`)
while collecting the `nsys` data. As a result, we get a system-wide summary in
our `.OU` files (when run with a job submission script, otherwise on the
terminal), where we can find the names of the kernels used for compute and
communication. Often we start by investigating the kernels that are called the
most times or that account for the most execution time. In this particular
instance we chose to analyze the GEMM kernels, which perform matrix
multiplication. The full name of the kernel is passed to the `ncu` profiler
with the `-k` option (see `ncu_wrapper.sh`).

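The same summary can also be regenerated later from a saved report with the
`nsys stats` subcommand; this is a sketch, and report names such as
`cuda_gpu_kern_sum` may differ between `nsys` versions:

```
# Regenerate the GPU kernel summary from an existing report (file name hypothetical).
nsys stats --report cuda_gpu_kern_sum report_file.nsys-rep
```
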
Loading the `ncu-rep` files works similarly to the `nsys-rep` files. Here the
important tab is the `Details` tab, found in the third row from the top. Under
that tab is the `GPU Speed of Light Throughput` section, which contains plots
showing GPU compute and memory usage. On the right-hand side of the tab, a menu
lets us select which plot to display: the roofline plot or the compute-memory
throughput chart.

### For a multi-rank run

#### `nsys` profiles
When we have traced multiple ranks, whether from a single node or multiple
nodes, the Nsight Systems GUI allows us to view the reports on a single
combined timeline (the same time axis for all reports). This is done through
the "multi-report view": either `File` --> `New multi-report view`, or
`File` --> `Open` followed by selecting however many reports we want to see on
a combined timeline, at which point `nsys` prompts the user to open a
"multi-report view". The reports can also be viewed separately.

### Profiler Options
For both `nsys` and `ncu` we have used the standard option sets to generate the
profiles. Exhaustive lists can be found in the respective documentation pages:

[Nsight System User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html)

[Nsight Compute Documentation](https://docs.nvidia.com/nsight-compute/)

[Nsight Compute CLI](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html)

These reports provide much more information than discussed here; we have only
covered how to view the high-level information.


2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -74,6 +74,7 @@ nav:
- Data Science:
- Julia: polaris/data-science/julia.md
- Python: polaris/data-science/python.md
- Profiling: polaris/data-science/profiling_dl.md
- Frameworks:
- TensorFlow: polaris/data-science/frameworks/tensorflow.md
- PyTorch: polaris/data-science/frameworks/pytorch.md
@@ -203,6 +204,7 @@ nav:
#- Applications:
#- gpt-neox: aurora/data-science/applications/gpt-neox.md
- Containers: aurora/data-science/containers/containers.md
- Profiling: aurora/data-science/profiling_dl.md
- Frameworks:
#- DeepSpeed: aurora/data-science/frameworks/deepspeed.md
#- JAX: aurora/data-science/frameworks/jax.md