In this repository we present the supplemental material for our manuscript "Demonstration of Portable Performance of Scientific Machine Learning on High Performance Computing Systems" submitted to the High Performance Python for Science at Scale (HPPSS) Workshop, which is part of the SC23 conference.
In this manuscript, we present training performance metrics (measured as throughput, e.g., images/sec or samples/sec) for a suite of scientific AI/ML applications on three state-of-the-art GPUs: the NVIDIA A100, the AMD MI250, and the Intel Data Center GPU Max 1550.
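To make the metric concrete, throughput here is simply samples processed per unit of wall time. A minimal sketch (the function and variable names are ours, not taken from any application in the suite):

```python
def throughput(num_steps: int, batch_size: int, elapsed_sec: float) -> float:
    """Samples processed per second over a timed stretch of training."""
    return num_steps * batch_size / elapsed_sec

# 100 steps at batch size 32, finished in 4 s -> 800.0 samples/sec
print(throughput(100, 32, 4.0))
```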
In this repository, we provide the installation instructions (which might come from specific application repositories) and the version details of the specific applications that we used to produce the performance data.
The following is the list of applications that we have studied:
For most of our applications, we followed the strategy of `pip` installing the application dependencies into a `conda` environment. As we were measuring GPU performance, we installed the GPU libraries (for example, CUDA and the CUDA toolkit) through `conda`, and `pip` installed the frameworks (PyTorch and TensorFlow). Distributed training frameworks such as Horovod were also installed through `pip`. The `versions.pdf` file at the top level of the repository has detailed information about the relevant packages.
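The install pattern above can be sketched as follows; the environment name, Python version, and package versions are illustrative, not the ones actually used (see `versions.pdf` for those):

```shell
# Illustrative install pattern: GPU libraries via conda, frameworks via pip.
conda create -n sciml python=3.9 cudatoolkit -c conda-forge
conda activate sciml

# Frameworks installed with pip into the conda environment.
pip install torch tensorflow

# Distributed training framework, also via pip.
pip install horovod
```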
In practice, we have taken advantage of pre-built modules on ALCF machines such as Polaris, JLSE nodes, and Sunspot. On each machine, we create a virtual environment with `--system-site-packages` from the base `conda` module, load the appropriate compiler modules and MPI backends, and `pip` install the frameworks with the other dependencies into the virtual environment.
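The per-machine workflow above might look like the following sketch; the module names and paths vary by machine and are placeholders here:

```shell
# Illustrative ALCF-style setup; module names are machine-dependent placeholders.
module load conda            # base conda module with pre-built packages
conda activate base

# Virtual environment that can see the base conda site-packages.
python -m venv --system-site-packages "$HOME/venvs/sciml"
source "$HOME/venvs/sciml/bin/activate"

# Frameworks and remaining dependencies go into the venv via pip
# (e.g., Horovod built against the loaded compiler and MPI modules).
pip install horovod
```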
- We have provided sample run scripts for our applications. These scripts display most of the relevant argument flags available in each application.
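As an illustration only, a launch line in such a run script might look like the following; the flag values, binding list, and script name are placeholders, not copied from any actual run script:

```shell
# Illustrative single-rank launch with explicit CPU binding (values are placeholders).
mpiexec -n 1 --ppn 1 --cpu-bind list:0-7 \
    python train.py --batch-size 32 --precision fp32
```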
- To get the results, we ran with `mpiexec` even for rank-1 trainings and used appropriate CPU bindings to improve single-rank throughput.
- Specific environment variables:
  - To run an application with `FP32` on A100, one must override the default `TF32` data type. The flag to do so: `export NVIDIA_TF32_OVERRIDE=0`
  - To run an application with `TF32` on the Intel Max GPU, one must set `export ITEX_FP32_MATH_MODE=TF32` (for TensorFlow) or `export IPEX_FP32_MATH_MODE=TF32` (for PyTorch). These variables come from Intel's TensorFlow and PyTorch extensions (ITEX and IPEX).
- To run an application with