From 2f68c08cc0658fb5bb719634605c6589dfb9afad Mon Sep 17 00:00:00 2001
From: Andrew Shao
Date: Mon, 23 Sep 2024 12:53:02 -0700
Subject: [PATCH] Refine install documentation for Perlmutter and Frontier (#717)

After discussing with admins at OLCF, miniforge is the preferred
solution for creating virtual environments on Frontier. The
instructions for installing SmartSim have been updated accordingly.
Additionally, perlmutter did not have a step for compiling the
SmartRedis libraries. This has been rectified to bring the two systems
to parity.

[ committed by @ashao ]
[ reviewed by @MattToast @AlyssaCote ]
---
 doc/changelog.md                              | 58 ++++++++-----------
 .../platform/frontier.rst                     | 41 +++++++------
 .../platform/perlmutter.rst                   | 21 +++++--
 3 files changed, 64 insertions(+), 56 deletions(-)

diff --git a/doc/changelog.md b/doc/changelog.md
index 8dcb08d3a..d7d6905ff 100644
--- a/doc/changelog.md
+++ b/doc/changelog.md
@@ -9,9 +9,9 @@ Jump to:
 
 ## SmartSim
 
-### Cuda 12 and ROCm support branch
+### Development branch
 
-To be merged into `develop` at some future point in time
+To be released at some future point in time
 
 Description
 
@@ -21,34 +21,7 @@ Description
 - Fine grain build support for GPUs
 - Update Torch to 2.1.0, Tensorflow to 2.15.0
 - Better error messages in build process
-
-Detailed Notes
-
-- The RedisAIBuilder class was completely overhauled to allow users to
-  express a wider range of support for hardware/software stacks. This
-  will be extended to support ROCm, CUDA-11, and CUDA-12.
-- Versions for each of these packages are no longer specified in an
-  internal class. Instead a default set of JSON files specifies the
-  sources and versions. Users can specify their own custom specifications
-  at smart build time
-- Two new Dockerfiles are now provided (one each for 11.8 and 12.1) that
-  can be used to build a container to run the tutorials. No HPC support
-  should be expected at this time
-- SmartSim can now be built using Cuda version 11.8 or Cuda 12.1 by specify
-  `smart build --device=cuda118` or `smart build --device=cuda121`. The
-  original `smart build --device=gpu` will default to using Cuda 11.8.
-- As a result of the previous change, SmartSim now requires C++17 and a
-  minimum Cuda version of 11.8 in order to build Torch 2.1.0.
-- Error messages were not being interpolated correctly. This has been
-  addressed to provide more context when exposing error messages to users.
-
-### Development branch
-
-To be released at some future point in time
-
-Description
-
-- Allow specifying Model and Ensemble parameters with
+- Allow specifying Model and Ensemble parameters with
   number-like types (e.g. numpy types)
 - Pin watchdog to 4.x
 - Update codecov to 4.5.0
@@ -66,9 +39,28 @@ Description
 
 Detailed Notes
 
-- The serializer would fail if a parameter for a Model or Ensemble
-  was specified as a numpy dtype. The constructors for these
-  methods now validate that the input is number-like and convert
+- The RedisAIBuilder class was completely overhauled to allow users to
+  express a wider range of support for hardware/software stacks. This
+  will be extended to support ROCm, CUDA-11, and CUDA-12.
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- Versions for each of these packages are no longer specified in an
+  internal class. Instead a default set of JSON files specifies the
+  sources and versions. Users can specify their own custom specifications
+  at smart build time
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- Two new Dockerfiles are now provided (one each for 11.8 and 12.1) that
+  can be used to build a container to run the tutorials. No HPC support
+  should be expected at this time
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- As a result of the previous change, SmartSim now requires C++17 and a
+  minimum Cuda version of 11.8 in order to build Torch 2.1.0.
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- Error messages were not being interpolated correctly. This has been
+  addressed to provide more context when exposing error messages to users.
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- The serializer would fail if a parameter for a Model or Ensemble
+  was specified as a numpy dtype. The constructors for these
+  methods now validate that the input is number-like and convert
   them to strings
   ([SmartSim-PR676](https://github.com/CrayLabs/SmartSim/pull/676))
 - Pin watchdog to 4.x because v5 introduces new types and requires
diff --git a/doc/installation_instructions/platform/frontier.rst b/doc/installation_instructions/platform/frontier.rst
index d4db76a6d..996688fc7 100644
--- a/doc/installation_instructions/platform/frontier.rst
+++ b/doc/installation_instructions/platform/frontier.rst
@@ -7,8 +7,9 @@ Known limitations
 We are continually working on getting all the features of SmartSim working on
 Frontier, however we do have some known limitations:
 
-* For now, only Torch and ONNX runtime models are supported. If you need
-  Tensorflow support please contact us
+* For now, only Torch models are supported. If you need Tensorflow or ONNX
+  support please contact us
+* All SmartSim experiments must be run from Lustre, _not_ your home directory
 * The colocated database will fail without specifying ``custom_pinning``. This
   is because the default pinning assumes that processor 0 is available, but the
   'low-noise' default on Frontier reserves the processor on each NUMA node.
@@ -30,22 +31,28 @@ these instructions, being sure to set the following variables
 .. code:: bash
 
    export PROJECT_NAME=CHANGE_ME
-   export VENV_NAME=CHANGE_ME
 
 **Step 1:** Create and activate a virtual environment for SmartSim:
 
 .. code:: bash
 
-   module load PrgEnv-gnu cray-python
-   module load rocm/6.1.3
+   module load PrgEnv-gnu miniforge3 rocm/6.1.3
 
   export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
-   export VENV_HOME=$SCRATCH/$VENV_NAME/
+   conda create -n smartsim python=3.11
+   conda activate smartsim
 
-   python3 -m venv $VENV_HOME
-   source $VENV_HOME/bin/activate
+**Step 1 (Optional):** If this is your first time using miniforge on
+Frontier you may also have to execute the following before being able
+to activate the ``smartsim`` environment
 
-**Step 2:** Install SmartSim in the conda environment:
+.. code:: bash
+
+   conda init
+   source ~/.bashrc
+   conda activate smartsim
+
+**Step 2:** Build the SmartRedis C++ and Fortran libraries:
 
 .. code:: bash
 
@@ -55,17 +62,20 @@ these instructions, being sure to set the following variables
    make lib-with-fortran
    pip install .
 
-   # Download SmartSim and site-specific files
+**Step 3:** Install SmartSim in the conda environment:
+
+.. code:: bash
+
    cd $SCRATCH
    pip install git+https://github.com/CrayLabs/SmartSim.git
 
-**Step 3:** Build Redis, RedisAI, the backends, and all the Python packages:
+**Step 4:** Build Redis, RedisAI, the backends, and all the Python packages:
 
 .. code:: bash
 
    smart build --device=rocm-6
 
-**Step 4:** Check that SmartSim has been installed and built correctly:
+**Step 5:** Check that SmartSim has been installed and built correctly:
 
 .. code:: bash
 
@@ -89,12 +99,11 @@ build, and some variables should be set to optimize performance:
 
    # Set these to the same values that were used for install
    export PROJECT_NAME=CHANGE_ME
-   export VENV_NAME=CHANGE_ME
 
 .. code:: bash
 
-   module load PrgEnv-gnu
-   module load rocm/6.1.3
+   module load PrgEnv-gnu miniforge3 rocm/6.1.3
+   conda activate smartsim
 
    # Optimizations for inference
    export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
@@ -102,8 +111,6 @@ build, and some variables should be set to optimize performance:
    export MIOPEN_SYSTEM_DB_PATH=$MIOPEN_USER_DB_PATH
    mkdir -p $MIOPEN_USER_DB_PATH
    export MIOPEN_DISABLE_CACHE=1
-   export VENV_HOME=$SCRATCH/$VENV_NAME/
-   source $VENV_HOME/bin/activate
 
 Binding DBs to Slingshot
 ------------------------
diff --git a/doc/installation_instructions/platform/perlmutter.rst b/doc/installation_instructions/platform/perlmutter.rst
index 6d1e22e1e..71f97a4dc 100644
--- a/doc/installation_instructions/platform/perlmutter.rst
+++ b/doc/installation_instructions/platform/perlmutter.rst
@@ -10,24 +10,33 @@ To install SmartSim on Perlmutter, follow these steps:
 
 .. code:: bash
 
-   module load conda
+   module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12 PrgEnv-gnu
    conda create -n smartsim python=3.11
    conda activate smartsim
 
-**Step 2:** Install SmartSim in the conda environment:
+**Step 2:** Build the SmartRedis C++ and Fortran libraries:
+
+.. code:: bash
+
+   git clone https://github.com/CrayLabs/SmartRedis.git
+   cd SmartRedis
+   make lib-with-fortran
+   pip install .
+   cd ..
+
+**Step 3:** Install SmartSim in the conda environment:
 
 .. code:: bash
 
    pip install git+https://github.com/CrayLabs/SmartSim.git
 
-**Step 3:** Build Redis, RedisAI, the backends, and all the Python packages:
+**Step 4:** Build Redis, RedisAI, the backends, and all the Python packages:
 
 .. code:: bash
 
-   module load cudatoolkit/12.2 cudnn/8.9.3_cuda12
    smart build --device=cuda-12
 
-**Step 4:** Check that SmartSim has been installed and built correctly:
+**Step 5:** Check that SmartSim has been installed and built correctly:
 
 .. code:: bash
 
@@ -51,5 +60,5 @@ can reload the conda environment by running the following commands:
 
 .. code:: bash
 
-   module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12
+   module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12 PrgEnv-gnu
    conda activate smartsim