Refine install documentation for Perlmutter and Frontier (#717)

After discussion with admins at OLCF, miniforge is the preferred solution for creating virtual environments on Frontier. The instructions for installing SmartSim have been updated accordingly. Additionally, the Perlmutter instructions did not have a step for compiling the SmartRedis libraries. This has been rectified to bring the two systems to parity. [ committed by @ashao ] [ reviewed by @MattToast @AlyssaCote ]
CrayLabs · Sep 23, 2024 · 2f68c08
1 parent 5fb8eb4
commit 2f68c08

Showing 3 changed files with 64 additions and 56 deletions.
diff --git a/doc/changelog.md b/doc/changelog.md
@@ -9,9 +9,9 @@ Jump to:
 
 ## SmartSim
 
-###  Cuda 12 and ROCm support branch
+### Development branch
 
-To be merged into `develop` at some future point in time
+To be released at some future point in time
 
 Description
 
@@ -21,34 +21,7 @@ Description
 - Fine grain build support for GPUs
 - Update Torch to 2.1.0, Tensorflow to 2.15.0
 - Better error messages in build process
-
-Detailed Notes
-
-- The RedisAIBuilder class was completely overhauled to allow users to
-  express a wider range of support for hardware/software stacks. This 
-  will be extended to support ROCm, CUDA-11, and CUDA-12.
-- Versions for each of these packages are no longer specified in an
-  internal class. Instead a default set of JSON files specifies the
-  sources and versions. Users can specify their own custom specifications
-  at smart build time
-- Two new Dockerfiles are now provided (one each for 11.8 and 12.1) that
-  can be used to build a container to run the tutorials. No HPC support
-  should be expected at this time
-- SmartSim can now be built using Cuda version 11.8 or Cuda 12.1 by specify
-  `smart build --device=cuda118` or `smart build --device=cuda121`. The
-  original `smart build --device=gpu` will default to using Cuda 11.8.
-- As a result of the previous change, SmartSim now requires C++17 and a
-  minimum Cuda version of 11.8 in order to build Torch 2.1.0.
-- Error messages were not being interpolated correctly. This has been
-  addressed to provide more context when exposing error messages to users.
-
-### Development branch
-
-To be released at some future point in time
-
-Description
-
-- Allow specifying Model and Ensemble parameters with 
+- Allow specifying Model and Ensemble parameters with
   number-like types (e.g. numpy types)
 - Pin watchdog to 4.x
 - Update codecov to 4.5.0
@@ -66,9 +39,28 @@ Description
 
 Detailed Notes
 
-- The serializer would fail if a parameter for a Model or Ensemble 
-  was specified as a numpy dtype. The constructors for these 
-  methods now validate that the input is number-like and convert 
+- The RedisAIBuilder class was completely overhauled to allow users to
+  express a wider range of support for hardware/software stacks. This
+  will be extended to support ROCm, CUDA-11, and CUDA-12.
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- Versions for each of these packages are no longer specified in an
+  internal class. Instead a default set of JSON files specifies the
+  sources and versions. Users can specify their own custom specifications
+  at smart build time
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- Two new Dockerfiles are now provided (one each for 11.8 and 12.1) that
+  can be used to build a container to run the tutorials. No HPC support
+  should be expected at this time
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- As a result of the previous change, SmartSim now requires C++17 and a
+  minimum Cuda version of 11.8 in order to build Torch 2.1.0.
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- Error messages were not being interpolated correctly. This has been
+  addressed to provide more context when exposing error messages to users.
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- The serializer would fail if a parameter for a Model or Ensemble
+  was specified as a numpy dtype. The constructors for these
+  methods now validate that the input is number-like and convert
   them to strings
   ([SmartSim-PR676](https://github.com/CrayLabs/SmartSim/pull/676))
 - Pin watchdog to 4.x because v5 introduces new types and requires

diff --git a/doc/installation_instructions/platform/frontier.rst b/doc/installation_instructions/platform/frontier.rst
@@ -7,8 +7,9 @@ Known limitations
 We are continually working on getting all the features of SmartSim working on
 Frontier, however we do have some known limitations:
 
-* For now, only Torch and ONNX runtime models are supported. If you need
-  Tensorflow support please contact us
+* For now, only Torch models are supported. If you need Tensorflow or ONNX
+  support please contact us
+* All SmartSim experiments must be run from Lustre, _not_ your home directory
 * The colocated database will fail without specifying ``custom_pinning``. This
   is because the default pinning assumes that processor 0 is available, but the
   'low-noise' default on Frontier reserves the processor on each NUMA node.
@@ -30,22 +31,28 @@ these instructions, being sure to set the following variables
 .. code:: bash
 
    export PROJECT_NAME=CHANGE_ME
-   export VENV_NAME=CHANGE_ME
 
 **Step 1:** Create and activate a virtual environment for SmartSim:
 
 .. code:: bash
 
-   module load PrgEnv-gnu cray-python
-   module load rocm/6.1.3
+   module load PrgEnv-gnu miniforge3 rocm/6.1.3
 
    export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
-   export VENV_HOME=$SCRATCH/$VENV_NAME/
+   conda create -n smartsim python=3.11
+   conda activate smartsim
 
-   python3 -m venv $VENV_HOME
-   source $VENV_HOME/bin/activate
+**Step 1 (Optional):** If this is your first time using miniforge on
+Frontier you may also have to execute the following before being able
+to activate the ``smartsim`` environment
 
-**Step 2:** Install SmartSim in the conda environment:
+.. code:: bash
+
+   conda init
+   source ~/.bashrc
+   conda activate smartsim
+
+**Step 2:** Build the SmartRedis C++ and Fortran libraries:
 
 .. code:: bash
 
@@ -55,17 +62,20 @@ these instructions, being sure to set the following variables
    make lib-with-fortran
    pip install .
 
-   # Download SmartSim and site-specific files
+**Step 3:** Install SmartSim in the conda environment:
+
+.. code:: bash
+
    cd $SCRATCH
    pip install git+https://github.com/CrayLabs/SmartSim.git
 
-**Step 3:** Build Redis, RedisAI, the backends, and all the Python packages:
+**Step 4:** Build Redis, RedisAI, the backends, and all the Python packages:
 
 .. code:: bash
 
    smart build --device=rocm-6
 
-**Step 4:** Check that SmartSim has been installed and built correctly:
+**Step 5:** Check that SmartSim has been installed and built correctly:
 
 .. code:: bash
 
@@ -89,21 +99,18 @@ build, and some variables should be set to optimize performance:
 
    # Set these to the same values that were used for install
    export PROJECT_NAME=CHANGE_ME
-   export VENV_NAME=CHANGE_ME
 
 .. code:: bash
 
-   module load PrgEnv-gnu
-   module load rocm/6.1.3
+   module load PrgEnv-gnu miniforge3 rocm/6.1.3
+   conda activate smartsim
 
    # Optimizations for inference
    export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
    export MIOPEN_USER_DB_PATH=/tmp/miopendb/
    export MIOPEN_SYSTEM_DB_PATH=$MIOPEN_USER_DB_PATH
    mkdir -p $MIOPEN_USER_DB_PATH
    export MIOPEN_DISABLE_CACHE=1
-   export VENV_HOME=$SCRATCH/$VENV_NAME/
-   source $VENV_HOME/bin/activate
 
 Binding DBs to Slingshot
 ------------------------

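The Frontier changes above add the limitation that experiments must be run from Lustre rather than the home directory. A minimal sketch of a guard for that is given here. It assumes the ``/lustre/orion/`` prefix used for ``SCRATCH`` in the instructions, and ``is_on_lustre`` is a hypothetical helper name that is not part of SmartSim or of this commit.

```shell
# Hypothetical helper (not part of SmartSim): succeed only when the given
# path, or the current directory if none is given, sits under the Lustre
# prefix that the Frontier instructions use for SCRATCH.
is_on_lustre() {
    target="${1:-$PWD}"
    case "$target" in
        /lustre/orion/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Warn (without aborting) when the current directory is not on Lustre.
if ! is_on_lustre; then
    echo "Warning: $PWD is not under /lustre/orion/; run SmartSim experiments from Lustre" >&2
fi
```

The check is a plain string-prefix match on the logical path, so it does not resolve symlinks; treat it as a convenience warning rather than a guarantee.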
diff --git a/doc/installation_instructions/platform/perlmutter.rst b/doc/installation_instructions/platform/perlmutter.rst
@@ -10,24 +10,33 @@ To install SmartSim on Perlmutter, follow these steps:
 
 .. code:: bash
 
-    module load conda
+    module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12 PrgEnv-gnu
     conda create -n smartsim python=3.11
     conda activate smartsim
 
-**Step 2:** Install SmartSim in the conda environment:
+**Step 2:** Build the SmartRedis C++ and Fortran libraries:
+
+.. code:: bash
+
+    git clone https://github.com/CrayLabs/SmartRedis.git
+    cd SmartRedis
+    make lib-with-fortran
+    pip install .
+    cd ..
+
+**Step 3:** Install SmartSim in the conda environment:
 
 .. code:: bash
 
     pip install git+https://github.com/CrayLabs/SmartSim.git
 
-**Step 3:** Build Redis, RedisAI, the backends, and all the Python packages:
+**Step 4:** Build Redis, RedisAI, the backends, and all the Python packages:
 
 .. code:: bash
 
-    module load cudatoolkit/12.2 cudnn/8.9.3_cuda12
     smart build --device=cuda-12
 
-**Step 4:** Check that SmartSim has been installed and built correctly:
+**Step 5:** Check that SmartSim has been installed and built correctly:
 
 .. code:: bash
 
@@ -51,5 +60,5 @@ can reload the conda environment by running the following commands:
 
 .. code:: bash
 
-    module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12
+    module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12 PrgEnv-gnu
     conda activate smartsim
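
After reloading the modules and activating the environment as in the hunk above, it can be useful to confirm that the expected commands resolve before launching anything. The following is a rough sketch only. ``missing_tools`` is a hypothetical helper, and the choice of ``conda`` and ``smart`` as the commands to check is an assumption based on the commands used in these instructions.

```shell
# Hypothetical helper: print the name of each given command that cannot be
# resolved in the current shell (as a PATH entry, function, or alias).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Report anything from the instructions that is not available.
missing="$(missing_tools conda smart)"
if [ -n "$missing" ]; then
    echo "Not found in this shell:" $missing >&2
fi
```

An empty result only shows that the commands resolve; it does not verify that the SmartSim build itself succeeded, which is what the final check step in the instructions is for.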