Refine install documentation for Perlmutter and Frontier (#717)
Following discussion with admins at OLCF, miniforge is the preferred
solution for creating virtual environments on Frontier. The instructions
for installing SmartSim have been updated accordingly. Additionally,
the Perlmutter instructions lacked a step for compiling the SmartRedis
libraries. This step has been added to bring the two systems to parity.

[ committed by @ashao ]
[ reviewed by @MattToast @AlyssaCote ]
ashao authored Sep 23, 2024
1 parent 5fb8eb4 commit 2f68c08
Showing 3 changed files with 64 additions and 56 deletions.
58 changes: 25 additions & 33 deletions doc/changelog.md
@@ -9,9 +9,9 @@ Jump to:

## SmartSim

### Cuda 12 and ROCm support branch
### Development branch

To be merged into `develop` at some future point in time
To be released at some future point in time

Description

@@ -21,34 +21,7 @@ Description
- Fine grain build support for GPUs
- Update Torch to 2.1.0, Tensorflow to 2.15.0
- Better error messages in build process

Detailed Notes

- The RedisAIBuilder class was completely overhauled to allow users to
express a wider range of support for hardware/software stacks. This
will be extended to support ROCm, CUDA-11, and CUDA-12.
- Versions for each of these packages are no longer specified in an
internal class. Instead a default set of JSON files specifies the
sources and versions. Users can specify their own custom specifications
at smart build time
- Two new Dockerfiles are now provided (one each for 11.8 and 12.1) that
can be used to build a container to run the tutorials. No HPC support
should be expected at this time
- SmartSim can now be built using Cuda version 11.8 or Cuda 12.1 by specifying
`smart build --device=cuda118` or `smart build --device=cuda121`. The
original `smart build --device=gpu` will default to using Cuda 11.8.
- As a result of the previous change, SmartSim now requires C++17 and a
minimum Cuda version of 11.8 in order to build Torch 2.1.0.
- Error messages were not being interpolated correctly. This has been
addressed to provide more context when exposing error messages to users.

### Development branch

To be released at some future point in time

Description

- Allow specifying Model and Ensemble parameters with
- Allow specifying Model and Ensemble parameters with
number-like types (e.g. numpy types)
- Pin watchdog to 4.x
- Update codecov to 4.5.0
@@ -66,9 +39,28 @@ Description

Detailed Notes

- The serializer would fail if a parameter for a Model or Ensemble
was specified as a numpy dtype. The constructors for these
methods now validate that the input is number-like and convert
- The RedisAIBuilder class was completely overhauled to allow users to
express a wider range of support for hardware/software stacks. This
will be extended to support ROCm, CUDA-11, and CUDA-12.
([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
- Versions for each of these packages are no longer specified in an
internal class. Instead a default set of JSON files specifies the
sources and versions. Users can specify their own custom specifications
at smart build time
([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
- Two new Dockerfiles are now provided (one each for 11.8 and 12.1) that
can be used to build a container to run the tutorials. No HPC support
should be expected at this time
([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
- As a result of the previous change, SmartSim now requires C++17 and a
minimum Cuda version of 11.8 in order to build Torch 2.1.0.
([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
- Error messages were not being interpolated correctly. This has been
addressed to provide more context when exposing error messages to users.
([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
- The serializer would fail if a parameter for a Model or Ensemble
was specified as a numpy dtype. The constructors for these
methods now validate that the input is number-like and convert
them to strings
([SmartSim-PR676](https://github.com/CrayLabs/SmartSim/pull/676))
- Pin watchdog to 4.x because v5 introduces new types and requires
41 changes: 24 additions & 17 deletions doc/installation_instructions/platform/frontier.rst
@@ -7,8 +7,9 @@ Known limitations
We are continually working on getting all the features of SmartSim working on
Frontier; however, we do have some known limitations:

* For now, only Torch and ONNX runtime models are supported. If you need
Tensorflow support please contact us
* For now, only Torch models are supported. If you need Tensorflow or ONNX
support please contact us
* All SmartSim experiments must be run from Lustre, *not* your home directory
* The colocated database will fail without specifying ``custom_pinning``. This
is because the default pinning assumes that processor 0 is available, but the
'low-noise' default on Frontier reserves the processor on each NUMA node.
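The Lustre requirement in the list above can be guarded against with a small check before launching an experiment. The following is a minimal sketch, not part of SmartSim: the function name ``is_on_lustre`` is hypothetical, and it assumes that any path beginning with ``/lustre/`` (the prefix used for ``SCRATCH`` in these instructions) is on Lustre. It inspects only the path string, so it does not resolve symbolic links.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SmartSim): succeeds only when the given
# path string begins with /lustre/, the prefix used in these instructions.
is_on_lustre() {
    case "$1" in
        /lustre/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: warn before launching from a directory that is not on Lustre.
if ! is_on_lustre "$PWD"; then
    echo "Warning: $PWD is not under /lustre/; change to your Lustre scratch space first" >&2
fi
```

A job script could call the same function and exit early instead of only printing a warning.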
@@ -30,22 +31,28 @@ these instructions, being sure to set the following variables
.. code:: bash
export PROJECT_NAME=CHANGE_ME
export VENV_NAME=CHANGE_ME
**Step 1:** Create and activate a virtual environment for SmartSim:

.. code:: bash
module load PrgEnv-gnu cray-python
module load rocm/6.1.3
module load PrgEnv-gnu miniforge3 rocm/6.1.3
export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
export VENV_HOME=$SCRATCH/$VENV_NAME/
conda create -n smartsim python=3.11
conda activate smartsim
python3 -m venv $VENV_HOME
source $VENV_HOME/bin/activate
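Because ``SCRATCH`` is derived from ``PROJECT_NAME``, leaving the ``CHANGE_ME`` placeholder in place silently produces a path that does not exist. The sketch below shows one way to catch that early. The function name ``scratch_path`` is hypothetical and is not provided by SmartSim or OLCF; the path layout is copied from the ``export SCRATCH=...`` line above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the scratch path for a project and user,
# refusing an empty project name or the CHANGE_ME placeholder.
scratch_path() {
    if [ -z "$1" ] || [ "$1" = "CHANGE_ME" ]; then
        echo "PROJECT_NAME must be set to a real project name" >&2
        return 1
    fi
    # Same layout as the SCRATCH export in the instructions.
    printf '/lustre/orion/%s/scratch/%s/\n' "$1" "$2"
}

# Possible usage (commented out so that sourcing this file has no side effects):
# SCRATCH="$(scratch_path "$PROJECT_NAME" "$USER")" && export SCRATCH
```

Assigning first and exporting second keeps the function's exit status visible, since ``export VAR=$(cmd)`` would hide a failure of ``cmd``.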
**Step 1 (Optional):** If this is your first time using miniforge on
Frontier, you may also have to execute the following before you can
activate the ``smartsim`` environment:

**Step 2:** Install SmartSim in the conda environment:
.. code:: bash
conda init
source ~/.bashrc
conda activate smartsim
**Step 2:** Build the SmartRedis C++ and Fortran libraries:

.. code:: bash
@@ -55,17 +62,20 @@ these instructions, being sure to set the following variables
make lib-with-fortran
pip install .
# Download SmartSim and site-specific files
**Step 3:** Install SmartSim in the conda environment:

.. code:: bash
cd $SCRATCH
pip install git+https://github.com/CrayLabs/SmartSim.git
**Step 3:** Build Redis, RedisAI, the backends, and all the Python packages:
**Step 4:** Build Redis, RedisAI, the backends, and all the Python packages:

.. code:: bash
smart build --device=rocm-6
**Step 4:** Check that SmartSim has been installed and built correctly:
**Step 5:** Check that SmartSim has been installed and built correctly:

.. code:: bash
@@ -89,21 +99,18 @@ build, and some variables should be set to optimize performance:
# Set these to the same values that were used for install
export PROJECT_NAME=CHANGE_ME
export VENV_NAME=CHANGE_ME
.. code:: bash
module load PrgEnv-gnu
module load rocm/6.1.3
module load PrgEnv-gnu miniforge3 rocm/6.1.3
conda activate smartsim
# Optimizations for inference
export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
export MIOPEN_USER_DB_PATH=/tmp/miopendb/
export MIOPEN_SYSTEM_DB_PATH=$MIOPEN_USER_DB_PATH
mkdir -p $MIOPEN_USER_DB_PATH
export MIOPEN_DISABLE_CACHE=1
export VENV_HOME=$SCRATCH/$VENV_NAME/
source $VENV_HOME/bin/activate
Binding DBs to Slingshot
------------------------
21 changes: 15 additions & 6 deletions doc/installation_instructions/platform/perlmutter.rst
@@ -10,24 +10,33 @@ To install SmartSim on Perlmutter, follow these steps:

.. code:: bash
module load conda
module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12 PrgEnv-gnu
conda create -n smartsim python=3.11
conda activate smartsim
**Step 2:** Install SmartSim in the conda environment:
**Step 2:** Build the SmartRedis C++ and Fortran libraries:

.. code:: bash
git clone https://github.com/CrayLabs/SmartRedis.git
cd SmartRedis
make lib-with-fortran
pip install .
cd ..
**Step 3:** Install SmartSim in the conda environment:

.. code:: bash
pip install git+https://github.com/CrayLabs/SmartSim.git
**Step 3:** Build Redis, RedisAI, the backends, and all the Python packages:
**Step 4:** Build Redis, RedisAI, the backends, and all the Python packages:

.. code:: bash
module load cudatoolkit/12.2 cudnn/8.9.3_cuda12
smart build --device=cuda-12
**Step 4:** Check that SmartSim has been installed and built correctly:
**Step 5:** Check that SmartSim has been installed and built correctly:

.. code:: bash
@@ -51,5 +60,5 @@ can reload the conda environment by running the following commands:

.. code:: bash
module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12
module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12 PrgEnv-gnu
conda activate smartsim
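After reloading, it can be useful to confirm that the ``smartsim`` environment really is the active one before running anything. The following is a minimal sketch: the function name ``conda_env_is_active`` is hypothetical, and it relies on the ``CONDA_DEFAULT_ENV`` variable that conda sets when an environment is activated.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the named conda environment is the
# active one, based on the CONDA_DEFAULT_ENV variable set by conda.
conda_env_is_active() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

# Example: report whether the environment created in Step 1 is active.
if conda_env_is_active smartsim; then
    echo "smartsim environment is active"
else
    echo "smartsim environment is not active; run 'conda activate smartsim'" >&2
fi
```

The same check applies unchanged on Frontier, since both systems now use a conda environment named ``smartsim``.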
