diff --git a/doc/changelog.md b/doc/changelog.md
index 8dcb08d3a..d7d6905ff 100644
--- a/doc/changelog.md
+++ b/doc/changelog.md
@@ -9,9 +9,9 @@ Jump to:
 
 ## SmartSim
 
-### Cuda 12 and ROCm support branch
+### Development branch
 
-To be merged into `develop` at some future point in time
+To be released at some future point in time
 
 Description
 
@@ -21,34 +21,7 @@ Description
 - Fine grain build support for GPUs
 - Update Torch to 2.1.0, Tensorflow to 2.15.0
 - Better error messages in build process
-
-Detailed Notes
-
-- The RedisAIBuilder class was completely overhauled to allow users to
-  express a wider range of support for hardware/software stacks. This
-  will be extended to support ROCm, CUDA-11, and CUDA-12.
-- Versions for each of these packages are no longer specified in an
-  internal class. Instead a default set of JSON files specifies the
-  sources and versions. Users can specify their own custom specifications
-  at smart build time
-- Two new Dockerfiles are now provided (one each for 11.8 and 12.1) that
-  can be used to build a container to run the tutorials. No HPC support
-  should be expected at this time
-- SmartSim can now be built using Cuda version 11.8 or Cuda 12.1 by specify
-  `smart build --device=cuda118` or `smart build --device=cuda121`. The
-  original `smart build --device=gpu` will default to using Cuda 11.8.
-- As a result of the previous change, SmartSim now requires C++17 and a
-  minimum Cuda version of 11.8 in order to build Torch 2.1.0.
-- Error messages were not being interpolated correctly. This has been
-  addressed to provide more context when exposing error messages to users.
-
-### Development branch
-
-To be released at some future point in time
-
-Description
-
-- Allow specifying Model and Ensemble parameters with
+- Allow specifying Model and Ensemble parameters with
   number-like types (e.g. numpy types)
 - Pin watchdog to 4.x
 - Update codecov to 4.5.0
@@ -66,9 +39,28 @@ Description
 
 Detailed Notes
 
-- The serializer would fail if a parameter for a Model or Ensemble
-  was specified as a numpy dtype. The constructors for these
-  methods now validate that the input is number-like and convert
+- The RedisAIBuilder class was completely overhauled to allow users to
+  express a wider range of support for hardware/software stacks. This
+  will be extended to support ROCm, CUDA-11, and CUDA-12.
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- Versions for each of these packages are no longer specified in an
+  internal class. Instead a default set of JSON files specifies the
+  sources and versions. Users can specify their own custom specifications
+  at smart build time
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- Two new Dockerfiles are now provided (one each for 11.8 and 12.1) that
+  can be used to build a container to run the tutorials. No HPC support
+  should be expected at this time
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- As a result of the previous change, SmartSim now requires C++17 and a
+  minimum Cuda version of 11.8 in order to build Torch 2.1.0.
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- Error messages were not being interpolated correctly. This has been
+  addressed to provide more context when exposing error messages to users.
+  ([SmartSim-PR669](https://github.com/CrayLabs/SmartSim/pull/669))
+- The serializer would fail if a parameter for a Model or Ensemble
+  was specified as a numpy dtype. The constructors for these
+  methods now validate that the input is number-like and convert
   them to strings
   ([SmartSim-PR676](https://github.com/CrayLabs/SmartSim/pull/676))
 - Pin watchdog to 4.x because v5 introduces new types and requires
diff --git a/doc/installation_instructions/platform/frontier.rst b/doc/installation_instructions/platform/frontier.rst
index d4db76a6d..996688fc7 100644
--- a/doc/installation_instructions/platform/frontier.rst
+++ b/doc/installation_instructions/platform/frontier.rst
@@ -7,8 +7,9 @@ Known limitations
 We are continually working on getting all the features of SmartSim working on
 Frontier, however we do have some known limitations:
 
-* For now, only Torch and ONNX runtime models are supported. If you need
-  Tensorflow support please contact us
+* For now, only Torch models are supported. If you need Tensorflow or ONNX
+  support please contact us
+* All SmartSim experiments must be run from Lustre, _not_ your home directory
 * The colocated database will fail without specifying ``custom_pinning``. This
   is because the default pinning assumes that processor 0 is available, but the
   'low-noise' default on Frontier reserves the processor on each NUMA node.
@@ -30,22 +31,28 @@ these instructions, being sure to set the following variables
 .. code:: bash
 
     export PROJECT_NAME=CHANGE_ME
-    export VENV_NAME=CHANGE_ME
 
 **Step 1:** Create and activate a virtual environment for SmartSim:
 
 .. code:: bash
 
-    module load PrgEnv-gnu cray-python
-    module load rocm/6.1.3
+    module load PrgEnv-gnu miniforge3 rocm/6.1.3
 
     export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
-    export VENV_HOME=$SCRATCH/$VENV_NAME/
+    conda create -n smartsim python=3.11
+    conda activate smartsim
 
-    python3 -m venv $VENV_HOME
-    source $VENV_HOME/bin/activate
+**Step 1 (Optional):** If this is your first time using miniforge on
+Frontier you may also have to execute the following before being able
+to activate the ``smartsim`` environment
 
-**Step 2:** Install SmartSim in the conda environment:
+.. code:: bash
+
+    conda init
+    source ~/.bashrc
+    conda activate smartsim
+
+**Step 2:** Build the SmartRedis C++ and Fortran libraries:
 
 .. code:: bash
 
@@ -55,17 +62,20 @@ these instructions, being sure to set the following variables
     make lib-with-fortran
     pip install .
 
-    # Download SmartSim and site-specific files
+**Step 3:** Install SmartSim in the conda environment:
+
+.. code:: bash
+
     cd $SCRATCH
     pip install git+https://github.com/CrayLabs/SmartSim.git
 
-**Step 3:** Build Redis, RedisAI, the backends, and all the Python packages:
+**Step 4:** Build Redis, RedisAI, the backends, and all the Python packages:
 
 .. code:: bash
 
     smart build --device=rocm-6
 
-**Step 4:** Check that SmartSim has been installed and built correctly:
+**Step 5:** Check that SmartSim has been installed and built correctly:
 
 .. code:: bash
 
@@ -89,12 +99,11 @@ build, and some variables should be set to optimize performance:
 
     # Set these to the same values that were used for install
     export PROJECT_NAME=CHANGE_ME
-    export VENV_NAME=CHANGE_ME
 
 .. code:: bash
 
-    module load PrgEnv-gnu
-    module load rocm/6.1.3
+    module load PrgEnv-gnu miniforge3 rocm/6.1.3
+    conda activate smartsim
 
     # Optimizations for inference
     export SCRATCH=/lustre/orion/$PROJECT_NAME/scratch/$USER/
@@ -102,8 +111,6 @@ build, and some variables should be set to optimize performance:
     export MIOPEN_SYSTEM_DB_PATH=$MIOPEN_USER_DB_PATH
     mkdir -p $MIOPEN_USER_DB_PATH
     export MIOPEN_DISABLE_CACHE=1
-    export VENV_HOME=$SCRATCH/$VENV_NAME/
-    source $VENV_HOME/bin/activate
 
 Binding DBs to Slingshot
 ------------------------
diff --git a/doc/installation_instructions/platform/perlmutter.rst b/doc/installation_instructions/platform/perlmutter.rst
index 6d1e22e1e..71f97a4dc 100644
--- a/doc/installation_instructions/platform/perlmutter.rst
+++ b/doc/installation_instructions/platform/perlmutter.rst
@@ -10,24 +10,33 @@ To install SmartSim on Perlmutter, follow these steps:
 
 .. code:: bash
 
-    module load conda
+    module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12 PrgEnv-gnu
     conda create -n smartsim python=3.11
     conda activate smartsim
 
-**Step 2:** Install SmartSim in the conda environment:
+**Step 2:** Build the SmartRedis C++ and Fortran libraries:
+
+.. code:: bash
+
+    git clone https://github.com/CrayLabs/SmartRedis.git
+    cd SmartRedis
+    make lib-with-fortran
+    pip install .
+    cd ..
+
+**Step 3:** Install SmartSim in the conda environment:
 
 .. code:: bash
 
     pip install git+https://github.com/CrayLabs/SmartSim.git
 
-**Step 3:** Build Redis, RedisAI, the backends, and all the Python packages:
+**Step 4:** Build Redis, RedisAI, the backends, and all the Python packages:
 
 .. code:: bash
 
-    module load cudatoolkit/12.2 cudnn/8.9.3_cuda12
     smart build --device=cuda-12
 
-**Step 4:** Check that SmartSim has been installed and built correctly:
+**Step 5:** Check that SmartSim has been installed and built correctly:
 
 .. code:: bash
 
@@ -51,5 +60,5 @@ can reload the conda environment by running the following commands:
 
 .. code:: bash
 
-    module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12
+    module load conda cudatoolkit/12.2 cudnn/8.9.3_cuda12 PrgEnv-gnu
     conda activate smartsim