Extend smart build to CUDA-11, CUDA-12, and ROCm (#669)

- The RedisAIBuilder class was completely overhauled to allow users to express a wider range of support for hardware/software stacks. This will be extended to support ROCm, CUDA-11, and CUDA-12.
- Versions for each of these packages are no longer specified in an internal class. Instead a default set of JSON files specifies the sources and versions. Users can specify their own custom specifications at smart build time

[ committed by @ashao ]
[ reviewed by @MattToast @juliaputko ]

Co-authored-by: Matt Drozt <[email protected]>
Co-authored-by: Julia Putko <[email protected]>
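
The changelog entry in this commit states that `smart build` accepts `--device=cuda118` and `--device=cuda121`, and that the original `--device=gpu` defaults to CUDA 11.8. The sketch below illustrates that alias rule only. The names `resolve_device` and `cuda_version` are hypothetical and are not SmartSim's implementation; ROCm device names are left out because the commit text does not state them.

```python
from typing import Optional

# Hypothetical illustration only: these helpers are NOT part of SmartSim.
# They mirror the device rule described in the changelog entry.
_ALIASES = {"gpu": "cuda118"}  # changelog: "gpu" defaults to CUDA 11.8
_KNOWN_DEVICES = {"cpu", "cuda118", "cuda121"}


def resolve_device(device: str) -> str:
    """Map a --device value to a concrete device name."""
    name = device.strip().lower()
    name = _ALIASES.get(name, name)
    if name not in _KNOWN_DEVICES:
        accepted = sorted(_KNOWN_DEVICES | set(_ALIASES))
        raise ValueError(f"Unsupported device {device!r}; expected one of {accepted}")
    return name


def cuda_version(device: str) -> Optional[str]:
    """Return the CUDA version implied by a device name, or None for CPU."""
    name = resolve_device(device)
    if not name.startswith("cuda"):
        return None
    digits = name[len("cuda"):]  # e.g. "118"
    # The last digit is the minor version, the rest is the major version.
    return f"{digits[:-1]}.{digits[-1]}"
```

Under these assumptions, `cuda_version("gpu")` and `cuda_version("cuda118")` resolve to the same CUDA version, which is the behavior the changelog describes for the legacy `gpu` value.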
CrayLabs · Sep 19, 2024 · 5fb8eb4
1 parent 72be515
commit 5fb8eb4
Showing 51 changed files with 2,534 additions and 1,970 deletions.
diff --git a/.github/workflows/run_tests.yml b/.github/workflows/run_tests.yml
@@ -49,7 +49,7 @@ env:
 
 jobs:
   run_tests:
-    name: Run tests ${{ matrix.subset }} with ${{ matrix.os }}, Python ${{ matrix.py_v}}, RedisAI ${{ matrix.rai }}
+    name: Run tests ${{ matrix.subset }} with ${{ matrix.os }}, Python ${{ matrix.py_v}}
     runs-on: ${{ matrix.os }}
     strategy:
       fail-fast: false
@@ -63,9 +63,6 @@ jobs:
           - os: macos-14
             py_v: "3.9"
 
-    env:
-      SMARTSIM_REDISAI: ${{ matrix.rai }}
-
     steps:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
@@ -109,15 +106,10 @@ jobs:
       - name: Install SmartSim (with ML backends)
         run: |
           python -m pip install git+https://github.com/CrayLabs/SmartRedis.git@develop#egg=smartredis
-          python -m pip install .[dev,mypy,ml]
-
-      - name: Install ML Runtimes with Smart (with pt, tf, and onnx support)
-        if: contains( matrix.os, 'ubuntu' ) || contains( matrix.os, 'macos-12')
-        run: smart build --device cpu --onnx -v
+          python -m pip install .[dev,mypy]
 
-      - name: Install ML Runtimes with Smart (no ONNX,TF on Apple Silicon)
-        if: contains( matrix.os, 'macos-14' )
-        run: smart build --device cpu --no_tf -v
+      - name: Install ML Runtimes
+        run: smart build --device cpu -v
 
       - name: Run mypy
         run: |

diff --git a/.gitignore b/.gitignore
@@ -12,6 +12,7 @@ tests/test_output
 # Dependencies
 smartsim/_core/.third-party
 smartsim/_core/.dragon
+smartsim/_core/build
 
 # Docs
 _build

diff --git a/README.md b/README.md
@@ -643,11 +643,11 @@ from C, C++, Fortran and Python with the SmartRedis Clients:
     <tr>
       <td rowspan="3">1.2.7</td>
       <td>PyTorch</td>
-      <td>2.0.1</td>
+      <td>2.1.0</td>
     </tr>
     <tr>
       <td>TensorFlow\Keras</td>
-      <td>2.13.1</td>
+      <td>2.15.0</td>
     </tr>
     <tr>
       <td>ONNX</td>

diff --git a/doc/changelog.md b/doc/changelog.md
@@ -9,6 +9,39 @@ Jump to:
 
 ## SmartSim
 
+###  Cuda 12 and ROCm support branch
+
+To be merged into `develop` at some future point in time
+
+Description
+
+- Refactor to the RedisAI build to allow more flexibility in versions
+  and sources of ML backends
+- Add Dockerfiles with GPU support
+- Fine-grained build support for GPUs
+- Update Torch to 2.1.0, TensorFlow to 2.15.0
+- Better error messages in build process
+
+Detailed Notes
+
+- The RedisAIBuilder class was completely overhauled to allow users to
+  express a wider range of support for hardware/software stacks. This 
+  will be extended to support ROCm, CUDA-11, and CUDA-12.
+- Versions for each of these packages are no longer specified in an
+  internal class. Instead a default set of JSON files specifies the
+  sources and versions. Users can specify their own custom specifications
+  at smart build time
+- Two new Dockerfiles are now provided (one each for 11.8 and 12.1) that
+  can be used to build a container to run the tutorials. No HPC support
+  should be expected at this time
+- SmartSim can now be built using Cuda version 11.8 or Cuda 12.1 by specifying
+  `smart build --device=cuda118` or `smart build --device=cuda121`. The
+  original `smart build --device=gpu` will default to using Cuda 11.8.
+- As a result of the previous change, SmartSim now requires C++17 and a
+  minimum Cuda version of 11.8 in order to build Torch 2.1.0.
+- Error messages were not being interpolated correctly. This has been
+  addressed to provide more context when exposing error messages to users.
+
 ### Development branch
 
 To be released at some future point in time