Draft of ORT GPU build #5622

ChSonnabend · 2024-09-17T11:13:07Z

This is a draft PR to discuss possible changes to onnxruntime.sh for GPU builds on the EPNs and potentially with CUDA (to be tested).

ChSonnabend · 2024-09-17T11:13:28Z

Ping @davidrohr

davidrohr

You should also pick up a few of the environment variables that we use in o2.sh and handle them analogously: https://github.com/alisw/alidist/blob/1916f6d88d42959097998d9481b517dc1c1ea84d/o2.sh#L191C9-L191C30

ALIBUILD_O2_FORCE_GPU
DISABLE_GPU
ALIBUILD_ENABLE_CUDA
ALIBUILD_ENABLE_HIP
ALIBUILD_O2_OVERRIDE_HIP_ARCHS
ALIBUILD_O2_OVERRIDE_CUDA_ARCHS

If ENABLE_CUDA or ENABLE_HIP is set, the build should fail when it cannot build CUDA/HIP.
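A minimal sketch of the "fail if explicitly requested" rule, in case it helps the discussion. The helper name `ort_backend_decision` is hypothetical, and letting `DISABLE_GPU=1` win over an explicit enable flag is an assumption of this sketch, not something settled in the thread.

```bash
# Hypothetical helper, not part of the recipe.
# $1: "1" if the backend was requested explicitly (ALIBUILD_ENABLE_CUDA,
#     ALIBUILD_ENABLE_HIP or ALIBUILD_O2_FORCE_GPU), anything else otherwise
# $2: "1" if the toolchain was detected on the build host
# Prints 1 (build the backend) or 0 (skip it); returns non-zero when the
# backend was requested but cannot be built.
ort_backend_decision() {
  if [ "${DISABLE_GPU:-0}" = "1" ]; then
    echo 0
    return 0
  fi
  if [ "${1:-0}" = "1" ] && [ "${2:-0}" != "1" ]; then
    echo "GPU backend requested but toolchain not found" >&2
    return 1
  fi
  if [ "${2:-0}" = "1" ]; then echo 1; else echo 0; fi
}

# Example: CUDA requested via the environment, nvcc probed at build time.
cuda_found=0
command -v nvcc >/dev/null 2>&1 && cuda_found=1
ORT_CUDA_BUILD=$(ort_backend_decision "${ALIBUILD_ENABLE_CUDA:-0}" "$cuda_found") \
  || echo "would abort the build here"
```

In a real recipe the last line would presumably `exit 1` instead of printing a message.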

davidrohr · 2024-09-17T11:14:55Z

onnxruntime.sh

+                      "
+    elif command -v nvcc >/dev/null 2>&1; then
+      CUDA_VERSION=$(nvcc --version | grep "release" | awk '{print $NF}' | cut -d. -f1)
+      if [[ "$CUDA_VERSION" == "V11" ]]; then


I think you can drop CUDA 11 and only assume >=12.
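One possible way to check for ">=12" numerically rather than comparing against the string "V11" or "V12". The function name `cuda_major_from_nvcc` is hypothetical, and the sample line only illustrates the usual shape of the `nvcc --version` release line.

```bash
# Extract the CUDA major version from `nvcc --version` output read on stdin.
cuda_major_from_nvcc() {
  sed -n 's/.*release \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# In a recipe this would be: nvcc --version | cuda_major_from_nvcc
# Here a sample release line is used so the sketch runs without CUDA installed.
sample='Cuda compilation tools, release 12.4, V12.4.131'
major=$(printf '%s\n' "$sample" | cuda_major_from_nvcc)

if [ -n "$major" ] && [ "$major" -ge 12 ]; then
  echo "CUDA $major is supported"
else
  echo "CUDA >= 12 required" >&2
fi
```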

davidrohr · 2024-09-17T11:15:37Z

onnxruntime.sh

+ORT_BUILD_FLAGS=""
+case $ARCHITECTURE in
+  osx_*)
+    if [[ $ARCHITECTURE == *_x86-64 ]]; then


I would leave out printouts like these; they are mainly for debugging.

Yes, but I assume there is also a macOS build that targets the Mac GPU. I need to look around a bit more; if so, the if block could be used to hold the build flags for it. But yes, I will of course remove the printouts at the end.

davidrohr · 2024-09-17T11:21:35Z

onnxruntime.sh

+    fi
+  ;;
+  *)
+    if command -v rocminfo >/dev/null 2>&1; then


The ROCm version check is missing.

It is not clear whether rocminfo is in the PATH. You should at least test /opt/rocm/bin/rocminfo. Also, MIGraphX is a separate ROCm package, so the presence of rocminfo does not mean that MIGraphX is present. You should test for MIGraphX explicitly.

Good point, I'll check that again.
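A sketch of the two checks requested above. Both function names are hypothetical, and the assumption that MIGraphX installs a `libmigraphx*` file under `<rocm-root>/lib` should be verified against the actual package layout.

```bash
# Locate rocminfo: first in PATH, then under a ROCm prefix (default /opt/rocm).
# Prints the path found; returns non-zero if there is none.
find_rocminfo() {
  if command -v rocminfo >/dev/null 2>&1; then
    command -v rocminfo
  elif [ -x "${1:-/opt/rocm}/bin/rocminfo" ]; then
    echo "${1:-/opt/rocm}/bin/rocminfo"
  else
    return 1
  fi
}

# MIGraphX is a separate package, so look for its library explicitly.
# ASSUMPTION: the library is named libmigraphx* and lives in <root>/lib.
has_migraphx() {
  for f in "${1:-/opt/rocm}"/lib/libmigraphx*; do
    [ -e "$f" ] && return 0
  done
  return 1
}

if find_rocminfo >/dev/null && has_migraphx; then
  echo "ROCm and MIGraphX found"
else
  echo "ROCm and/or MIGraphX not found"
fi
```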

davidrohr · 2024-09-17T11:22:30Z

onnxruntime.sh

+        ORT_BUILD_FLAGS=" -Donnxruntime_USE_CUDA=ON                                                     \
+                          -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_ROOT                                            \
+                          -Donnxruntime_USE_CUDA_NHWC_OPS=ON                                            \
+                          -Donnxruntime_CUDA_USE_TENSORRT=ON                                            \


If you use TensorRT, do you need to check explicitly whether it is installed? Or does it always ship with the CUDA SDK?

It does not seem to come along automatically (https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html)... OK, I'll add a check for that as well.

davidrohr · 2024-09-17T11:22:33Z

onnxruntime.sh

+                          -Donnxruntime_USE_CUDA_NHWC_OPS=ON                                            \
+                          -Donnxruntime_CUDA_USE_TENSORRT=ON                                            \
+                          "
+      elif [[ "$CUDA_VERSION" == "V12" ]]; then


What if ROCm and CUDA are both present? Can't we build both then?

No, they cannot be built in parallel; it is always only one of the two: https://github.com/microsoft/onnxruntime/blob/afd642a194b39138ad891e7bb2c8bca26d37b785/cmake/CMakeLists.txt#L288-L290
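Going by the reply above, the recipe has to pick a single backend when both toolchains are installed. A minimal sketch of such a selection follows; the function name is hypothetical and preferring ROCm over CUDA is only an assumption made for illustration.

```bash
# $1: "1" if ROCm can be built, $2: "1" if CUDA can be built.
# ASSUMPTION: ROCm is preferred when both are present.
select_ort_backend() {
  if [ "${1:-0}" = "1" ]; then
    echo rocm
  elif [ "${2:-0}" = "1" ]; then
    echo cuda
  else
    echo cpu
  fi
}

backend=$(select_ort_backend 1 1)
echo "building ONNX Runtime with backend: $backend"
```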

ktf · 2024-09-17T14:48:41Z

Exactly...

davidrohr · 2024-10-04T12:40:16Z

onnxruntime.sh

+    export ORT_MIGRAPHX_BUILD=0
+fi
+### TensorRT
+if [ "$ORT_CUDA_BUILD" -eq 1 ] && [ $(find /opt/rocm* -name "libnvinfer*" -print -quit | wc -l 2>&1) -eq 1 ]; then


Why do you search for TensorRT in /opt/rocm?
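A sketch of a TensorRT probe that takes the search directories as arguments instead of hard-coding /opt/rocm. The function name is hypothetical, and the directories in the usage comment are guesses at typical install locations, not confirmed paths.

```bash
# Return success if a TensorRT runtime library (libnvinfer*) exists under
# any of the given directories. Directories that do not exist are skipped.
has_tensorrt() {
  for dir in "$@"; do
    [ -d "$dir" ] || continue
    if [ -n "$(find "$dir" -name 'libnvinfer*' -print 2>/dev/null | head -n 1)" ]; then
      return 0
    fi
  done
  return 1
}

# Possible use in the recipe (directories are assumptions):
#   if [ "$ORT_CUDA_BUILD" = "1" ] && has_tensorrt /usr/local/cuda /usr/lib64; then
#     export ORT_TENSORRT_BUILD=1
#   fi
```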

davidrohr · 2024-10-04T12:41:28Z

onnxruntime.sh

-      -DCMAKE_CXX_FLAGS="$CXXFLAGS -Wno-unknown-warning -Wno-unknown-warning-option -Wno-error=unused-but-set-variable -Wno-error=deprecated" \
-      -DCMAKE_C_FLAGS="$CFLAGS -Wno-unknown-warning -Wno-unknown-warning-option -Wno-error=unused-but-set-variable -Wno-error=deprecated"
+# Check ROCm build conditions
+if { [ "$ALIBUILD_O2_FORCE_GPU" -ne 0 ] || [ "$ALIBUILD_ENABLE_HIP" -ne 0 ] || command -v rocminfo >/dev/null 2>&1; } && \


Is it guaranteed that rocminfo is in the PATH? Perhaps try rocminfo first and then /opt/rocm/bin/rocminfo?
Also, perhaps the build needs HIP rather than rocminfo? In that case, check for /opt/rocm/bin/hipcc instead.

davidrohr · 2024-10-04T12:42:23Z

onnxruntime.sh

+      -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++                                                       \
+      -D__HIP_PLATFORM_AMD__=1                                                                              \
+      -DCMAKE_HIP_ARCHITECTURES=gfx906,gfx908                                                               \
+      ${ALIBUILD_O2_OVERRIDE_HIP_ARCHS:+-DCMAKE_HIP_ARCHITECTURES=${ALIBUILD_O2_OVERRIDE_HIP_ARCHS}}        \


I think there is also an ALIBUILD_O2_OVERRIDE_CUDA_ARCHS variable; could you check and use that as well?

davidrohr · 2024-10-04T12:43:43Z

onnxruntime.sh

+      -Donnxruntime_CUDA_HOME=/usr/local/cuda                                                               \
+      -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++                                                       \
+      -D__HIP_PLATFORM_AMD__=1                                                                              \
+      -DCMAKE_HIP_ARCHITECTURES=gfx906,gfx908                                                               \


I don't understand this. You first set CMAKE_HIP_ARCHITECTURES, and then you possibly override it in the next line? Why don't you expand ALIBUILD_O2_OVERRIDE_HIP_ARCHS to the defaults if empty? ${...:-default}?
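The `${...:-default}` form mentioned above expands to the override when it is set and non-empty, and to the default otherwise, so the flag only needs to be passed once. A small sketch, using the default architectures from the diff; the wrapper function `hip_arch_flag` is hypothetical.

```bash
# Emit the CMake flag once; the override wins when set and non-empty.
hip_arch_flag() {
  echo "-DCMAKE_HIP_ARCHITECTURES=${ALIBUILD_O2_OVERRIDE_HIP_ARCHS:-gfx906,gfx908}"
}

unset ALIBUILD_O2_OVERRIDE_HIP_ARCHS
hip_arch_flag   # uses the defaults gfx906,gfx908

ALIBUILD_O2_OVERRIDE_HIP_ARCHS=gfx90a
hip_arch_flag   # uses the override gfx90a
```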

davidrohr · 2024-11-04T10:37:30Z

onnxruntime.sh

+# Check CUDA build conditions
+if { [ "$ALIBUILD_O2_FORCE_GPU" -ne 0 ] || [ "$ALIBUILD_ENABLE_CUDA" -ne 0 ] || command -v nvcc >/dev/null 2>&1; } && \
+   { [ -z "$DISABLE_GPU" ] || [ "$DISABLE_GPU" -eq 0 ]; }; then
+    export ORT_CUDA_BUILD=1


I would also default ALIBUILD_O2_OVERRIDE_CUDA_ARCHS to the sm_86 or sm_89 architecture for now.

davidrohr

if [ "$ALIBUILD_O2_FORCE_GPU" -eq 1 ]
will not work if the variable is not defined.
You can either do [ "0$FOO" == "01" ] or use the bash [[ syntax. Please try with all variables undefined.
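To illustrate the point: with an unset variable, `[ "" -eq 1 ]` is a numeric comparison on an empty string, which makes `[` report an error instead of a clean true or false. Two string-based variants that behave well when the variable is undefined are sketched below; the function names are hypothetical, and `=` is used instead of `==` so the tests also work in plain sh.

```bash
# Variant 1: the prefix trick from the comment above.
is_forced_prefix() {
  [ "0$ALIBUILD_O2_FORCE_GPU" = "01" ]
}

# Variant 2: substitute a default before comparing.
is_forced_default() {
  [ "${ALIBUILD_O2_FORCE_GPU:-0}" = "1" ]
}

unset ALIBUILD_O2_FORCE_GPU
is_forced_prefix  || echo "undefined: not forced (prefix variant)"
is_forced_default || echo "undefined: not forced (default variant)"

ALIBUILD_O2_FORCE_GPU=1
is_forced_prefix  && echo "set to 1: forced (prefix variant)"
is_forced_default && echo "set to 1: forced (default variant)"
```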

ktf · 2024-11-29T14:36:32Z

Approving to start the CI.

ChSonnabend · 2024-12-02T21:52:49Z

The macOS failure seems unrelated to this PR. It can be merged from my side if there are no objections (@ktf).

ktf · 2024-12-19T10:27:29Z

onnxruntime.sh

+  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib
+else
+  export ORT_ROCM_BUILD=0
+fi


Suggested change

-fi
+mkdir -p $INSTALLROOT/etc
+cat << EOF > $INSTALLROOT/etc/ort-init.sh
+export ORT_ROCM_BUILD=$ORT_ROCM_BUILD
+EOF
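A note on how this suggestion behaves: because the heredoc delimiter `EOF` is unquoted, `$ORT_ROCM_BUILD` is expanded when the recipe runs, so the value decided at build time is written literally into ort-init.sh. A self-contained sketch follows; the temporary directory and the fallback value 0 are stand-ins so it runs outside aliBuild.

```bash
# Stand-ins so the sketch runs outside the recipe (assumptions, not recipe code).
INSTALLROOT="${INSTALLROOT:-$(mktemp -d)}"
ORT_ROCM_BUILD="${ORT_ROCM_BUILD:-0}"

mkdir -p "$INSTALLROOT/etc"
# Unquoted delimiter: the variable is expanded now, at build time.
cat << EOF > "$INSTALLROOT/etc/ort-init.sh"
export ORT_ROCM_BUILD=$ORT_ROCM_BUILD
EOF

# The generated file contains the literal value, not the variable name.
cat "$INSTALLROOT/etc/ort-init.sh"
```

Sourcing the generated file later restores the build-time setting in a runtime environment.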

ktf · 2024-12-19T10:31:15Z

onnxruntime.sh

@@ -45,7 +117,6 @@ mkdir -p "$INSTALLROOT/etc/modulefiles"
 MODULEFILE="$INSTALLROOT/etc/modulefiles/$PKGNAME"
 alibuild-generate-module --lib > "$MODULEFILE"
 cat >> "$MODULEFILE" <<EoF
-
 # Our environment
 set ${PKGNAME}_ROOT \$::env(BASEDIR)/$PKGNAME/\$version
 prepend-path ROOT_INCLUDE_PATH \$${PKGNAME}_ROOT/include/onnxruntime


Suggested change

-prepend-path ROOT_INCLUDE_PATH \$${PKGNAME}_ROOT/include/onnxruntime
+prepend-path ROOT_INCLUDE_PATH \$${PKGNAME}_ROOT/include/onnxruntime
+append-path LD_LIBRARY_PATH /opt/rocm/lib

ChSonnabend added 5 commits August 15, 2024 17:23

OnnxRuntime build on AMD GPU's

c80e36c

Merge branch 'alisw:master' into onnxruntime-gpu

4d42b1f

Modifying recipe for build on Nvidia GPU's (still needs testing)

23c3e5f

Updating ONNX build flags

78fcf07

Merge branch 'alisw:master' into onnxruntime-gpu

6e3689b

davidrohr reviewed Sep 17, 2024

ChSonnabend added 5 commits September 27, 2024 10:52

Merge branch 'master' into onnxruntime-gpu

0fa8a06

Updating version to 1.19.0

73b54ce

Adding automatic checks for migraphx, changing build cmake flags and adding env-variables for GPU enabling during code execution. For al9_gpu container and simultaneous CUDA & ROCm build, this requires ChSonnabend/onnxruntime@6ffc40c

5e42e46

Merge branch 'master' into onnxruntime-gpu

9e47dfd

This builds ORT with the GPU flags. Note: In the al9_gpu container the build with CUDA and ROCm fails due to a ROCm internal check for THRUST and CUB libraries, which are not in sync (file: /opt/rocm/include/thrust/system/cuda/config.h)

e90a9e5

davidrohr requested changes Oct 4, 2024

Adding comments and reshuffeling for better readibility

d7b089d

davidrohr reviewed Nov 4, 2024

ChSonnabend added 4 commits November 18, 2024 11:25

Adding checks for Cuda and ROCm libraries

d45d194

Updating to a recent version of ONNX

000afbb

Merge branch 'alisw:master' into onnxruntime-gpu

9613081

Changing to -eq 1

f28a341

ChSonnabend marked this pull request as ready for review November 22, 2024 21:43

ChSonnabend requested a review from a team as a code owner November 22, 2024 21:43

davidrohr requested changes Nov 22, 2024

ChSonnabend added 4 commits November 26, 2024 10:18

Changing to double-brace syntax

4813226

Changing to version 1.20 since 1.19 has issues for GPU execution

2713a61

Adding check for Alma9 (if system is AlmaLinux). exports's still need to passed to o2.sh

62ece4e

Adding compile flags for ONNXRuntime

ea219bf

ChSonnabend requested a review from a team as a code owner November 28, 2024 13:29

ChSonnabend added 2 commits November 29, 2024 13:58

Adding AlmaLinux 9 as general &&-check

ae214b5

Adding ORT_ROCM_BUILD check for CUDA build

6f3d636

ktf previously approved these changes Nov 29, 2024

Removing C/CXX flag for template-id-cdtor

f5fd93b

ChSonnabend dismissed ktf’s stale review via f5fd93b November 30, 2024 09:20

ktf self-requested a review December 1, 2024 12:52

ktf previously approved these changes Dec 1, 2024

Force disabling for alma linux distribution

5e0f296

ChSonnabend dismissed ktf’s stale review via 5e0f296 December 1, 2024 18:07

ktf reviewed Dec 19, 2024

Adding ORT variables to be available at build time (and LD_LIBRARY_PATH at runtime)

e01351a
