compile v4.1.7 with CUDA support broken #13005

Open
davidhoover opened this issue Dec 30, 2024 · 4 comments

@davidhoover

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.7

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From tarball openmpi-4.1.7.tar.bz2

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: rocky8
  • Computer hardware: intel x2695
  • Network type: infiniband

Details of the problem

I am attempting to compile Open MPI v4.1.7 with CUDA v11.8 support, configured like this:

./configure '--prefix=/lscratch/43997378/openmpi/4.1.7/CUDA-11.8/pmix-3.2.3/ucx-1.17.0/gcc-8.5.0' '--enable-shared' '--enable-static' '--without-verbs' '--without-mxm' '--enable-orterun-prefix-by-default' '--enable-mpi-cxx' '--with-libevent=/usr/local/libevent/libevent-2.1.12/gcc-11.3.0' '--with-ucx=/usr/local/ucx/1.17.0-nocuda-mofed4.9-6/gcc-8.5.0' '--with-pmix=/usr/local/apps/PMIx/pmix-3.2.3' '--with-slurm' '--with-cuda=/usr/local/CUDA/11.8.0' 'CC=/usr/local/GCC/8.5.0/bin/gcc' 'CXX=/usr/local/GCC/8.5.0/bin/g++' 'CXXFLAGS=-fabi-version=13 -fabi-compat-version=2 -fpermissive' 'FC=/usr/local/GCC/8.5.0/bin/gfortran' 'CPPFLAGS=    -I/usr/local/libevent/libevent-2.1.12/gcc-11.3.0/include' --cache-file=/dev/null --srcdir=. --disable-option-checking

This results in the following error:

make[2]: Entering directory '/usr/local/src/openmpi/openmpi-4.1.7/opal/mca/common/cuda'
  CC       common_cuda.lo
  LN_S     libmca_common_cuda.la
common_cuda.c: In function ‘mca_common_cuda_get_primary_context’:
common_cuda.c:1825:21: error: ‘cudaFunctionTable_t’ {aka ‘struct cudaFunctionTable’} has no member named ‘cuDevicePrimaryCtxGetState’
     result =  cuFunc.cuDevicePrimaryCtxGetState(dev_id, &flags, &active);
                     ^
common_cuda.c:1831:24: error: ‘cudaFunctionTable_t’ {aka ‘struct cudaFunctionTable’} has no member named ‘cuDevicePrimaryCtxRetain’
         result = cuFunc.cuDevicePrimaryCtxRetain(pctx, dev_id);
                        ^
make[2]: *** [Makefile:1948: common_cuda.lo] Error 1
make[2]: Leaving directory '/usr/local/src/openmpi/openmpi-4.1.7/opal/mca/common/cuda'
make[1]: *** [Makefile:2387: all-recursive] Error 1
make[1]: Leaving directory '/usr/local/src/openmpi/openmpi-4.1.7/opal'
make: *** [Makefile:1905: all-recursive] Error 1

Please note that this does not happen with v4.1.6, so something must have changed in opal/mca/common/cuda/common_cuda.c between v4.1.6 and v4.1.7.

Has anyone else seen this?

Thanks, David

@ohlmann

ohlmann commented Jan 2, 2025

I have seen the same issue. I guess that this was introduced in a697a27. A possible fix would probably be to fence lines 1821 to 1835 with #if OPAL_CUDA_VMM_SUPPORT.

@bosilca
Member

bosilca commented Jan 2, 2025

You are correct. The following patch should fix the issue:

diff --git a/opal/mca/common/cuda/common_cuda.c b/opal/mca/common/cuda/common_cuda.c
index b8ce5a7bea..ab5177fe7f 100644
--- a/opal/mca/common/cuda/common_cuda.c
+++ b/opal/mca/common/cuda/common_cuda.c
@@ -1818,6 +1818,7 @@ static int mca_common_cuda_check_mpool(CUdeviceptr dbuf, CUmemorytype *mem_type,
 
 static int mca_common_cuda_get_primary_context(CUdevice dev_id, CUcontext *pctx)
 {
+#if OPAL_CUDA_VMM_SUPPORT
     CUresult result;
     unsigned int flags;
     int active;
@@ -1831,7 +1832,7 @@ static int mca_common_cuda_get_primary_context(CUdevice dev_id, CUcontext *pctx)
         result = cuFunc.cuDevicePrimaryCtxRetain(pctx, dev_id);
         return OPAL_SUCCESS;
     }
-
+#endif  /* OPAL_CUDA_VMM_SUPPORT */
     return OPAL_ERROR;
 }
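
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of why the build fails and why the fence fixes it. It is not Open MPI code: every `demo_` / `DEMO_` name is hypothetical, `DEMO_VMM_SUPPORT` merely stands in for `OPAL_CUDA_VMM_SUPPORT`, and it assumes (as the comments above suggest) that the two primary-context members exist in the function table only when that macro is enabled. The error handling shown is also an illustration, not a copy of the real function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature macro standing in for OPAL_CUDA_VMM_SUPPORT.
 * In a real build the configure step would set it to 0 or 1. */
#ifndef DEMO_VMM_SUPPORT
#define DEMO_VMM_SUPPORT 0
#endif

#define DEMO_SUCCESS 0
#define DEMO_ERROR   (-1)

/* Function table whose primary-context members are declared only when
 * the feature is enabled (an assumption about the layout, for illustration). */
typedef struct demo_function_table {
    int (*get_device_count)(int *count);
#if DEMO_VMM_SUPPORT
    int (*primary_ctx_get_state)(int dev_id, unsigned int *flags, int *active);
    int (*primary_ctx_retain)(void **pctx, int dev_id);
#endif
} demo_function_table_t;

/* Zero-initialized table; a real program would fill in the pointers. */
demo_function_table_t demo_funcs;

/* Every use of the optional members sits behind the same guard as their
 * declaration. Dropping the #if below while DEMO_VMM_SUPPORT is 0 would
 * produce a "has no member named ..." compile error like the one reported. */
int demo_get_primary_context(int dev_id, void **pctx)
{
    assert(pctx != NULL);
#if DEMO_VMM_SUPPORT
    unsigned int flags;
    int active;

    if (demo_funcs.primary_ctx_get_state(dev_id, &flags, &active) != 0) {
        return DEMO_ERROR;
    }
    if (active) {
        /* Retain the primary context only if it is already active. */
        if (demo_funcs.primary_ctx_retain(pctx, dev_id) != 0) {
            return DEMO_ERROR;
        }
        return DEMO_SUCCESS;
    }
#else
    /* Feature disabled: the parameters are intentionally unused. */
    (void) dev_id;
    (void) pctx;
#endif
    return DEMO_ERROR;
}
```

With the macro left at 0, the function still compiles and simply reports an error, which mirrors the patched behavior when the feature is unavailable. Building with `-DDEMO_VMM_SUPPORT=1` compiles the guarded body, which then needs the table's function pointers filled in before it is called.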
 

@davidhoover
Author

Beauty, that worked, cheers!

@jsquyres jsquyres removed their assignment Jan 3, 2025
@jsquyres jsquyres added this to the v4.1.7 milestone Jan 3, 2025
@jsquyres
Member

jsquyres commented Jan 3, 2025

@bosilca I assume you'll submit a PR to fix this? 😄

bosilca added a commit to bosilca/ompi that referenced this issue Jan 3, 2025
Fixes open-mpi#13005.

bot:notacherrypick

Signed-off-by: George Bosilca <[email protected]>