(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on polaris #6422

Open
ndkeen opened this issue May 16, 2024 · 8 comments · Fixed by #6423 · May be fixed by #6985
Assignees
Labels
kokkos Machine Files pm-gpu Perlmutter machine at NERSC (GPU nodes)

Comments

@ndkeen
Contributor

ndkeen commented May 16, 2024

After #6101, which brings in Kokkos 4.2, we see a runtime error with a test like:
ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu
(also on the similar machine muller-gpu: ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.muller-gpu_gnugpu)

0: (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97

A fix that seems to work is to add this build flag:
Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF
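
For reference, a minimal sketch of what turning the option off can look like in a CMake script that runs before Kokkos is configured. The placement and the description string are assumptions for illustration; only the option name and value come from this issue.

```cmake
# Sketch (assumed placement: before Kokkos is added to the build).
# Intended to have the same effect as passing
#   -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF
# on the cmake command line.
set(Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC OFF CACHE BOOL
    "Disable the cudaMallocAsync allocation path in Kokkos (illustrative description)" FORCE)
```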

This will also be needed for the scream repo.
@bartgol @mahf708

@bartgol
Contributor

bartgol commented May 16, 2024

I just opened a quick follow-up PR to clean up some deprecated-code issues (they did not cause failures in E3SM because deprecated code in Kokkos was allowed).

I can add a fix for this in that PR.

@ndkeen
Contributor Author

ndkeen commented May 16, 2024

Fine with me. Just include the change that makes the muller-gpu file the same as the pm-gpu one.

ndkeen added a commit that referenced this issue May 20, 2024
…ASYNC=OFF' into next (PR #6423)

After #6101 which brings in kokkos 4.2, we see runtime error with a test like:
ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu

hits runtime error like:

0: (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97
unfortunately, the tests hitting this error are also hanging...

A fix that seems to work is to add this build flag:
-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF

The fix is merged in another PR for pm-gpu, so this PR just makes same change to muller-gpu.

Fixes #6422
@gsever

gsever commented Feb 4, 2025

Just to note that the same issue shows up when running a recent E3SM/SCREAM version on ALCF's Polaris:

(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 148

Adding -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF in ./cime_config/machines/cmake_macros/gnugpu_polaris.cmake fixes the runtime error as posted here.
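
A sketch of the kind of line this adds to the macro file, assuming the file collects Kokkos configure flags in a variable named `KOKKOS_OPTIONS` (that name is an assumption here, not confirmed in this thread):

```cmake
# gnugpu_polaris.cmake (illustrative excerpt)
# KOKKOS_OPTIONS is an assumed variable name; adjust it to whatever
# the macro file actually uses to pass options to the Kokkos build.
string(APPEND KOKKOS_OPTIONS " -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF")
```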

@mahf708
Contributor

mahf708 commented Feb 4, 2025

Do we have a polaris machine entry? reopening to ensure it is added there

edit: looks like we do, so we should add this there.

@gsever could you issue a PR? Or @rljacob can advise who's the machine POC for polaris

@mahf708 mahf708 reopened this Feb 4, 2025
@gsever

gsever commented Feb 4, 2025

Yes, there is a Polaris setup in master, but at a minimum it fails with this error:

ERROR: module command /usr/share/lmod/lmod/libexec/lmod python load cmake/3.23.2 craype-x86-rome PrgEnv-gnu/8.3.3 failed with message:
Lmod has detected the following error: The following module(s) are unknown: "PrgEnv-gnu/8.3.3" "cmake/3.23.2"

Changes are needed in:

./cime_config/machines/config_machines.xml
./cime_config/machines/config_batch.xml
./cime_config/machines/cmake_macros/gnugpu_polaris.cmake
./externals/ekat/extern/kokkos/bin/nvcc_wrapper

This file should also be included:
./components/eamxx/cmake/machine-files/polaris.cmake

I recall @amametjanov was handling Polaris changes earlier. I could help with further testing.

@mahf708
Contributor

mahf708 commented Feb 4, 2025

@amametjanov re the last item there (file in eamxx/cmake/machine-files): that's also what's needed for anvil jfyi

@mahf708 mahf708 changed the title from "(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on pm-gpu after kokkos 4.2 PR" to "(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on polaris" Feb 4, 2025
@amametjanov amametjanov linked a pull request Feb 11, 2025 that will close this issue
amametjanov added a commit that referenced this issue Feb 12, 2025
Update modules on alcf polaris. Also,
- update queues
- use cray wrappers for serial-gnu
- update cmake for gnugpu builds
- add eamxx cmake machine file
- run small eam and mpas-o cases on 1 polaris node
- add MOAB_ROOT env-var

Fixes #6422

[BFB]
@ndkeen
Contributor Author

ndkeen commented Feb 25, 2025

ok to close? or create another issue specific to polaris?

@mahf708
Contributor

mahf708 commented Feb 26, 2025

Will be automatically closed when #6985 is merged

5 participants