(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on polaris #6422
Comments
I just opened a quick follow-up PR to clean up some deprecated-code issues (they did not cause failures in E3SM because deprecated code in Kokkos was allowed). I can add a fix for this in that PR.
Fine with me. Just include the change to make the muller-gpu file the same as pm-gpu.
…ASYNC=OFF' into next (PR #6423)

After #6101, which brings in Kokkos 4.2, we see a runtime error with a test like:
ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu
The error looks like:
0: (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97
Unfortunately, the tests hitting this error are also hanging.
A fix that seems to work is to add this build flag:
-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF
The fix is merged in another PR for pm-gpu, so this PR just makes the same change to muller-gpu.

Fixes #6422
Just to note that the same issue shows up running a recent E3SM/SCREAM version on ALCF's Polaris:
Adding |
Yes, there is a Polaris setup in master, but at a minimum it fails with this error:
Changes are needed in:
This file should also be included:
I recall @amametjanov was handling Polaris changes earlier. I could help with further testing.
@amametjanov re the last item there (file in eamxx/cmake/machine-files): that's also what's needed for anvil, jfyi
Title changed from "(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on pm-gpu after kokkos 4.2 PR" to "(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on polaris"
Update modules on alcf polaris. Also,
- update queues
- use cray wrappers for serial-gnu
- update cmake for gnugpu builds
- add eamxx cmake machine file
- run small eam and mpas-o cases on 1 polaris node
- add MOAB_ROOT env-var

Fixes #6422
[BFB]
OK to close? Or create another issue specific to polaris?
Will be automatically closed when #6985 is merged.
After #6101, which brings in Kokkos 4.2, we see a runtime error with a test like:
ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu
(also on a similar machine: ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.muller-gpu_gnugpu)
A fix that seems to work is to add this build flag:
Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF
This will also be needed for the scream repo.
@bartgol @mahf708
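For reference, a minimal sketch of how the flag above could be set from a CMake machine file or cache script rather than on the command line. Only the option name Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC comes from this issue. The file it would live in, the description string, and the use of FORCE are assumptions for illustration, not the actual change made in E3SM.

```cmake
# Hypothetical machine-file fragment (the containing file and its layout are assumptions).
# Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC is the Kokkos CMake option named in this issue.
# With it OFF, the Kokkos CUDA backend does not allocate through cudaMallocAsync,
# which is the behavior that appears to trigger the cuIpcGetMemHandle error here.
set(Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC OFF CACHE BOOL
    "Disable cudaMallocAsync allocations in Kokkos" FORCE)
```

This should be equivalent to passing -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF at configure time, provided it is set before Kokkos is configured.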