Error returned from CUDA function in CUDA-aware MPI multi-GPU test #2522
Comments
Can you explain why you think this is a CUDA.jl issue? For one, you didn't share |
@maleadt, I shared the file:
using MPI
using CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
# select device
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)
gpu_id = CUDA.device!(rank_l)
# determine ring neighbours (destination and source ranks)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id), size=$size, dst=$dst, src=$src")
N = 4
send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))
CUDA.synchronize()
rank==0 && println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank_l: $recv_mesg")
rank==0 && println("done.")
I thought it was a CUDA.jl issue as the error message stated "Error returned from CUDA function". Do you think it comes from my |
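For reference, the ring exchange in the script above can be checked without MPI or a GPU. The following is a minimal plain-Julia sketch that simulates which values each rank should print after `MPI.Sendrecv!`. The helper names `ring_neighbors` and `simulate_ring` are hypothetical and not part of MPI.jl or CUDA.jl.

```julia
# Plain-Julia simulation of the ring exchange; no MPI or GPU required.
# `ring_neighbors` and `simulate_ring` are hypothetical helper names.

# Same neighbour arithmetic as in the script: send to rank+1, receive from rank-1.
ring_neighbors(rank, nranks) = (dst = mod(rank + 1, nranks), src = mod(rank - 1, nranks))

function simulate_ring(nranks; N = 4)
    # Rank r fills its send buffer with Float64(r), as fill!(send_mesg, Float64(rank)) does.
    send = [fill(Float64(r), N) for r in 0:nranks-1]
    recv = [Vector{Float64}(undef, N) for _ in 1:nranks]
    for r in 0:nranks-1
        nb = ring_neighbors(r, nranks)
        # Julia arrays are 1-based, ranks are 0-based, hence the +1 offsets.
        recv[nb.dst + 1] .= send[r + 1]
    end
    return recv
end

for (i, buf) in enumerate(simulate_ring(4))
    println("rank=$(i - 1) should receive $buf")
end
```

With a working CUDA-aware MPI, each rank's `recv_mesg` should hold `N` copies of its `src` rank, so a mismatch points at the transport rather than at the neighbour logic.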
Sorry, I glossed over the link.
CUDA.jl is a Julia interface to the CUDA library, which throws an error here. I'm not familiar with MPI, MPI.jl or Open MPI. Maybe try opening a post on Discourse where other (i.e. HPC) users can chime in, or ask on the relevant channels on Slack.
Well, it seems it was a module issue on the cluster I work on, and the test passes now. I'm closing the issue then. Sorry for the disturbance.
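The thread does not say which module was at fault. As a hedged sketch, a quick way to see what a given module environment actually provides is to print the MPI and CUDA configuration before running the test. This assumes a recent MPI.jl (0.20 or later, where `MPI.versioninfo()` is available). Note that `MPI.has_cuda()` can only detect CUDA support for MPI libraries that expose such a query, such as Open MPI, and may report `false` for others.

```julia
using MPI
using CUDA

MPI.Init()

# Which MPI library and version MPI.jl is bound to in this environment.
MPI.versioninfo()

# true only if the loaded MPI library reports CUDA support.
println("MPI.has_cuda() = ", MPI.has_cuda())

# CUDA driver, toolkit and visible devices as seen by CUDA.jl.
CUDA.versioninfo()
```

If `MPI.has_cuda()` prints `false` for an Open MPI build, passing a `CuArray` to `MPI.Sendrecv!` is likely to fail, and the loaded MPI module is the first thing to check.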
Thanks for the update!
Describe the bug
CUDA-aware MPI multi-GPU test (available here) fails by returning the following error message:
To reproduce
The Minimal Working Example (MWE) for this bug:
Manifest.toml
Expected behavior
Test on alltoall_test_cuda_multigpu.jl should pass
Version info
Details on Julia:
Details on CUDA:
Additional context
This is the Project.toml file I append to the JULIA_LOAD_PATH environment variable: