Can I specify a persistent CPU-buffer for "fallback" when using heterogeneous GPUDirect+TCP MPI? #9789
Unanswered
TysonRayJones asked this question in Q&A
Hello brilliant UCX team and community!
I'm developing a distributed heterogeneous GPU application where very large messages (e.g. 64 GiB, split up of course) are exchanged between GPUs. Some GPUs are connected with NVLink and can communicate via GPUDirect, while others are connected only indirectly through a CPU interconnect. My application requires multiple persistent arrays in both RAM and VRAM, which I re-use as MPI communication buffers. These persistent arrays almost fill the entirety of RAM and VRAM. I am using the latest OpenMPI, UCX and CUDA versions.

Currently, my inter-GPU communication is merely `MPI_Isend` and `MPI_Irecv` passing in CUDA device pointers. This works great when exchanging data between the NVLink'd GPUs; UCX uses GPUDirect and everything's lightning fast 🎉 However, using this method to exchange between the ethernet'd GPUs has issues:

- It is slower than manually copying the data to RAM (via `cudaMemcpy`), performing a host-to-host message, then a copy back to VRAM.
- It sometimes crashes.

My assumption is that this is due to UCX having to make its own RAM buffers in order to first route the exchanged data through the CPU. That's slower than my manual copy because it does this per message, and performs gratuitous mallocs avoided by my use of the persistent RAM arrays. The crash is possibly caused by these temporary buffers exceeding the remaining available RAM, which is being hogged by my persistent arrays.
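For reference, the manual fallback I'm comparing against looks roughly like this. It's only a minimal sketch: `d_buf`, `h_buf`, `CHUNK_ELEMS` and `peer` are illustrative names (not from my actual code), `h_buf` is one of my persistent RAM arrays, and error checking is omitted.

```c
/* Sketch of the manual CPU-staged exchange: copy a chunk of the persistent
 * device array to a persistent host array, do a host-to-host MPI exchange,
 * then copy the received chunk back to VRAM.
 * h_buf is assumed to be page-locked (cudaMallocHost / cudaHostRegister)
 * so the device<->host copies run at full bandwidth. */
#include <stddef.h>
#include <mpi.h>
#include <cuda_runtime.h>

void staged_exchange(double *d_buf, double *h_buf, size_t count, int peer)
{
    const size_t CHUNK_ELEMS = (size_t)1 << 27;   /* 1 GiB of doubles per chunk */

    for (size_t off = 0; off < count; off += CHUNK_ELEMS) {
        size_t n = (count - off < CHUNK_ELEMS) ? (count - off) : CHUNK_ELEMS;

        /* VRAM -> persistent RAM buffer */
        cudaMemcpy(h_buf, d_buf + off, n * sizeof(double), cudaMemcpyDeviceToHost);

        /* host-to-host exchange over TCP; the received chunk replaces h_buf */
        MPI_Sendrecv_replace(h_buf, (int)n, MPI_DOUBLE,
                             peer, 0, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* persistent RAM buffer -> VRAM */
        cudaMemcpy(d_buf + off, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
    }
}
```

In the GPUDirect case, the same exchange is just `MPI_Isend`/`MPI_Irecv` on `d_buf` directly, which is what I'd like to keep wherever it's actually faster.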
If my dubious understanding is correct, there seem to be at least two ways I can address these problems:

1. Tell UCX to stage its CPU-routed traffic through my existing persistent RAM arrays (or at least a single pre-allocated buffer), instead of allocating its own temporary buffers per message.
2. Determine at runtime which GPU pairs can actually use GPUDirect, and fall back to my own manual CPU-staged copy (through the persistent RAM arrays) for the remaining pairs.
I'm unsure whether either of these things is possible - obviously not through the message-passing interface itself, but I'm hoping I can hook into UCX at a lower level to determine or configure such things. I have fuzzy ideas about using `cudaDeviceCanAccessPeer()`, though I can't see how I could query GPUs across different machines, and I also don't want to rob UCX of the opportunity to use inter-GPU exchanges that are faster than my manual copying when permitted by non-P2P technologies.

Does anyone have any thoughts about my use-case, or ideas for a solution? I have made a Stack Overflow post about this problem, where I'm thinking more generally about CUDA-aware MPI, but I'm happy to build in a UCX-specific solution.
Thanks for reading!