Can I specify a persistent CPU-buffer for "fallback" when using heterogeneous GPUDirect+TCP MPI? #9789
Unanswered
TysonRayJones asked this question in Q&A
Hello brilliant UCX team and community!
I'm developing a distributed heterogeneous GPU application where very large messages (e.g. 64 GiB, split up of course) are exchanged between GPUs. Some GPUs are connected with NVLink and can communicate via GPUDirect, while others are connected only indirectly through a CPU interconnect. My application requires multiple persistent arrays in both RAM and VRAM, which I re-use as MPI communication buffers. These persistent arrays almost fill the entirety of RAM and VRAM. I am using the latest OpenMPI, UCX and CUDA versions.

Currently, my inter-GPU communication is merely `MPI_Isend` and `MPI_Irecv` passing in CUDA device pointers. This works great when exchanging data between the NVLink'd GPUs; UCX uses GPUDirect and everything's lightning fast 🎉 However, using this method to exchange between the ethernet'd GPUs has issues:

- It is slower than manually copying the data to RAM (via `cudaMemcpy`), performing a host-to-host message, then a copy back to VRAM.
- It sometimes crashes.

My assumption is that this is due to UCX having to make its own RAM buffers in order to first route the exchanged data through the CPU. That's slower than my manual copy because it does this per message, and performs gratuitous mallocs avoided by my use of the persistent RAM arrays. The crash is possibly caused by these temporary buffers exceeding the remaining available RAM, which is being hogged by my persistent arrays.
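For reference, the manual fallback I'm comparing against looks roughly like this. It's only a minimal sketch: `d_buf`, `h_buf`, `CHUNK_ELEMS` and `peer` are illustrative names (not from my actual code), `h_buf` is one of my persistent RAM arrays, and error checking is omitted.

```c
/* Sketch of the manual CPU-staged exchange: copy a chunk of the persistent
 * device array to a persistent host array, do a host-to-host MPI exchange,
 * then copy the received chunk back to VRAM.
 * h_buf is assumed to be page-locked (cudaMallocHost / cudaHostRegister)
 * so the device<->host copies run at full bandwidth. */
#include <stddef.h>
#include <mpi.h>
#include <cuda_runtime.h>

void staged_exchange(double *d_buf, double *h_buf, size_t count, int peer)
{
    const size_t CHUNK_ELEMS = (size_t)1 << 27;   /* 1 GiB of doubles per chunk */

    for (size_t off = 0; off < count; off += CHUNK_ELEMS) {
        size_t n = (count - off < CHUNK_ELEMS) ? (count - off) : CHUNK_ELEMS;

        /* VRAM -> persistent RAM buffer */
        cudaMemcpy(h_buf, d_buf + off, n * sizeof(double), cudaMemcpyDeviceToHost);

        /* host-to-host exchange over TCP; the received chunk replaces h_buf */
        MPI_Sendrecv_replace(h_buf, (int)n, MPI_DOUBLE,
                             peer, 0, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* persistent RAM buffer -> VRAM */
        cudaMemcpy(d_buf + off, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
    }
}
```

In the GPUDirect case, the same exchange is just `MPI_Isend`/`MPI_Irecv` on `d_buf` directly, which is what I'd like to keep wherever it's actually faster.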
If my dubious understanding is correct, there seem to be at least two ways I can address these problems:

1. Tell UCX to stage its CPU-routed traffic through my existing persistent RAM arrays (or at least a single pre-allocated buffer), instead of allocating its own temporary buffers per message.
2. Determine at runtime which GPU pairs can actually use GPUDirect, and fall back to my own manual CPU-staged copy (through the persistent RAM arrays) for the remaining pairs.
I'm unsure whether either of these things is possible - obviously not through the message-passing interface itself, but I'm hoping I can hook into UCX at a lower level to determine or configure such things. I have fuzzy ideas about using `cudaDeviceCanAccessPeer()`, though I can't see how I could query GPUs across different machines, and I also don't want to rob UCX of the opportunity to use inter-GPU exchanges that are faster than my manual copying when permitted by non-P2P technologies.

Does anyone have any thoughts about my use-case, or ideas for a solution? I have made a Stack Overflow post about this problem, where I'm thinking more generally about CUDA-aware MPI, but I'm happy to build in a UCX-specific solution.
Thanks for reading!