[CUDA][HIP] too many processes spawned on multiple GPU systems #15251
Comments
Forgive me if I've misunderstood the problem, since I'm not sure there is enough information, or maybe I misread, but it looks to me like this is expected behaviour. As you point out, you can resolve this problem by using `CUDA_VISIBLE_DEVICES`. I think what you are asking for is a way to map MPI ranks to GPUs without using `CUDA_VISIBLE_DEVICES`. This is identical to how you do MPI with native CUDA, and this is generally the case; we have tried to emphasize this in https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide. If I am wrong and there is a problem with using CUDA-aware MPI in SYCL that is not documented in https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide, please let us know.
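For context, a minimal sketch of what "identical to how you do MPI with native CUDA" looks like in practice (an illustration only, not code from the guide; the node-local-rank computation and the modulo mapping are assumptions for this sketch):

```cpp
// Hypothetical sketch: bind each MPI rank to one GPU, as one would in a native
// CUDA + MPI code. The local-rank / modulo mapping is an assumption, not a prescription.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // Node-local rank, so the mapping also works across multiple nodes.
    MPI_Comm local_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local_comm);
    int local_rank = 0;
    MPI_Comm_rank(local_comm, &local_rank);

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

    // From here on, all CUDA work issued by this rank targets a single GPU.
    cudaSetDevice(local_rank % num_devices);
    std::printf("local rank %d -> device %d\n", local_rank, local_rank % num_devices);

    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}
```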
Indeed, the situation described in https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide#mapping-mpi-ranks-to-specific-devices is really close to what I'm doing internally.
As described in the same guide, I'm doing something which looks like:

std::vector<sycl::device> Devs;
for (const auto &plt : sycl::platform::get_platforms()) {
  if (plt.get_backend() == sycl::backend::cuda)
    Devs.push_back(plt.get_devices()[0]);
}
sycl::queue q{Devs[rank]};

However, correct me if I am wrong, but the expected behavior would be that when I do this, each rank's process only appears on the single GPU it selected.
Currently, by doing so, you instead get all ranks starting a process on every GPU, even though only one GPU is used per process. The issue is that there is no way to disable streams on the unused device. This confuses MPI, which in turn, I suspect, creates the memory leak.

Maybe I was unclear in the initial post, but to reproduce the issue you can simply start a SYCL program without MPI and observe that both GPUs show up in nvidia-smi. Even if this can be worked around by using a proper binding script, I suspect that this is not the expected behavior of DPC++?
Yes, that should be correct. I see what you mean. I have not seen such behaviour, but I can try to reproduce it. I wonder first of all whether it is an artifact of some part of your program.

Have you tried our samples that we linked in the documentation, e.g. https://github.com/codeplaysoftware/SYCL-samples/blob/main/src/MPI_with_SYCL/send_recv_usm.cpp? As I understand it, you would expect to see the same behaviour for that sample, but I don't remember ever seeing duplicate processes. If you do see the same issue with that sample, I suspect this might also be an artifact of your cluster setup.

You might also want to confirm that you don't see the same behaviour with a simple CUDA MPI program, e.g. https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/. I would be surprised if this were a DPC++-specific issue: once the program is compiled, as far as MPI is concerned there is no distinction between it being compiled with DPC++ or nvcc.
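For reference, the kind of plain CUDA-aware MPI check being suggested might look like the sketch below (an illustration under stated assumptions, not the linked NVIDIA or Codeplay sample; buffer size and tag are arbitrary): each rank binds to one GPU and passes a device pointer to MPI directly.

```cpp
// Hypothetical minimal CUDA-aware MPI check: run with two ranks and watch nvidia-smi;
// each rank should appear on one GPU only.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, num_devices = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(rank % num_devices);  // one GPU per rank

    const int N = 1024;
    double *buf = nullptr;
    cudaMalloc(reinterpret_cast<void **>(&buf), N * sizeof(double));

    // CUDA-aware MPI: device pointers are handed to MPI directly.
    if (rank == 0)
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```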
I will try, but the simplest example already tends to trigger the issue with DPC++. This simple code on a dual-GPU system shows the issue, even without MPI:

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>

std::vector<sycl::device> get_sycl_device_list() {
  std::vector<sycl::device> devs;
  const auto &Platforms = sycl::platform::get_platforms();
  for (const auto &Platform : Platforms) {
    const auto &Devices = Platform.get_devices();
    for (const auto &Device : Devices) {
      devs.push_back(Device);
      // Intentionally return after the first device found: only one device is used below.
      return devs;
    }
  }
  return devs;
}

int main(void) {
  for (auto d : get_sycl_device_list()) {
    auto DeviceName = d.get_info<sycl::info::device::name>();
    std::cout << DeviceName << std::endl;
  }
  std::cin.ignore();
}
Here the process is initialised on both GPUs even though no queues have been created and only the first device has been used (only to query its name). Including MPI does pretty much the same, times two. Send/receive works fine with that setup, except for the weird memory leak (I've checked the allocations and it is not on my side).
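For completeness, a sketch of the MPI variant being described (assumptions for illustration: one rank per GPU on a single node, the world rank used directly as the device index, as in the guide snippet quoted earlier); each rank creates a queue on its own device only, which is the setup where the duplicated processes described above were observed.

```cpp
// Hypothetical sketch of the MPI + SYCL variant described above: each rank
// builds the CUDA device list and creates a queue on Devs[rank] only.
#include <sycl/sycl.hpp>
#include <mpi.h>
#include <vector>
#include <iostream>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Collect one device per CUDA platform, as in the guide snippet quoted earlier.
    // Assumes at least one CUDA device is present.
    std::vector<sycl::device> Devs;
    for (const auto &plt : sycl::platform::get_platforms()) {
        if (plt.get_backend() == sycl::backend::cuda)
            Devs.push_back(plt.get_devices()[0]);
    }

    // Only this rank's device is ever used...
    sycl::queue q{Devs[rank % Devs.size()]};
    std::cout << "rank " << rank << " uses "
              << q.get_device().get_info<sycl::info::device::name>() << std::endl;

    // ...yet, per the report above, the process still shows up on every GPU in nvidia-smi.
    MPI_Finalize();
    return 0;
}
```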
This definitely isn't happening on my system (I just sanity-checked it again using your code quoted above on a multi-GPU system).
For the record, I tried with oneAPI 2024.2.0 (and a matching Codeplay plugin) on a dual-GPU machine, and have the same output as @tdavidcl:

$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.4]
[cuda:gpu][cuda:1] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.4]
$ nvidia-smi --query-compute-apps=pid,name,gpu_bus_id,used_gpu_memory --format=csv
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
$ /opt/tcbsys/intel-oneapi/2024.2.0/compiler/2024.2/bin/compiler/clang++ -fsycl test.cpp
$ ./a.out &
[1] 17000
$ Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
Press any key...
[1]+ Stopped ./a.out
$ nvidia-smi --query-compute-apps=pid,name,gpu_bus_id,used_gpu_memory --format=csv
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
17000, ./a.out, 00000000:17:00.0, 154 MiB
17000, ./a.out, 00000000:65:00.0, 154 MiB
Thanks, I've now reproduced the issue. We think we understand the root cause, and someone on the team has a patch on the way. It isn't an MPI-specific issue, but a problem with the usage of cuContext that affects all code.
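To illustrate why context handling matters here (an illustrative sketch only, not the actual DPC++/unified-runtime code): with the CUDA driver API, merely retaining a primary context on a device is enough to make the process appear on that GPU in nvidia-smi with a baseline memory cost, so retaining contexts on every device during enumeration would look exactly like the output above.

```cpp
// Illustrative CUDA driver API sketch: retaining a primary context on each device
// makes this process visible on every GPU in nvidia-smi, even if no kernels or
// allocations ever target those devices.
#include <cuda.h>
#include <cstdio>
#include <vector>

int main() {
    cuInit(0);

    int count = 0;
    cuDeviceGetCount(&count);

    std::vector<CUcontext> contexts;
    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);

        CUcontext ctx = nullptr;
        // This call alone reserves device memory and registers the process on GPU i.
        cuDevicePrimaryCtxRetain(&ctx, dev);
        contexts.push_back(ctx);
    }

    std::printf("Retained contexts on %d device(s); check nvidia-smi now.\n", count);
    std::getchar();  // pause so nvidia-smi can be inspected

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        cuDevicePrimaryCtxRelease(dev);
    }
    return 0;
}
```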
I opened a proposed fix for this here: oneapi-src/unified-runtime#2077. If you have any feedback on this, feel free to post it there. Thanks.
We have updated our MPI documentation to reflect this issue. See https://developer.codeplay.com/products/oneapi/nvidia/latest/guides/MPI-guide#mapping-mpi-ranks-to-specific-devices. Apologies that there is currently no solution other than relying on environment variables, as you have already done. Thank you very much for pointing this issue out to us. I have also opened an issue on Open MPI to try to understand this behaviour better: open-mpi/ompi#12848
Describe the bug
On multiple-GPU systems, using HIP or CUDA, a process is spawned on all GPUs instead of being spawned on only one of them (see the To reproduce section).
This results in memory leaks when SYCL is used with both MPICH and Open MPI, as both GPUs end up receiving data, even though the program (in the following example, a private HPC application) only uses one of them per MPI rank. This results in a graph like this (memory usage per process / time):
mpirun -n 2 <...>
where the blue and red curves are the working GPU processes, and the two other, growing, curves are the threads on the wrong GPUs.
CUDA_VISIBLE_DEVICES can be used to circumvent the issue
To reproduce
On a multiple-GPU system, this code snippet (the reproducer quoted in the comments above) results in processes being spawned on both GPUs, even though only one GPU should be initialized.
Environment
clang++ --version
sycl-ls --verbose

Additional context
No response