
Avoid poisoning process with CUDA calls as soon as importing #6810

Merged: 6 commits merged into microsoft:master on Dec 12, 2024

Conversation

@HollowMan6 (Contributor) commented Nov 29, 2024

Call `torch.cuda.device_count() > 0` before `torch.cuda.is_available()` to give priority to NVML-based availability checking, so that we avoid poisoning the process with CUDA calls as soon as we execute `import deepspeed`.

https://github.com/pytorch/pytorch/blob/v2.5.1/torch/cuda/__init__.py#L120-L124
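
A minimal sketch of the proposed check ordering (a hypothetical standalone helper, not the actual DeepSpeed accelerator-detection code): query the device count first, which PyTorch can answer via NVML without creating a CUDA context, and only fall back to `is_available()` when a device was reported.

```python
# Hedged sketch of the proposed ordering, not the actual DeepSpeed code:
# device_count() is answered via NVML when possible (no CUDA context is
# created on that path), so is_available() -- which may initialize the CUDA
# runtime -- is only reached once at least one device has been reported.
import torch

def cuda_device_seems_present() -> bool:
    if torch.cuda.device_count() == 0:  # NVML-based count when available
        return False
    return torch.cuda.is_available()    # runtime check only if a device was seen

if __name__ == "__main__":
    print("CUDA detected:", cuda_device_seems_present())
```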

There are 2 reasons to make this change:

Firstly, if we accidentally `import deepspeed`, the CUDA runtime initializes (the first CUDA API call triggers it) and caches the device list, so changing `CUDA_VISIBLE_DEVICES` within the same process after initialization has no effect on the visible devices. The specific case:
OpenRLHF/OpenRLHF#524 (comment)

A demo for reproduction before the fix is applied:

```python
import torch
import os

# Hide all GPUs before importing deepspeed.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import deepspeed  # importing triggers a CUDA call, which caches the (empty) device list

# Re-exposing devices afterwards has no effect on the already-initialized runtime ...
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
torch.cuda.set_device('cuda:0')  # ... so this fails even though GPUs exist
```

Secondly, from https://pytorch.org/docs/stable/notes/cuda.html:

> When assessing the availability of CUDA in a given environment (`is_available()`), PyTorch’s default behavior is to call the CUDA Runtime API method `cudaGetDeviceCount`. Because this call in turn initializes the CUDA Driver API (via `cuInit`) if it is not already initialized, subsequent forks of a process that has run `is_available()` will fail with a CUDA initialization error.
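
A minimal sketch of the fork-poisoning behaviour the quote describes, assuming a Linux machine with a CUDA GPU and the default (CUDA Runtime based) availability check; the exact error message depends on the PyTorch and CUDA versions.

```python
# Sketch of the fork-poisoning scenario: the parent runs is_available(),
# which initializes the CUDA driver (cuInit); a child created with the
# "fork" start method then cannot use CUDA and raises a RuntimeError.
import multiprocessing as mp
import torch

def child() -> None:
    try:
        torch.zeros(1, device="cuda")           # any CUDA work in the forked child
        print("child: CUDA call succeeded")
    except RuntimeError as err:
        print("child: CUDA call failed:", err)  # typically a CUDA initialization error

if __name__ == "__main__":
    torch.cuda.is_available()                   # poisons subsequent forks (default check)
    proc = mp.get_context("fork").Process(target=child)
    proc.start()
    proc.join()
```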

@loadams requested a review from tjruwase on December 4, 2024 19:58
@tjruwase (Contributor) commented Dec 6, 2024

@HollowMan6, thanks for diagnosing the problem and sharing a PR. However, I don't understand why this is the correct solution.

> Switch from `torch.cuda.is_available()` to `torch.cuda.device_count() > 0`, to give priority to nvml based availability

  1. The link you shared shows that `is_available()` already gives priority to NVML-based checking, assuming the correct environment setup. So, I am confused about your above comment.
  2. `device_count()` succeeds only if NVML works; otherwise it falls back to the solution that causes fork poisoning.

Can you please explain what I am missing? Thanks!

@tjruwase (Contributor) commented Dec 6, 2024

@HollowMan6, I want to share my thoughts on this problem.

Building on your great analysis here and there, I did some further digging to get a better appreciation of the painful state of CUDA availability discovery in PyTorch. However, I think this is a problem that should be fixed in PyTorch rather than DeepSpeed. This is because DeepSpeed builds on PyTorch and should preserve semantics (for good or bad) as much as practical. So, my three takeaways are as follows:

1. In this case, my understanding is that `is_available()` is the recommended way to do CUDA discovery, despite its limitations.

[screenshot: PyTorch documentation recommending is_available()]

2. There is a recommendation for how users can enable NVML-based checking for `is_available()`. We can amplify this.

[screenshot: PyTorch documentation on enabling the NVML-based check]

3. However, the docs also note that the NVML-based check is not foolproof.

[screenshot: PyTorch documentation caveat that the NVML-based check gives a weaker guarantee]

Hope to get your thoughts. Thanks!

@HollowMan6 (Contributor, Author) commented:

Hi @tjruwase! Thank you for reviewing the PR! Some comments from me:

> 1. The link you shared shows that `is_available()` already gives priority to NVML-based checking, assuming the correct environment setup. So, I am confused about your above comment.

Yes, that's correct, but I don't think it's a good idea to force users to set the flag in the environment. As https://pytorch.org/docs/stable/notes/cuda.html notes, the NVML-based CUDA availability assessment provides a weaker guarantee than the default CUDA Runtime API approach (which requires CUDA initialization to succeed); in some circumstances, the NVML-based check may succeed while later CUDA initialization fails (as you noted).
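
For reference, the flag being discussed here is, per the PyTorch CUDA notes, the `PYTORCH_NVML_BASED_CUDA_CHECK` environment variable; a minimal sketch of opting in (behaviour may vary across PyTorch versions):

```python
# Opting in to the NVML-based availability check described in the PyTorch
# CUDA notes. Set the variable before the first availability query
# (typically exported in the shell before launching the process).
import os
os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"] = "1"

import torch
# Answered via NVML: weaker guarantee, but avoids initializing the CUDA runtime.
print(torch.cuda.is_available())
```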

From my understanding of the comments in the code, the intent is only to determine whether we are on a GPU or an x86 CPU with torch, not to guarantee that CUDA initialization will succeed. But to address your concern, if you do want to ensure that CUDA initialization also works, I can change the check to:

```python
if torch.cuda.device_count() > 0 and torch.cuda.is_available():  #ignore-cuda
```

So that we first ensure that devices are available, and only then check CUDA initialization. If no devices are available, we shouldn't make any CUDA calls at all.
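
To illustrate the ordering (a toy sketch with stand-in functions, not the real torch calls): Python's `and` short-circuits, so the runtime-initializing check is never evaluated when the NVML-backed count reports zero devices.

```python
# Stand-ins for torch.cuda.device_count() and torch.cuda.is_available(),
# used only to show the evaluation order of `a() > 0 and b()`.
def nvml_device_count() -> int:
    print("NVML-based count queried")
    return 0                              # pretend no GPU is visible

def runtime_is_available() -> bool:
    print("CUDA runtime initialized")     # never reached on the CPU-only path
    return True

if nvml_device_count() > 0 and runtime_is_available():
    print("GPU path")
else:
    print("CPU path")                     # only the NVML line and this line are printed
```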

Will update the PR for this.

> However, I think this is a problem that should be fixed in PyTorch rather than DeepSpeed. This is because DeepSpeed builds on PyTorch and should preserve semantics (for good or bad) as much as practical.

This is unfortunately a CUDA issue, and I don't think PyTorch can do much on their side either: pytorch/pytorch#141678 (comment)

For DeepSpeed, what makes it worse is that we make a CUDA call as soon as we do an import, which puts developers in a very tough situation when something like OpenRLHF/OpenRLHF#524 (comment) happens again. So, personally, I do hope that this particular case can be fixed on the DeepSpeed side (don't make any CUDA calls when no CUDA device is available).

> 2. `device_count()` succeeds only if NVML works; otherwise it falls back to the solution that causes fork poisoning.

Yes, that's true and unfortunate, but the issue is mitigated when NVML works, so it is still worthwhile to do something on the DeepSpeed side.

@HollowMan6 marked this pull request as draft on December 6, 2024 16:18
@HollowMan6 marked this pull request as ready for review on December 6, 2024 17:11
@tjruwase (Contributor) commented Dec 9, 2024

> From my understanding of the comments in the code, the intent is only to determine whether we are on a GPU or an x86 CPU with torch, not to guarantee that CUDA initialization will succeed.

Actually, you are correct: the intention of this logic is device identification rather than initialization. I missed/forgot this nuance when I first reviewed. However, it seems unclear whether CUDA device identification can generally be performed without fork poisoning, since `device_count()` and NVML-based checking are not guaranteed to always succeed. Nevertheless, your PR is a usability improvement. Thanks for the great contribution.

@HollowMan6 (Contributor, Author) commented Dec 9, 2024

Will fix the formatting issue now.

This failure doesn't seem to be related to this PR (a permission-denied error): https://github.com/microsoft/DeepSpeed/actions/runs/12240278508/job/34142717128

@HollowMan6 (Contributor, Author) commented:

Fixed. This is not intelligent, as those are just comments XD
[screenshot: the lines flagged by the check, which are only comments]

@loadams enabled auto-merge on December 10, 2024 23:59
@loadams added this pull request to the merge queue on Dec 12, 2024
The github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Dec 12, 2024
@loadams merged commit 9182947 into microsoft:master on Dec 12, 2024
11 checks passed
@HollowMan6 deleted the pr branch on December 12, 2024 20:49