[Train] Add env vars to enable Share AMD ROCR_VISIBLE_DEVICES
#49346
Signed-off-by: Hongpeng Guo <[email protected]>
@AVSuni @amorinConnor Feel free to take a look and review this PR.
ROCM_VIDIABLE_DEVICES -> ROCM_VISIBLE_DEVICES
@hongpeng-guo I believe AMD uses ROCR* in its environment variables, not ROCM* as you have it: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html I will run some tests today to see if this fixes the issue.
Just as a follow-up, there are already some spots inside Ray where ROCR* is used, for example python/ray/_private/accelerators/amd_gpu.py.
@hongpeng-guo After modifying your code to use ROCR*, it looks like this fixes the issue. While I'm not able to run the original code (I think due to another problem on my end), the following example runs without error and rocm-smi shows all 4 GPUs utilized:
Thank you so much for testing it out! Let me update this PR and try to get it merged soon.
Got it! Thank you so much for digging deep into it. The code above is from the Ray Core level accelerator setup. In Ray Train, our abstraction is a bit different, but I think in the long run we may be able to reuse the Ray Core accelerator utilities. cc @matthewdeng
Signed-off-by: Hongpeng Guo <[email protected]>
Update: Fixed the env var naming from ROCM to ROCR. @amorinConnor confirmed that it works on AMD devices.
@matthewdeng PTAL.
ROCM_VISIBLE_DEVICES -> ROCR_VISIBLE_DEVICES
nice
Why are these changes needed?
This PR enables sharing ROCR_VISIBLE_DEVICES when using AMD GPUs. This way, each device can see and communicate with the other GPU devices.
Related issue number
#49260
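Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.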
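To illustrate what "sharing" ROCR_VISIBLE_DEVICES means in the description above, here is a minimal sketch, not the code from this PR. It assumes each worker reports a node ID and the GPU IDs assigned to it. The function name share_visible_devices and the (node_id, gpu_ids) input shape are hypothetical and are not part of Ray's API. For every worker it computes the value that would be written to ROCR_VISIBLE_DEVICES: the union of GPU IDs held by all workers on the same node.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

# Environment variable name that AMD's ROCm runtime reads for GPU isolation.
ROCR_ENV_VAR = "ROCR_VISIBLE_DEVICES"


def _sort_key(gpu_id: str) -> Tuple[int, object]:
    # Numeric IDs sort numerically and come first; anything else sorts as text.
    return (0, int(gpu_id)) if gpu_id.isdigit() else (1, gpu_id)


def share_visible_devices(
    workers: Sequence[Tuple[str, Sequence[int]]],
) -> List[Dict[str, str]]:
    """Hypothetical helper: build per-worker env vars so that every worker
    on a node can see all GPUs assigned to any worker on that node.

    `workers` is a sequence of (node_id, gpu_ids) pairs, one per worker.
    """
    gpus_per_node: Dict[str, Set[str]] = defaultdict(set)
    for node_id, gpu_ids in workers:
        gpus_per_node[node_id].update(str(g) for g in gpu_ids)

    env_per_worker = []
    for node_id, _ in workers:
        shared = ",".join(sorted(gpus_per_node[node_id], key=_sort_key))
        env_per_worker.append({ROCR_ENV_VAR: shared})
    return env_per_worker


if __name__ == "__main__":
    # Four workers on one node, one GPU each: all of them end up seeing 0-3.
    for env in share_visible_devices([("node-a", [i]) for i in range(4)]):
        print(env)
```

In a real setup each worker process would then set os.environ["ROCR_VISIBLE_DEVICES"] to its computed value before initializing the GPU runtime; how Ray Train actually wires this up is not shown in this conversation.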