
Conversation

finbarrtimbers (Collaborator) commented Oct 3, 2025

Previously, we would get this warning:

2025-10-03T20:49:10.479Z (PolicyTrainerRayProcess pid=1720) [rank0]:[W1003 20:49:10.123654124
  ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently
  unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in
  init_process_group() to force use of a particular device.

By specifying device_id in init_process_group(), the warning no longer appears.
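A minimal sketch of that change (not the exact diff), assuming the rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are already set by the launcher and that local_rank stands in for self.local_rank on the trainer process; device_id requires a recent PyTorch release:

```python
import os

import torch
import torch.distributed as dist

# local_rank is an assumption here; in the PR it is self.local_rank on the Ray actor.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# Passing device_id tells NCCL up front which GPU this rank owns,
# which removes the "device used by this process is currently unknown" warning.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),
)
```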

Example run: Beaker

finbarrtimbers and others added 6 commits October 3, 2025 09:55
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Adds torch.cuda.set_device(self.local_rank) immediately before
init_process_group() calls in PolicyTrainerRayProcess.setup_model_update_group()
to silence the NCCL warning about unknown device mapping.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ning

The NCCL warning was coming from deepspeed.init_distributed(), not from the
model_update_group initialization. Added torch.cuda.set_device(self.local_rank)
immediately before deepspeed.init_distributed() to ensure the device is
properly set when DeepSpeed creates its distributed process group.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
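A minimal sketch of the ordering this commit describes, with local_rank again standing in for self.local_rank; the point is that torch.cuda.set_device() runs before DeepSpeed creates its process group:

```python
import os

import torch
import deepspeed

local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# Bind this process to its GPU first so NCCL knows the rank-to-device mapping
# when deepspeed.init_distributed() sets up the distributed process group.
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()
```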
finbarrtimbers marked this pull request as ready for review October 3, 2025 21:56

saurabh111233212 (Contributor) left a comment


nice ty

finbarrtimbers (Collaborator, Author) commented

/gemini what do you think?

gemini-code-assist[bot] (Contributor) commented

Thanks for asking! I've reviewed the changes in this pull request.

The addition of device_id=torch.device("cuda", self.local_rank) to torch.distributed.init_process_group in open_instruct/grpo_fast.py is a good and necessary change. As you've noted in the PR description, explicitly setting the device_id helps to silence the NCCL warning regarding unknown device usage.

This is a best practice in distributed training with PyTorch and NCCL, as it ensures that each process is correctly assigned to its intended GPU, preventing potential issues like device mismatches or hangs that the warning indicates. It also improves the clarity and robustness of the distributed setup.

Overall, this is a solid improvement that addresses a common warning in distributed PyTorch applications.

finbarrtimbers added this pull request to the merge queue Oct 5, 2025
Merged via the queue into main with commit e8db3e7 Oct 5, 2025
3 checks passed