
Conversation

finbarrtimbers (Collaborator) commented Oct 3, 2025

Previously, we would get this warning:

2025-10-03T20:49:10.479Z (PolicyTrainerRayProcess pid=1720) [rank0]:[W1003 20:49:10.123654124
  ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently
  unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in
  init_process_group() to force use of a particular device.

By specifying device_id in init_process_group(), the warning no longer appears.
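A minimal sketch of that change (not the exact diff), assuming the rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are already set by the launcher and that local_rank stands in for self.local_rank on the trainer process; device_id requires a recent PyTorch release:

```python
import os

import torch
import torch.distributed as dist

# local_rank is an assumption here; in the PR it is self.local_rank on the Ray actor.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# Passing device_id tells NCCL up front which GPU this rank owns,
# which removes the "device used by this process is currently unknown" warning.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),
)
```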

Example run: Beaker

finbarrtimbers and others added 6 commits October 3, 2025 09:55
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Adds torch.cuda.set_device(self.local_rank) immediately before
init_process_group() calls in PolicyTrainerRayProcess.setup_model_update_group()
to silence the NCCL warning about unknown device mapping.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ning

The NCCL warning was coming from deepspeed.init_distributed(), not from the
model_update_group initialization. Added torch.cuda.set_device(self.local_rank)
immediately before deepspeed.init_distributed() to ensure the device is
properly set when DeepSpeed creates its distributed process group.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
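A minimal sketch of the ordering this commit describes, with local_rank again standing in for self.local_rank; the point is that torch.cuda.set_device() runs before DeepSpeed creates its process group:

```python
import os

import torch
import deepspeed

local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# Bind this process to its GPU first so NCCL knows the rank-to-device mapping
# when deepspeed.init_distributed() sets up the distributed process group.
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()
```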
finbarrtimbers marked this pull request as ready for review October 3, 2025 21:56

saurabh111233212 (Contributor) left a comment


nice ty

finbarrtimbers (Collaborator, Author) commented

/gemini what do you think?

gemini-code-assist[bot] (Contributor) commented

Thanks for asking! I've reviewed the changes in this pull request.

The addition of device_id=torch.device("cuda", self.local_rank) to torch.distributed.init_process_group in open_instruct/grpo_fast.py is a good and necessary change. As you've noted in the PR description, explicitly setting the device_id helps to silence the NCCL warning regarding unknown device usage.

This is a best practice in distributed training with PyTorch and NCCL, as it ensures that each process is correctly assigned to its intended GPU, preventing potential issues like device mismatches or hangs that the warning indicates. It also improves the clarity and robustness of the distributed setup.

Overall, this is a solid improvement that addresses a common warning in distributed PyTorch applications.

finbarrtimbers added this pull request to the merge queue Oct 5, 2025
Merged via the queue into main with commit e8db3e7 Oct 5, 2025
3 checks passed