Silences NCCL warning #1055
Conversation
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Adds torch.cuda.set_device(self.local_rank) immediately before init_process_group() calls in PolicyTrainerRayProcess.setup_model_update_group() to silence the NCCL warning about unknown device mapping. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
…ning

The NCCL warning was coming from deepspeed.init_distributed(), not from the model_update_group initialization. Added torch.cuda.set_device(self.local_rank) immediately before deepspeed.init_distributed() to ensure the device is properly set when DeepSpeed creates its distributed process group. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
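As a rough sketch of the pattern these commits describe, assuming a Ray actor that stores its GPU index in self.local_rank (the class shown here is a stripped-down stand-in for illustration, not the real PolicyTrainerRayProcess):

```python
import torch
import deepspeed


class PolicyTrainerRayProcess:  # stripped-down stand-in for the real class
    def __init__(self, local_rank: int):
        self.local_rank = local_rank

    def setup(self):
        # Pin this process to its GPU before any process group is created, so
        # NCCL knows the rank-to-device mapping instead of guessing it (which
        # is what triggers the warning).
        torch.cuda.set_device(self.local_rank)
        deepspeed.init_distributed()
```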
nice ty
/gemini what do you think?
Thanks for asking! I've reviewed the changes in this pull request. Calling torch.cuda.set_device(self.local_rank) before initializing the process group is a best practice in distributed training with PyTorch and NCCL, as it ensures that each process is correctly assigned to its intended GPU, preventing potential issues like device mismatches or hangs that the warning indicates. It also improves the clarity and robustness of the distributed setup. Overall, this is a solid improvement that addresses a common warning in distributed PyTorch applications.
Previously, we would get the NCCL warning about an unknown device mapping. By specifying the device_id in init_process_group, we no longer see it.

Example run: Beaker
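For illustration, a minimal sketch of passing device_id to init_process_group; the LOCAL_RANK environment-variable handling is an assumption about a torchrun-style launcher, not code from this repository:

```python
import os

import torch
import torch.distributed as dist

# Hypothetical launcher-provided rank; torchrun exports LOCAL_RANK per worker.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# Passing device_id (available in recent PyTorch releases) tells NCCL which GPU
# this rank owns, so it no longer warns about an unknown device mapping.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),
)
```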