
Commit e8db3e7

finbarrtimbers, gemini-code-assist[bot], and claude authored
Silences NCCL warning (#1055)
* Loads model on device

* Update open_instruct/grpo_fast.py

  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Set conditional device map

* Set CUDA device before init_process_group to silence NCCL warning

  Adds torch.cuda.set_device(self.local_rank) immediately before init_process_group() calls in PolicyTrainerRayProcess.setup_model_update_group() to silence the NCCL warning about unknown device mapping.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <[email protected]>

* Set CUDA device immediately before DeepSpeed init to silence NCCL warning

  The NCCL warning was coming from deepspeed.init_distributed(), not from the model_update_group initialization. Added torch.cuda.set_device(self.local_rank) immediately before deepspeed.init_distributed() to ensure the device is properly set when DeepSpeed creates its distributed process group.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <[email protected]>

* trying to fix it

* Removed unneeded code.

* Cleaned up PR.

* Cleaned PR

* Cleaned PR

* Undid changes to open_instruct/utils.py

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Claude <[email protected]>
1 parent 0913deb commit e8db3e7
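
For reference, a minimal standalone sketch (not part of the commit) of the pattern the diff below applies: passing device_id to torch.distributed.init_process_group so NCCL knows which GPU each rank owns, which avoids the device-mapping warning. The LOCAL_RANK/RANK/WORLD_SIZE environment-variable plumbing and the 30-minute timeout here are illustrative assumptions; device_id is only accepted on recent PyTorch versions.

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Illustrative torchrun-style rendezvous; the env-var plumbing is hypothetical.
local_rank = int(os.environ["LOCAL_RANK"])
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Passing device_id binds this rank to a specific GPU up front, so NCCL does
# not have to guess the rank-to-device mapping (the source of the warning).
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    world_size=world_size,
    rank=rank,
    timeout=timedelta(minutes=30),  # illustrative; the commit uses args.backend_timeout
    device_id=torch.device("cuda", local_rank),
)
```
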

1 file changed: +9 -0 lines changed

open_instruct/grpo_fast.py

Lines changed: 9 additions & 0 deletions
@@ -625,6 +625,15 @@ def load(self, path: str, map_location=None):
     np.random.seed(worker_seed)
     random.seed(worker_seed)

+    torch.distributed.init_process_group(
+        backend="nccl",
+        init_method="env://",
+        world_size=self.world_size,
+        rank=self.rank,
+        timeout=timedelta(minutes=args.backend_timeout),
+        device_id=torch.device("cuda", self.local_rank),
+    )
+
     deepspeed.init_distributed(timeout=timedelta(minutes=args.backend_timeout))

     ds_config = get_train_ds_config(offload=False, adam_offload=False, stage=args.deepspeed_stage, bf16=True)

0 commit comments
