RuntimeError: NCCL communicator was aborted on rank 0. Single-machine multi-GPU training times out #46

Open
answerman1 opened this issue Dec 22, 2023 · 0 comments

Comments

@answerman1

Single-machine multi-GPU training fails with the following error:
RuntimeError: NCCL communicator was aborted on rank 1. Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=204699, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800104 milliseconds before timing out.
The traceback is as follows:
Traceback (most recent call last):
File "train.py", line 610, in <module>
fit_one_epoch(model_train, model, ema, yolo_loss, loss_history, eval_callback, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, fp16, scaler, save_period, save_dir, local_rank)
File "/apsara/TempRoot/Odps/ytrec_20231222071249521gmacm0sr1bm6_93d0ce33_26a4_47d3_889d_d09f09e82671_AlgoTask_0_0/[email protected]#0/workspace/utils/utils_fit.py", line 54, in fit_one_epoch
scaler.scale(loss_value).backward()
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 130, in backward
torch.distributed.all_reduce(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
work = group.allreduce([tensor], opts)

Does anyone know how to fix this?
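
For reference, the numbers in the watchdog message can be read off mechanically. The sketch below is only an illustration: `PATTERN` and `parse_watchdog` are hypothetical names, and the regular expression is written for the exact message layout quoted above, not for NCCL output in general. It shows that `Timeout(ms)=1800000` is a 30-minute limit and that the ALLREDUCE overran it by about 100 ms. `torch.distributed.init_process_group` accepts a `timeout` argument given as a `datetime.timedelta`, so this limit is configurable, though it is an assumption that a longer timeout would help here rather than just delay the same failure if one rank has stalled.

```python
import re
from datetime import timedelta

# Watchdog message copied from the error reported in this issue.
MESSAGE = (
    "[Rank 1] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=204699, OpType=ALLREDUCE, Timeout(ms)=1800000) "
    "ran for 1800104 milliseconds before timing out."
)

# Hypothetical pattern, written for this message layout only.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?"
    r"SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) "
    r"ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog(message):
    """Extract rank, sequence number, op type and timings from the message."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like a watchdog timeout")
    return {
        "rank": int(match["rank"]),
        "seq": int(match["seq"]),
        "op": match["op"],
        "timeout": timedelta(milliseconds=int(match["timeout_ms"])),
        "elapsed": timedelta(milliseconds=int(match["elapsed_ms"])),
    }


info = parse_watchdog(MESSAGE)
print(info["op"], "on rank", info["rank"], "hit a limit of", info["timeout"])
print("overran by", info["elapsed"] - info["timeout"])
```

The parsed `timeout` value is already a `timedelta`, which is the type the `timeout` argument expects, so a different limit could be passed in the same form when the process group is created.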
