I'm getting the following error during single-machine multi-GPU training:
RuntimeError: NCCL communicator was aborted on rank 1. Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=204699, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800104 milliseconds before timing out.
The traceback is as follows:
Traceback (most recent call last):
  File "train.py", line 610, in <module>
    fit_one_epoch(model_train, model, ema, yolo_loss, loss_history, eval_callback, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, fp16, scaler, save_period, save_dir, local_rank)
  File "/apsara/TempRoot/Odps/ytrec_20231222071249521gmacm0sr1bm6_93d0ce33_26a4_47d3_889d_d09f09e82671_AlgoTask_0_0/[email protected]#0/workspace/utils/utils_fit.py", line 54, in fit_one_epoch
    scaler.scale(loss_value).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 130, in backward
    torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
    work = group.allreduce([tensor], opts)
Does anyone know how to resolve this?
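In case it helps: the stack shows the hang inside the all_reduce issued by SyncBatchNorm's backward pass, and the 1800000 ms figure is just the default 30-minute NCCL process-group timeout. That usually means one rank stopped reaching the collective (for example, rank 0 spends a long time in evaluation or one rank runs out of batches early), so raising the timeout only buys time rather than fixing the divergence. The sketch below is a minimal, hedged example (not the repository's own code) of how one might extend the timeout and turn on NCCL diagnostics when initializing the process group; the env:// launch and LOCAL_RANK variable are assumptions that should be adapted to how train.py actually sets up DDP.

```python
# Minimal sketch: extend the NCCL collective timeout and enable diagnostics.
# Assumptions: NCCL backend, torchrun/launch-style env:// rendezvous, LOCAL_RANK set.
import os
import datetime
import torch
import torch.distributed as dist

# Per-rank NCCL logging helps identify which rank stops making progress.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Surface NCCL errors asynchronously instead of hanging until the watchdog fires
# (env var name used by PyTorch 1.x, which matches this traceback).
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# The default timeout is 30 minutes (1800000 ms, matching the error above).
# Extend it only if a single step can legitimately take longer than that.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```

If the timeout keeps firing even with a larger value, it is worth checking that every rank executes the same number of forward/backward steps per epoch and that any rank-0-only work (logging, mAP evaluation, checkpointing) does not sit between collectives that the other ranks are already waiting on.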