
Training hangs after adding the 'crop_and_zoomin' operation #25

Open
terryII opened this issue Jul 22, 2024 · 1 comment

Comments

@terryII

terryII commented Jul 22, 2024

As shown above, training runs normally when the dataset contains no 'crop_and_zoomin' operation. After adding that operation, however, training gets stuck at the torch.distributed.broadcast call under mpu.broadcast_data in the broadcast_auto_com function of fintune.py, and then returns the following output:
```

[rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1704987288773/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0ea77ced87 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f0ea89934d6 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f0ea8996a2d in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f0ea8997629 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f0ef424bbf4 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x8609 (0x7f0efdccf609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f0efda9a353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank7]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1704987288773/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa40e725d87 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fa40f8ea4d6 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fa40f8eda2d in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fa40f8ee629 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7fa45b1a2bf4 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x8609 (0x7fa464c26609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fa4649f1353 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-07-22 14:13:38,511] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10525
[2024-07-22 14:13:40,488] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10526
[2024-07-22 14:13:42,465] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10527
[2024-07-22 14:13:45,510] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10528
[2024-07-22 14:13:47,367] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10529
[2024-07-22 14:13:49,382] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10530
[2024-07-22 14:13:51,357] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10531
[2024-07-22 14:13:51,365] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10532

```
How should this situation be resolved? @qijimrc
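
For context on why this shows up as a timeout rather than an immediate error: NCCL collectives hang whenever the ranks of a group do not all reach the same call. The sketch below is hypothetical, not code from this repository, and only assumes that the mpu.broadcast_data path behaves like a plain torch.distributed.broadcast; it illustrates how a single rank taking a different code path (for example extra preprocessing triggered by a 'crop_and_zoomin' sample) leaves every other rank blocked until the 600000 ms watchdog seen in the log fires.

```python
# Hypothetical sketch, launched with e.g. `torchrun --nproc_per_node=8 repro.py`.
# It mimics the failure mode only; names and shapes are illustrative.
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR for env:// init.
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=10),  # same 600000 ms budget as the watchdog in the log
    )
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    payload = torch.zeros(2, device="cuda")  # NumelIn=2, as in the WorkNCCL entries above
    if rank == 0:
        payload += 1.0

    # If any rank never reaches this line (a different preprocessing branch,
    # a worker stuck on a 'crop_and_zoomin' sample, an exception swallowed
    # earlier), every other rank blocks here until the NCCL watchdog tears
    # the process group down, which is exactly the traceback shown above.
    dist.broadcast(payload, src=0)
    print(f"rank {rank}: broadcast completed, payload={payload.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Under that reading, a first check is whether every rank actually reaches mpu.broadcast_data for batches containing 'crop_and_zoomin', for example by printing the rank and sample id immediately before the call; a single rank that never arrives is enough to stall all the others.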

@terryII
Author

terryII commented Jul 22, 2024

The same problem also occurs with the official com dataset. Training hardware: 8x A10 (24G), MP_SIZE=4, torch=2.2.0, cuda=12.1.
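
With a stock PyTorch 2.2 / NCCL setup like the one described above, one way to make the desynchronization easier to localize is to enable distributed debug logging before the launcher spawns the workers. These are standard PyTorch and NCCL environment variables, not options of this repository, and they are diagnostics rather than a fix:

```python
# Diagnostic sketch only: export these in the shell, or set them at the very
# top of the entry script before any process group is created.
import os

os.environ["NCCL_DEBUG"] = "INFO"                 # per-rank NCCL logging
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # checks that ranks issue matching collectives
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"      # raise an error on the blocked rank instead of a silent hang
```

TORCH_DISTRIBUTED_DEBUG=DETAIL in particular reports mismatched collectives across ranks, which should help narrow down which rank's 'crop_and_zoomin' batch diverges.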
