
Training hangs after adding the 'crop_and_zoomin' operation #25

Open
terryII opened this issue Jul 22, 2024 · 1 comment

Comments

@terryII

terryII commented Jul 22, 2024

As shown above, training runs normally when the dataset contains no 'crop_and_zoomin' operation. After adding that operation, however, training gets stuck at the torch.distributed.broadcast call under mpu.broadcast_data in the broadcast_auto_com function of fintune.py, and then returns the following output:
```

[rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1704987288773/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0ea77ced87 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f0ea89934d6 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f0ea8996a2d in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f0ea8997629 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f0ef424bbf4 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x8609 (0x7f0efdccf609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f0efda9a353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank7]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1704987288773/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa40e725d87 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fa40f8ea4d6 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fa40f8eda2d in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fa40f8ee629 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7fa45b1a2bf4 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x8609 (0x7fa464c26609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fa4649f1353 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-07-22 14:13:38,511] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10525
[2024-07-22 14:13:40,488] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10526
[2024-07-22 14:13:42,465] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10527
[2024-07-22 14:13:45,510] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10528
[2024-07-22 14:13:47,367] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10529
[2024-07-22 14:13:49,382] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10530
[2024-07-22 14:13:51,357] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10531
[2024-07-22 14:13:51,365] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10532

```
How should this situation be resolved? @qijimrc
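
For context on why this shows up as a timeout rather than an immediate error: NCCL collectives hang whenever the ranks of a group do not all reach the same call. The sketch below is hypothetical, not code from this repository, and only assumes that the mpu.broadcast_data path behaves like a plain torch.distributed.broadcast; it illustrates how a single rank taking a different code path (for example extra preprocessing triggered by a 'crop_and_zoomin' sample) leaves every other rank blocked until the 600000 ms watchdog seen in the log fires.

```python
# Hypothetical sketch, launched with e.g. `torchrun --nproc_per_node=8 repro.py`.
# It mimics the failure mode only; names and shapes are illustrative.
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR for env:// init.
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=10),  # same 600000 ms budget as the watchdog in the log
    )
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    payload = torch.zeros(2, device="cuda")  # NumelIn=2, as in the WorkNCCL entries above
    if rank == 0:
        payload += 1.0

    # If any rank never reaches this line (a different preprocessing branch,
    # a worker stuck on a 'crop_and_zoomin' sample, an exception swallowed
    # earlier), every other rank blocks here until the NCCL watchdog tears
    # the process group down, which is exactly the traceback shown above.
    dist.broadcast(payload, src=0)
    print(f"rank {rank}: broadcast completed, payload={payload.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Under that reading, a first check is whether every rank actually reaches mpu.broadcast_data for batches containing 'crop_and_zoomin', for example by printing the rank and sample id immediately before the call; a single rank that never arrives is enough to stall all the others.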

@terryII
Author

terryII commented Jul 22, 2024

The same problem also occurs with the official com dataset. Training hardware: 8x A10 (24G), MP_SIZE=4, torch=2.2.0, cuda=12.1.
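
With a stock PyTorch 2.2 / NCCL setup like the one described above, one way to make the desynchronization easier to localize is to enable distributed debug logging before the launcher spawns the workers. These are standard PyTorch and NCCL environment variables, not options of this repository, and they are diagnostics rather than a fix:

```python
# Diagnostic sketch only: export these in the shell, or set them at the very
# top of the entry script before any process group is created.
import os

os.environ["NCCL_DEBUG"] = "INFO"                 # per-rank NCCL logging
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # checks that ranks issue matching collectives
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"      # raise an error on the blocked rank instead of a silent hang
```

TORCH_DISTRIBUTED_DEBUG=DETAIL in particular reports mismatched collectives across ranks, which should help narrow down which rank's 'crop_and_zoomin' batch diverges.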
