[rank7]:[E1218 10:45:07.218557395 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=453, OpType=ALLREDUCE, NumelIn=113152000, NumelOut=113152000, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
[rank7]:[E1218 10:45:07.218835769 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 453, last enqueued NCCL work: 453, last completed NCCL work: 452.
[rank7]:[E1218 10:45:07.218872140 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 7] Timeout at NCCL work: 453, last enqueued NCCL work: 453, last completed NCCL work: 452.
[rank7]:[E1218 10:45:07.218890390 ProcessGroupNCCL.cpp:630] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E1218 10:45:07.218903480 ProcessGroupNCCL.cpp:636] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1218 10:45:07.219800514 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=453, OpType=ALLREDUCE, NumelIn=113152000, NumelOut=113152000, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[rank7]:[E1218 10:45:07.219938677 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=453, OpType=ALLREDUCE, NumelIn=113152000, NumelOut=113152000, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x780ad52bf446 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x780a8a3bea92 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x780a8a3c5ed3 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x780a8a3c793d in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x780ad54265c0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x780ad7744ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x780ad77d5bf4 in /lib/x86_64-linux-gnu/libc.so.6)
I found that some "time" values are less than zero in refined_narration_stream_train.json from videollm-online-chat-ego4d-134k which can cause the issue.
Is the "time" value can be less than zero?
The text was updated successfully, but these errors were encountered:
kkjh0723 changed the title from "Cannot fine narration_stream_train.json for Narration Refinement" to "Cannot find narration_stream_train.json for Narration Refinement" on Dec 19, 2024.
I'm attempting to train videollm-online, so I am following the preprocessing steps.
For the narration refinement step (https://github.com/showlab/videollm-online/tree/main/data/preprocess#narration-refinement), I downloaded the full Ego4D dataset but cannot find the narration_stream_{split}.json file inside the annotation folder.
Where can I find the file?
I also tried to use the refined file from https://huggingface.co/datasets/chenjoya/videollm-online-chat-ego4d-134k,
but the training was killed by an unknown exception: the NCCL watchdog timeout shown in the log above.
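In case it helps with debugging, here is a minimal sketch (my own, not taken from the videollm-online training scripts) of how the NCCL collective timeout seen in the log could be raised via the standard torch.distributed API. A longer timeout does not fix the underlying data, but it can help distinguish a slow collective from a genuine deadlock:

```python
# Sketch only: raise the NCCL watchdog timeout while debugging.
# The Timeout(ms)=600000 in the log corresponds to the default 10-minute timeout.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # default is timedelta(minutes=10)
)
```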
I found that some "time" values are less than zero in

refined_narration_stream_train.json
from videollm-online-chat-ego4d-134k which can cause the issue.Is the "time" value can be less than zero?
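For reference, this is the quick check I used to locate the offending entries. It is a hypothetical snippet (not part of the dataset tooling) that recursively scans the JSON for any negative number stored under a "time" key, so it makes no assumption about the exact nesting of the annotations:

```python
# Sketch only: report every negative "time" value in the refined annotations.
import json

def find_negative_times(node, path="$"):
    """Yield (json_path, value) for every "time" key holding a negative number."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "time" and isinstance(value, (int, float)) and value < 0:
                yield f"{path}.{key}", value
            else:
                yield from find_negative_times(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_negative_times(item, f"{path}[{i}]")

with open("refined_narration_stream_train.json") as f:
    data = json.load(f)

for json_path, value in find_negative_times(data):
    print(json_path, value)
```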