[rank7]:[E1218 10:45:07.218557395 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=453, OpType=ALLREDUCE, NumelIn=113152000, NumelOut=113152000, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
[rank7]:[E1218 10:45:07.218835769 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 453, last enqueued NCCL work: 453, last completed NCCL work: 452.
[rank7]:[E1218 10:45:07.218872140 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 7] Timeout at NCCL work: 453, last enqueued NCCL work: 453, last completed NCCL work: 452.
[rank7]:[E1218 10:45:07.218890390 ProcessGroupNCCL.cpp:630] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E1218 10:45:07.218903480 ProcessGroupNCCL.cpp:636] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1218 10:45:07.219800514 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=453, OpType=ALLREDUCE, NumelIn=113152000, NumelOut=113152000, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[rank7]:[E1218 10:45:07.219938677 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=453, OpType=ALLREDUCE, NumelIn=113152000, NumelOut=113152000, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x780ad52bf446 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x780a8a3bea92 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x780a8a3c5ed3 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x780a8a3c793d in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x780ad54265c0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x780ad7744ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x780ad77d5bf4 in /lib/x86_64-linux-gnu/libc.so.6)
I found that some "time" values are less than zero in refined_narration_stream_train.json from videollm-online-chat-ego4d-134k which can cause the issue.
Is the "time" value can be less than zero?
The text was updated successfully, but these errors were encountered:
kkjh0723 changed the title from "Cannot fine narration_stream_train.json for Narration Refinement" to "Cannot find narration_stream_train.json for Narration Refinement" on Dec 19, 2024.
I'm attempting to train videollm-online, so I am following the preprocessing steps.
For the narration refinement step (https://github.com/showlab/videollm-online/tree/main/data/preprocess#narration-refinement), I downloaded the full Ego4D dataset but cannot find the narration_stream_{split}.json file inside the annotation folder.
Where can I find the file?
I also tried to use the refined file from https://huggingface.co/datasets/chenjoya/videollm-online-chat-ego4d-134k,
but the training was killed by an unknown exception: the NCCL watchdog timeout shown in the log above.
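In case it helps with debugging, here is a minimal sketch (my own, not taken from the videollm-online training scripts) of how the NCCL collective timeout seen in the log could be raised via the standard torch.distributed API. A longer timeout does not fix the underlying data, but it can help distinguish a slow collective from a genuine deadlock:

```python
# Sketch only: raise the NCCL watchdog timeout while debugging.
# The Timeout(ms)=600000 in the log corresponds to the default 10-minute timeout.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # default is timedelta(minutes=10)
)
```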
I found that some "time" values are less than zero in

refined_narration_stream_train.json
from videollm-online-chat-ego4d-134k which can cause the issue.Is the "time" value can be less than zero?
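For reference, this is the quick check I used to locate the offending entries. It is a hypothetical snippet (not part of the dataset tooling) that recursively scans the JSON for any negative number stored under a "time" key, so it makes no assumption about the exact nesting of the annotations:

```python
# Sketch only: report every negative "time" value in the refined annotations.
import json

def find_negative_times(node, path="$"):
    """Yield (json_path, value) for every "time" key holding a negative number."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "time" and isinstance(value, (int, float)) and value < 0:
                yield f"{path}.{key}", value
            else:
                yield from find_negative_times(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_negative_times(item, f"{path}[{i}]")

with open("refined_narration_stream_train.json") as f:
    data = json.load(f)

for json_path, value in find_negative_times(data):
    print(json_path, value)
```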