
Cannot find narration_stream_train.json for Narration Refinement #47

Open
kkjh0723 opened this issue Dec 19, 2024 · 0 comments

I'm attempting to train videollm-online, so I'm following the preprocessing steps.
For the narration refinement step (https://github.com/showlab/videollm-online/tree/main/data/preprocess#narration-refinement), I downloaded the full Ego4D dataset, but I cannot find the narration_stream_{split}.json file inside the annotation folder.
Where can I find this file?

I also tried using the refined file from https://huggingface.co/datasets/chenjoya/videollm-online-chat-ego4d-134k,
but the training was killed by an unknown exception, as shown below.

[rank7]:[E1218 10:45:07.218557395 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=453, OpType=ALLREDUCE, NumelIn=113152000, NumelOut=113152000, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
[rank7]:[E1218 10:45:07.218835769 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 453, last enqueued NCCL work: 453, last completed NCCL work: 452.
[rank7]:[E1218 10:45:07.218872140 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 7] Timeout at NCCL work: 453, last enqueued NCCL work: 453, last completed NCCL work: 452.
[rank7]:[E1218 10:45:07.218890390 ProcessGroupNCCL.cpp:630] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E1218 10:45:07.218903480 ProcessGroupNCCL.cpp:636] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1218 10:45:07.219800514 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=453, OpType=ALLREDUCE, NumelIn=113152000, NumelOut=113152000, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[rank7]:[E1218 10:45:07.219938677 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=453, OpType=ALLREDUCE, NumelIn=113152000, NumelOut=113152000, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x780ad52bf446 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x780a8a3bea92 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x780a8a3c5ed3 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x780a8a3c793d in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x780ad54265c0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x780ad7744ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x780ad77d5bf4 in /lib/x86_64-linux-gnu/libc.so.6)

I found that some "time" values in refined_narration_stream_train.json from videollm-online-chat-ego4d-134k are less than zero, which may be causing the issue.
Can the "time" value be less than zero?
[screenshot: entries from refined_narration_stream_train.json showing negative "time" values]
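
For reference, this is roughly how I checked for the negative values. It is only a minimal sketch: it assumes the JSON is either a flat list of entries or a dict mapping each video uid to a list of entries, each entry being a dict with a numeric "time" field; I have not verified this against the official schema.

```python
# Sketch: count negative "time" values in the refined narration file.
# Assumption (not verified): the JSON is either a list of entries, or a dict
# mapping video uid -> list of entries, where each entry has a "time" field.
import json

path = "refined_narration_stream_train.json"  # adjust to your local path

with open(path) as f:
    data = json.load(f)

# Flatten into a single list of entry dicts under the assumed layouts.
if isinstance(data, dict):
    entries = [e for v in data.values() if isinstance(v, list) for e in v]
else:
    entries = data

negatives = [e for e in entries if isinstance(e, dict) and e.get("time", 0) < 0]
print(f"{len(negatives)} / {len(entries)} entries have time < 0")
for e in negatives[:5]:  # show a few offending entries
    print(e)
```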
