StreamingDataset causes NCCL timeout when using multiple nodes #340
Comments
Hi! Thanks for your contribution, great first issue!
Hey @hubenjm. Did you provide …? Could you share a reproducible script or the code of your training dataset?
Hey @hubenjm. Any updates?
@tchaton Thanks for the suggestions. I am currently trying to run my code again while explicitly setting …
OK, that did not work either, so I am going to have to work on creating a simpler code example to share that replicates the problem.
To replicate the problem, you first need to run … Then, after that data is generated in S3, to submit a training job in SageMaker with e.g. 2 nodes, follow the … If you want to run the training code in your own cluster via …, or replace …
NOTE that I only replicated the error using the SageMaker training job approach above, but I don't think there's any significant difference between running it there versus on a self-managed cluster, since under the hood SageMaker will execute a very similar torchrun command to the one above. With the above code example and arguments, I got a softlock at around epoch 5 with 2 nodes. With 1 node it runs fine.
OK, another update.
My next step will be to try getting a multi-node SageMaker training job working with the same code but replacing the dataset/dataloader with the standard torch Dataset and DataLoader classes. If that doesn't work either, then I suppose this issue is moot and the problem is something else. But it would be very useful to a lot of folks in general to be able to use LitData and Lightning effectively with multi-node SageMaker training jobs.
Hey @hubenjm. Could you check the dataset length or the number of batches read on each rank? This can happen if somehow the length wasn't inferred properly and one rank gets more data.
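Something along these lines should be enough to check it (a rough sketch; `train_dataset` and `train_dataloader` are placeholders for your own objects):

```python
import torch.distributed as dist

def log_per_rank_counts(train_dataset, train_dataloader):
    # `train_dataset` / `train_dataloader` are placeholders for your own objects.
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] dataset length = {len(train_dataset)}, "
          f"batches per epoch = {len(train_dataloader)}", flush=True)
    # What matters for NCCL collectives is how many batches each rank actually yields.
    yielded = sum(1 for _ in train_dataloader)
    print(f"[rank {rank}] batches actually yielded = {yielded}", flush=True)
```

If one rank reports fewer batches than the others, the remaining ranks will block on the next collective until the NCCL timeout fires.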
Hey @hubenjm. If you are available next week, let's try to reproduce this issue together on Lightning.ai. If I can reproduce it, I can fix it.
@tchaton Sure, I will try to help out. As an update, I ran some more tests a couple of weeks ago and found the following, specific to SageMaker:
I can work on streamlining my code example more to make it easier to work with. My current guess is that the problem lies somehow in how the distributed process group is set up with StreamingDataLoader versus with the standard torch DataLoader. And maybe it has to do with some behind-the-scenes setup that SageMaker does with environment variables and in renaming the hosts to 'algo-1', 'algo-2', etc.
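One cheap way to check that guess is to dump the rendezvous-related environment variables on every process before the process group is created; a rough sketch (nothing LitData-specific, just the usual torchrun/SageMaker variables):

```python
import os
import socket

def dump_dist_env():
    # Print the distributed-setup environment on each process so the values
    # exported on every node (e.g. algo-1 vs algo-2) can be compared.
    keys = ["MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK",
            "WORLD_SIZE", "NODE_RANK", "NCCL_DEBUG"]
    print(f"[{socket.gethostname()}]", {k: os.environ.get(k) for k in keys}, flush=True)
```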
Hey @hubenjm. This happens if the number of batches isn't the same on all ranks. For the training streaming dataset, do you provide …? Yes, a reproducible example would be super helpful.
Yes, I do set …
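For reference, a minimal sketch of the kind of setup being discussed; `drop_last=True` is only an illustration of keeping the per-rank batch counts aligned, and the S3 URI is a placeholder:

```python
from litdata import StreamingDataset, StreamingDataLoader

# Placeholder S3 URI for an optimized dataset.
train_dataset = StreamingDataset(
    "s3://my-bucket/optimized/train",
    shuffle=True,
    drop_last=True,  # drop partial batches so every rank iterates the same number of times
)
train_loader = StreamingDataLoader(train_dataset, batch_size=32, num_workers=4)
```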
litdata_multinode_example_code.tar.gz
From README.md in the attached .tar.gz:
Overview
This code is intended to test the ability to run distributed (DDP) training jobs in SageMaker with multiple nodes using PyTorch Lightning, with or without LitData StreamingDataset as the data source.
Instructions
Thanks @hubenjm. Multi-node on Lightning.AI is much simpler and cheaper than SageMaker; you should give it a try. It also supports fault tolerance with automatic restart. Here are the docs: https://lightning.ai/docs/overview/train-models/multi-node-training. I will try to find some time to look into this. Thanks.
I have the same issue. Is there any fix available for this?
Hi Pavan,
Could you help us with a minimal reproducible script or Lightning Studio?
Also, could you describe the scenario a bit?
I tried reproducing it once but wasn't successful.
Hi Bhimraj, sure. I am trying to pre-train a VLM model. I created an optimized dataset using litdata on EC2; I had to do it in chunks and merge them later. Then I am using SageMaker with p5 instances to train the model, with either FSDP or DeepSpeed (either is fine). The data is mounted at /fsx on the cluster, and I am using a streaming dataset and dataloader. When I select more than one node, I get an NCCL timeout at the ALL_GATHER operation after some training steps, mostly waiting for data, I think. When I change the NCCL_TIMEOUT variable from the default value, it just stays stuck. I also observed that it always happened at a particular index; slightly different if I change the batch size, but always similar. I will try to get a sample script.
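For context, the streaming side of it looks roughly like this (a simplified sketch with placeholder paths, assuming litdata accepts the local FSx mount path directly):

```python
from litdata import StreamingDataset, StreamingDataLoader

# Placeholder path to the merged, optimized dataset on the FSx mount.
dataset = StreamingDataset("/fsx/pretrain/optimized")
loader = StreamingDataLoader(dataset, batch_size=8, num_workers=8, drop_last=True)

for batch in loader:
    ...  # hand the batch to the FSDP/DeepSpeed training step
```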
Thank you @kandapagari
Hey @kandapagari. Would you be free for a debugging call? Could you share access to the dataset?
Sure, I can join a call. You can reach me through my email. The dataset itself is on FSx, which I think I cannot give access to directly. I'll see if I can push it into an S3 bucket. Thank you.
Hey @kandapagari. My email is [email protected]. Send me an invitation so we can look into this. Also, you should try the Lightning.AI platform; this would make your life much simpler.
Hey @tchaton, sorry for the late reply. By the way, we figured out the reason for the timeout. When processing the data using compute, the data chunks sometimes get corrupted and cannot be read. When this happens and we try to load that data with a streaming dataloader (even with a timeout set), the timeout doesn't fire and one GPU keeps trying to load the data, while the other GPUs wait for it, which eventually causes the NCCL error. We solved it by reading all the data offline (on a single machine), removing the corrupted chunks (by dropping them from the index.json file), and then loading the data again.
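For anyone hitting the same thing, a rough sketch of one way to do such an offline check; the index.json field names ("chunks", "filename", "chunk_bytes") are assumptions about litdata's on-disk format, so verify them against your own file first:

```python
import json
import os

def drop_bad_chunks(dataset_dir: str) -> None:
    """Drop chunk entries whose files are missing or truncated (size mismatch)."""
    index_path = os.path.join(dataset_dir, "index.json")
    with open(index_path) as f:
        index = json.load(f)

    kept, dropped = [], []
    for chunk in index.get("chunks", []):  # field names are assumptions
        path = os.path.join(dataset_dir, chunk["filename"])
        expected = chunk.get("chunk_bytes")
        ok = os.path.isfile(path) and (expected is None or os.path.getsize(path) == expected)
        (kept if ok else dropped).append(chunk)

    if dropped:
        index["chunks"] = kept
        with open(index_path, "w") as f:
            json.dump(index, f)
    print(f"dropped {len(dropped)} chunk(s)")
```

A size check only catches missing or truncated files; the more thorough version is what was described above, i.e. actually reading every sample once on a single machine and dropping any chunk that fails to read.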
I don't think this is what caused my problem, because my example code posted here worked for a single node just fine. |
Hey @kandapagari. Oh wow, that's super interesting. Any ideas what could have caused the corruption? I have never seen this before.
Hey @hubenjm. Could you try the latest version of LitData? We fixed a few things and I am curious whether this still happens.
🐛 Bug
I'm running a training job with 2 nodes in SageMaker, using torchrun to launch. I'm using a CombinedStreamingDataset for the training dataset with train_weight_factors = [0.8, 0.07, 0.07, 0.07]. The training stops printing log messages after some fixed number of batches (depending on the random seed, I guess); where the training stops is deterministic if the seed is fixed, based on my experiments. Then the NCCL timeout triggers an exception after 30 minutes. The training code works fine on a single node, though.
To Reproduce
Use CombinedStreamingDataset for the training dataset with train_weight_factors not None and iterate_over_all = False. Launch training with torchrun with num_nodes > 1.
Code sample
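(The full example is in the tar.gz attached in the comments; below is only a minimal sketch of the configuration described above, with placeholder S3 URIs.)

```python
from litdata import CombinedStreamingDataset, StreamingDataLoader, StreamingDataset

train_weight_factors = [0.8, 0.07, 0.07, 0.07]

# Placeholder S3 URIs for the four optimized datasets.
datasets = [
    StreamingDataset(f"s3://my-bucket/optimized/train-{i}", shuffle=True)
    for i in range(4)
]

combined = CombinedStreamingDataset(
    datasets=datasets,
    weights=train_weight_factors,
    iterate_over_all=False,
    seed=42,
)
train_loader = StreamingDataLoader(combined, batch_size=32, num_workers=4)
```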
Expected behavior
Training should not softlock in the middle of an epoch.
Environment
How installed (conda, pip, source): SageMaker prebuilt deep learning container (763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker; see https://github.com/aws/deep-learning-containers/blob/master/available_images.md)
Additional context
If you have any other suggestions about why multi-node training with CombinedDataset would fail like this, any help is appreciated.