torch.distributed.DistBackendError: NCCL error #715
Comments
Hi. This appears to be the first error:
It would be helpful to get a full log with |
Thanks for your time. The following is what is shown after setting the above two environment variables:
Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.11/site-packages (from google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny==2.17.0->mlflow>=2.8->sagemaker-mlflow->sagemaker->-r requirements.txt (line 8)) (0.4.1)
2024-11-28T07:41:34.516Z | Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.11/site-packages (from google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny==2.17.0->mlflow>=2.8->sagemaker-mlflow->sagemaker->-r requirements.txt (line 8)) (4.7.2)
Thanks for providing the logs with
Would you be able to share the ulimits ( |
Hi @AvivBenchorin, thanks for the answer. I added ulimit -a to the code and ran my training twice on 2 ml.p4d.24xlarge instances as SageMaker training jobs; one run succeeded and one failed. The logs are as follows:
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
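As a complement to calling ulimit -a from the launch script, the limits can also be recorded from inside each training process, so that every rank reports what it actually inherited. The following is a minimal sketch using Python's standard `resource` module (Unix only). The choice of which limits to print is an assumption for illustration, not a list taken from this thread, and the helper names `fmt` and `describe_limits` are hypothetical.

```python
import resource

# Limits chosen for illustration (an assumption, not taken from this thread).
LIMITS = {
    "max locked memory (RLIMIT_MEMLOCK)": resource.RLIMIT_MEMLOCK,
    "open files (RLIMIT_NOFILE)": resource.RLIMIT_NOFILE,
    "stack size (RLIMIT_STACK)": resource.RLIMIT_STACK,
}


def fmt(value):
    """Render a limit value the way ulimit does for unbounded limits."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def describe_limits():
    """Return one line per limit with the soft and hard values of this process."""
    lines = []
    for label, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        lines.append(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
    return lines


if __name__ == "__main__":
    # Hypothetical usage: call this early in the training script, before
    # the process group is initialized, so the values appear in each rank's log.
    print("\n".join(describe_limits()))
```

Printing from within the process avoids the case where the shell that ran ulimit -a and the Python workers started by torchrun see different limits.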
Hi. Since the ulimit settings look OK, here are some suggestions:
I ran into a rather quirky issue. I used 2 p4d.24xlarge (8xA100) instances on AWS to train my model. The bash script first downloads the data, and only once the download has finished does the training process start, by running
torchrun ${DISTRIBUTED_ARGS} ${WORKING_DIR}/dlrm_main.py --print_sharding_plan --model_type dnn \
--epochs 1 --embedding_dim 16 --batch_size 8192 --learning_rate 0.006 --adagrad --num_embeddings 1000000000 \
--binary_path $binary_path --training_days 14 --valid_hour 23/00 \
--test_hour 23/00 --num_workers 4 --prefetch_factor 8 --save_dir $SM_WORKING_DIR
When the data downloading process takes more than 20 min, the training fails with the following error:
2024-11-15T06:01:13.169Z
Traceback (most recent call last):
  File "/opt/ml/code/dlrm_main.py", line 954, in <module>
    invoke_main()
  File "/opt/ml/code/dlrm_main.py", line 951, in invoke_main
    main(sys.argv[1:])
  File "/opt/ml/code/dlrm_main.py", line 760, in main
    dist.init_process_group(backend=backend, timeout=timeout_duration, device_id=device)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1527, in init_process_group
    default_pg, _ = _new_process_group_helper(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1867, in _new_process_group_helper
    eager_backend.eager_connect_single_device(device_id)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
ncclInternalError: Internal check failed.
Last error:
NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument

2024-11-15T06:01:13.169Z
Traceback (most recent call last):
  File "/opt/ml/code/dlrm_main.py", line 954, in <module>
    invoke_main()
  File "/opt/ml/code/dlrm_main.py", line 951, in invoke_main
    main(sys.argv[1:])
  File "/opt/ml/code/dlrm_main.py", line 760, in main
    dist.init_process_group(backend=backend, timeout=timeout_duration, device_id=device)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1527, in init_process_group
    default_pg, _ = _new_process_group_helper(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1867, in _new_process_group_helper
    eager_backend.eager_connect_single_device(device_id)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
ncclInternalError: Internal check failed.
Last error:
NET/OFI Error accessing endpoint. Endpoint has not been initialized.
torch version: 2.5.0
cuda version: 12.4
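Because several processes write to the same log stream, the most specific part of each failure (the line after "Last error:") is easy to lose among the interleaved frames. As a rough sketch, assuming the log has the shape shown above (optional timestamp-only lines in ISO format, and a "Last error:" marker followed by the message), the messages can be pulled out like this. The function name `nccl_last_errors` is hypothetical.

```python
import re

# Matches lines that contain only a timestamp such as 2024-11-15T06:01:13.170Z.
TIMESTAMP_ONLY = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z$")


def nccl_last_errors(log_text):
    """Return the message that follows each 'Last error:' marker in the log."""
    lines = [line.strip() for line in log_text.splitlines()]
    # Drop blank lines and timestamp-only lines so a marker and its message are adjacent.
    lines = [line for line in lines if line and not TIMESTAMP_ONLY.match(line)]
    return [
        lines[i + 1]
        for i, line in enumerate(lines[:-1])
        if line == "Last error:"
    ]


if __name__ == "__main__":
    sample = """\
2024-11-15T06:01:13.170Z
ncclInternalError: Internal check failed.
2024-11-15T06:01:13.170Z
Last error:
2024-11-15T06:01:13.170Z
NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument
2024-11-15T06:01:13.170Z
Last error:
2024-11-15T06:01:13.170Z
NET/OFI Error accessing endpoint. Endpoint has not been initialized.
"""
    for message in nccl_last_errors(sample):
        print(message)
```

This only filters text; it does not attribute a message to a particular rank, which would need a rank prefix in the log lines.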
It seems that communication between GPUs on different nodes fails once roughly 20 minutes or more have passed, counting all initialization time. I also tested with less data (so that downloading takes less than 20 min), and the training had no problem. A single node with more data also has no problem. Please help, thanks a lot!
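One way to check the 20-minute hypothesis is to log how long each stage takes, so that failed and successful runs can be compared on elapsed time rather than on data volume. The following is a minimal sketch using only the standard library; the helper name `timed_stage` and the stage labels are hypothetical, and wrapping `dist.init_process_group` this way is a suggestion rather than something taken from the script.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(name, log=print):
    """Log the wall-clock duration of the enclosed block, even if it raises."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        log(f"[stage] {name} took {elapsed:.1f}s")


if __name__ == "__main__":
    # Hypothetical usage inside the training script:
    #
    #   with timed_stage("init_process_group"):
    #       dist.init_process_group(backend=backend, timeout=timeout_duration, device_id=device)
    #
    # Here a short sleep stands in for the real work.
    with timed_stage("example stage"):
        time.sleep(0.1)
```

Because the duration is logged in a `finally` clause, the elapsed time is still reported when the wrapped call fails, which is the case of interest here.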