RuntimeError: [2] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
#4 · Open · dempsey-wen opened this issue on Dec 15, 2023 · 4 comments
Hello, Jerry Sun. Thank you for sharing your implementation of DDP training for CrossPoint.
When I ran the training, I hit the following issue:
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: [3] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
It seems that the processes failed to communicate with each other when the allgather was executed.
I was training the model on a server with four Nvidia 2080 Ti GPUs.
Running environment: Ubuntu 18.04, Nvidia driver 525.89.02, CUDA 10.2.
Here are the parser settings:
Namespace(backend='nccl', batch_size=1024, class_choice=None, dropout=0.5, emb_dims=1024, epochs=250, eval=False, exp_name='exp', ft_dataset='ModelNet40', gpu_id=0, img_model_path='', k=20, lr=0.001, master_addr='localhost', master_port='12355', model='dgcnn', model_path='', momentum=0.9, no_cuda=False, num_classes=40, num_ft_points=1024, num_pt_points=2048, num_workers=32, print_freq=200, rank=-1, resume=False, save_freq=50, scheduler='cos', seed=1, test_batch_size=16, use_sgd=False, wb_key='local-e6f***', wb_url='http://localhost:28282', world_size=4)
Here is what I have tried so far to diagnose the problem:
To figure out why the processes failed to communicate, I monitored the system with htop and nvidia-smi.
They showed that only GPU 0 was doing any work while the other three were idle, even though the program had allocated memory on all four GPUs. I suppose the model was copied to all 4 GPUs, but no data was sent to GPUs 1, 2, and 3, so the master process could not get a response from the other processes.
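To double-check that hypothesis, a small per-rank diagnostic like the sketch below (illustrative only, not code from the CrossPoint repo) would confirm which CUDA device each process is actually driving:

```python
# Illustrative diagnostic: print which CUDA device each DDP process actually uses.
# Intended to run right after init_process_group(); names are not from the CrossPoint code.
import os
import torch
import torch.distributed as dist

def report_rank_device():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # If every rank reports current_device=0, all processes are driving GPU 0,
    # which would match the htop/nvidia-smi observation above.
    print(f"rank {rank}/{world_size}: "
          f"current_device={torch.cuda.current_device()}, "
          f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', '<unset>')}",
          flush=True)
```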
Could you provide any ideas about how to fix the problem?
Thank you for your time! ;)
Do you have the same environment settings as mine? I listed my environment settings, such as the CUDA and PyTorch versions, in the README.md.
I haven't encountered your problem, so I am not sure what is causing the bug. I suggest reproducing the experiments with my settings. Also make sure that ports 12355 and 28282 are not used by other processes, since the experiments use these ports.
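A quick throwaway check (not part of the repo) is to try binding the two ports before launching; 28282 is the local wandb server from wb_url, so it may legitimately be in use once wandb is running:

```python
# Throwaway check: see whether the rendezvous port (12355) and the local wandb
# port (28282) can be bound on this machine. Not part of the CrossPoint code.
import socket

for port in (12355, 28282):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind(("localhost", port))
            print(f"port {port} is free")
        except OSError:
            print(f"port {port} is already in use")
```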
Yes, I created a new conda env named CrossPoint and ran:
pip install torch==1.11.0+cu102 torchvision==0.12.0+cu102 --extra-index-url https://download.pytorch.org/whl/cu102
pip install -r requirements.txt
Here are the packages installed in the CrossPoint env:
cudatoolkit 10.2.89
python 3.7.13
pytorch 1.11.0
torch 1.11.0+cu102
torchvision 0.12.0+cu102
I believe port 28282 is not occupied by another program, since I can access the wandb dashboard. I also tried changing the master port to 12366, but the same issue remained.
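For context, a generic env-var rendezvous setup looks roughly like the sketch below; this is not the repo's exact code, and the default values just mirror the parser settings in the issue above:

```python
# Generic sketch of how master_addr/master_port typically feed the NCCL rendezvous.
# Not the CrossPoint repo's exact code; defaults mirror the Namespace in the issue above.
import os
import torch
import torch.distributed as dist

def setup(rank: int, world_size: int,
          master_addr: str = "localhost", master_port: str = "12355"):
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port
    # Every rank must reach this call with the same addr/port pair; otherwise the
    # c10d key-value store lookups time out exactly as in the traceback above.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    # Pin each process to its own GPU so collectives do not all land on GPU 0.
    torch.cuda.set_device(rank)
```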
Here is how I solved the problem:
I hit the issue right when the training of epoch 0 finished.
Since the error was reported to be raised while executing all_gather_object(), I tested that function at the very beginning of my code and received a different error message: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755861072/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, invalid usage, NCCL version 21.0.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
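A minimal version of that smoke test looks like the sketch below (illustrative, not my exact code):

```python
# Minimal all_gather_object smoke test, run right after init_process_group().
# Illustrative sketch only, not the exact code from my experiment.
import torch.distributed as dist

def smoke_test_all_gather(rank: int, world_size: int):
    gathered = [None for _ in range(world_size)]
    # With the NCCL backend, all_gather_object moves its serialized payload onto the
    # current CUDA device, so each rank must already be set to its own GPU
    # (torch.cuda.set_device) before this call.
    dist.all_gather_object(gathered, {"rank": rank})
    if rank == 0:
        print("gathered:", gathered)
```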
The new error message helped me find the solution.