
RuntimeError: [2] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout #4

dempsey-wen opened this issue Dec 15, 2023 · 4 comments

Comments

@dempsey-wen

Hello, Jerry Sun. Thank you for sharing your implementation of DDP training for CrossPoint.

While running the training, I hit this issue:
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: [3] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout

It seems that the processes failed to communicate with each other when 'allgather' was executed.

Here are the Parser settings:
Namespace(backend='nccl', batch_size=1024, class_choice=None, dropout=0.5, emb_dims=1024, epochs=250, eval=False, exp_name='exp', ft_dataset='ModelNet40', gpu_id=0, img_model_path='', k=20, lr=0.001, master_addr='localhost', master_port='12355', model='dgcnn', model_path='', momentum=0.9, no_cuda=False, num_classes=40, num_ft_points=1024, num_pt_points=2048, num_workers=32, print_freq=200, rank=-1, resume=False, save_freq=50, scheduler='cos', seed=1, test_batch_size=16, use_sgd=False, wb_key='local-e6f***', wb_url='http://localhost:28282', world_size=4)

I was training the model on a server with 4 Nvidia 2080 Ti GPUs.
Running environment: Ubuntu 18.04, Nvidia driver 525.89.02, CUDA 10.2.

Here is what I tried in order to track down the problem:
To figure out why the processes failed to communicate, I monitored the system status with htop and nvidia-smi.

They showed that only GPU 0 was doing any work while the rest were idle, even though the program occupied memory on all four GPUs. I suppose the model was copied to all 4 GPUs, but no data was fed to GPUs 1, 2, and 3, so the master process could not get a response from the other processes.
[screenshots: htop and nvidia-smi showing only GPU 0 busy while all four GPUs hold memory]
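
For context, the communication path that I think is failing can be exercised in isolation with a bare-bones all_gather script roughly like the one below (just a sketch I put together; the port and world size are taken from the parser settings above, the rest is assumed):

    # Standalone NCCL all_gather smoke test: each of the 4 processes places its
    # rank on its own GPU and gathers the values from every other rank.
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def run(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        tensor = torch.full((1,), float(rank), device=f"cuda:{rank}")
        gathered = [torch.zeros(1, device=f"cuda:{rank}") for _ in range(world_size)]
        dist.all_gather(gathered, tensor)   # hangs or times out if the ranks cannot talk
        print(f"rank {rank} gathered {[t.item() for t in gathered]}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(run, args=(4,), nprocs=4)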

Could you provide any ideas about how to fix the problem?

Thank you for your time! ;)

@auniquesun
Owner

auniquesun commented Dec 19, 2023

Do you have the same environment settings as mine? I list my environment settings in the README.md, such as the CUDA and PyTorch versions.

I haven't encountered your problem, so I am not sure about the cause of the bug. I suggest you reproduce the experiments with my settings. Also ensure that ports 12355 and 28282 are not used by other processes, since you use these ports in the experiments.
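
A quick way to check this, just as a rough sketch in plain Python (nothing specific to this repo): try binding the two ports; the bind only succeeds if no other process is already listening on them.

    import socket

    for port in (12355, 28282):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("localhost", port))
                print(f"port {port} looks free")
            except OSError:
                print(f"port {port} is already in use")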

@dempsey-wen
Author

Yes, I created a new conda env CrossPoint and ran:
pip install torch==1.11.0+cu102 torchvision==0.12.0+cu102 --extra-index-url https://download.pytorch.org/whl/cu102
pip install -r requirements.txt

Here are the packages installed in CrossPoint:
cudatoolkit 10.2.89
python 3.7.13
pytorch 1.11.0
torch 1.11.0+cu102
torchvision 0.12.0+cu102

I believe 28282 is not occupied by other programs, as I can access the wandb dashboard. I also tried changing the master port to 12366, and the same issue remained.

@dempsey-wen
Author

Update:
The problem is solved by adding the following settings before DDP initialization (dist.init_process_group()):

    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0, 1, 2, 3")
    torch.cuda.set_device(rank)

Here is how I solved the problem:
I hit the issue right after the training of epoch 0 finished, as the log shows:
[screenshot of the training log at the end of epoch 0]

Since the error was reported to be raised while executing all_gather_object(), I tested that function at the very beginning of my code and received this error message:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755861072/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, invalid usage, NCCL version 21.0.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

The new error message helped me find the solution.
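
In case it helps anyone else, here is a minimal sketch of where the fix sits (the helper names are mine, not from the repo): each spawned process pins its own GPU before dist.init_process_group(), and after that all_gather_object() worked for me.

    import os
    import torch
    import torch.distributed as dist

    def setup_ddp(rank, world_size):
        # hypothetical helper illustrating the order of the calls
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "12355")
        os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0, 1, 2, 3")
        torch.cuda.set_device(rank)   # the key line: pin this rank to its own GPU
        dist.init_process_group("nccl", rank=rank, world_size=world_size)

    def gather_ranks(rank, world_size):
        # with the device pinned, gathered objects no longer all route through GPU 0
        gathered = [None] * world_size
        dist.all_gather_object(gathered, {"rank": rank})
        return gathered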

@auniquesun
Owner

@dempsey-wen Congratulations! Great Job!
