
RuntimeError: [2] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout #4

dempsey-wen opened this issue Dec 15, 2023 · 4 comments

Comments

@dempsey-wen

Hello, Jerry Sun. Thank you for sharing your implementation of DDP training for CrossPoint.

While running the training, I hit this issue:
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: [3] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout

It seems that the processes failed to communicate with each other when 'allgather' was executed.

Here are the Parser settings:
Namespace(backend='nccl', batch_size=1024, class_choice=None, dropout=0.5, emb_dims=1024, epochs=250, eval=False, exp_name='exp', ft_dataset='ModelNet40', gpu_id=0, img_model_path='', k=20, lr=0.001, master_addr='localhost', master_port='12355', model='dgcnn', model_path='', momentum=0.9, no_cuda=False, num_classes=40, num_ft_points=1024, num_pt_points=2048, num_workers=32, print_freq=200, rank=-1, resume=False, save_freq=50, scheduler='cos', seed=1, test_batch_size=16, use_sgd=False, wb_key='local-e6f***', wb_url='http://localhost:28282', world_size=4)

I was training the model on a server with 4 Nvidia 2080 Ti GPUs.
Running environment: Ubuntu 18.04, Nvidia driver 525.89.02, CUDA 10.2.

Here is what I tried in order to track down the problem:
To figure out why the processes failed to communicate, I monitored the system status with htop and nvidia-smi.

They showed that only GPU 0 was doing any work while the rest were idle, even though the program occupied memory on all four GPUs. I suppose the model was copied to all 4 GPUs, but no data was fed to GPUs 1, 2, and 3, so the master process could not get a response from the other processes.
[screenshots: htop and nvidia-smi showing only GPU 0 busy while all four GPUs hold memory]
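
For context, the communication path that I think is failing can be exercised in isolation with a bare-bones all_gather script roughly like the one below (just a sketch I put together; the port and world size are taken from the parser settings above, the rest is assumed):

    # Standalone NCCL all_gather smoke test: each of the 4 processes places its
    # rank on its own GPU and gathers the values from every other rank.
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def run(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        tensor = torch.full((1,), float(rank), device=f"cuda:{rank}")
        gathered = [torch.zeros(1, device=f"cuda:{rank}") for _ in range(world_size)]
        dist.all_gather(gathered, tensor)   # hangs or times out if the ranks cannot talk
        print(f"rank {rank} gathered {[t.item() for t in gathered]}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(run, args=(4,), nprocs=4)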

Could you provide any ideas about how to fix the problem?

Thank you for your time! ;)

@auniquesun
Owner

auniquesun commented Dec 19, 2023

Do you have the same environment settings as mine? I list my environment settings in the README.md, such as the CUDA and PyTorch versions.

I haven't encountered your problem, so I am not sure about the cause of the bug. I suggest you reproduce the experiments with my settings. Also ensure that ports 12355 and 28282 are not used by other processes, since you use these ports in the experiments.
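
A quick way to check this, just as a rough sketch in plain Python (nothing specific to this repo): try binding the two ports; the bind only succeeds if no other process is already listening on them.

    import socket

    for port in (12355, 28282):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("localhost", port))
                print(f"port {port} looks free")
            except OSError:
                print(f"port {port} is already in use")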

@dempsey-wen
Author

Yes, I created a new conda env CrossPoint and ran:
pip install torch==1.11.0+cu102 torchvision==0.12.0+cu102 --extra-index-url https://download.pytorch.org/whl/cu102
pip install -r requirements.txt

Here are the packages installed in CrossPoint:
cudatoolkit 10.2.89
python 3.7.13
pytorch 1.11.0
torch 1.11.0+cu102
torchvision 0.12.0+cu102

I believe 28282 is not occupied by other programs, as I can access the wandb dashboard. I also tried changing the master port to 12366, and the same issue remained.

@dempsey-wen
Author

Update:
The problem is solved by adding the following settings before DDP initialization (dist.init_process_group()):

    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0, 1, 2, 3")
    torch.cuda.set_device(rank)

Here is how I solved the problem:
I hit the issue right after the training of epoch 0 finished, as the log shows:
[screenshot of the training log at the end of epoch 0]

Since the error was reported to be raised while executing all_gather_object(), I tested that function at the very beginning of my code and received this error message:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755861072/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, invalid usage, NCCL version 21.0.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

The new error message helped me find the solution.
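
In case it helps anyone else, here is a minimal sketch of where the fix sits (the helper names are mine, not from the repo): each spawned process pins its own GPU before dist.init_process_group(), and after that all_gather_object() worked for me.

    import os
    import torch
    import torch.distributed as dist

    def setup_ddp(rank, world_size):
        # hypothetical helper illustrating the order of the calls
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "12355")
        os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0, 1, 2, 3")
        torch.cuda.set_device(rank)   # the key line: pin this rank to its own GPU
        dist.init_process_group("nccl", rank=rank, world_size=world_size)

    def gather_ranks(rank, world_size):
        # with the device pinned, gathered objects no longer all route through GPU 0
        gathered = [None] * world_size
        dist.all_gather_object(gathered, {"rank": rank})
        return gathered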

@auniquesun
Owner

@dempsey-wen Congratulations! Great Job!
