
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00) #3

Open
Lzyin opened this issue Oct 12, 2023 · 5 comments

Comments

@Lzyin

Lzyin commented Oct 12, 2023

Excuse me, when I was running distributed training, the log kept printing "DEBUG SenderThread: 1236909 [sender.py: send(): 182] send: stats" and finally raised RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00). The parser settings are as follows:
parser.add_argument('--master_addr', type=str, default='localhost', help='ip of master node')
parser.add_argument('--master_port', type=str, default='12355', help='port of master node')
Do I need to change these parameters?

Thanks for your reply.

@auniquesun
Owner

Please provide a detailed description of your run, i.e. PyTorch version, CUDA version, the running script, and anything else you think is important for solving the error.

@Lzyin
Author

Lzyin commented Oct 12, 2023

> Please provide a detailed description of your run, i.e. PyTorch version, CUDA version, the running script, and anything else you think is important for solving the error.

My running environment: PyTorch 1.8.0, CUDA 11.4
Run command: CUDA_VISIBLE_DEVICES=3,4 python train_crosspoint.py --model dgcnn_seg --exp_name crosspoint_dgcnn_pt_seg --epochs 100 --lr 0.001 --batch_size 20 --print_freq 200 --k 15 --num_workers 1

I did not use pueue. Additionally, because the other devices were occupied, I only used the two graphics cards numbered 3 and 4.

The parser settings for distributed training are as follows:

distributed training on multiple GPUs

parser.add_argument('--rank', type=int, default=-1, help='the rank for current GPU or process, usually one process per GPU')
parser.add_argument('--backend', type=str, default='nccl', help='DDP communication backend')
parser.add_argument('--world_size', type=int, default=2, help='number of GPUs')
parser.add_argument('--master_addr', type=str, default='localhost', help='ip of master node')
parser.add_argument('--master_port', type=str, default='12355', help='port of master node')
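
For context, this is roughly how arguments like these end up in torch.distributed; a minimal sketch assuming the usual env:// initialization (the helper name setup_ddp and the exact wiring are my own, not necessarily what train_crosspoint.py does):

```python
import os
import datetime
import torch.distributed as dist

def setup_ddp(args, rank):
    # The master address/port are exported as environment variables so that
    # the default env:// init_method can find the rendezvous store.
    os.environ['MASTER_ADDR'] = args.master_addr   # e.g. 'localhost'
    os.environ['MASTER_PORT'] = args.master_port   # e.g. '12355'
    dist.init_process_group(
        backend=args.backend,                      # 'nccl' for GPU training
        rank=rank,                                 # unique id of this process (0..world_size-1)
        world_size=args.world_size,                # total number of processes expected
        timeout=datetime.timedelta(minutes=30),    # matches the 0:30:00 in the error message
    )
```

In PyTorch 1.8 the store-based barrier waits until worker_count equals world_size, so a worker_count of 8 with world_size=2 usually means extra or stale processes registered on the same master_addr:master_port.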

@auniquesun
Owner

auniquesun commented Oct 15, 2023

First of all, the RuntimeError may be caused by the wandb package; I use it in a docker environment. You need to specify your --wb_url and --wb_key to log in. If you run docker on the same machine as the experiment, localhost is fine for --wb_url, but for --wb_key you need to find yours in your wandb settings. Otherwise, the timeout error will be reported.
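
If it helps, logging in from code with those values would look roughly like this; a minimal sketch, assuming --wb_url and --wb_key are simply forwarded to wandb.login (which accepts host and key), with placeholder values:

```python
import wandb

# Log in to the (self-hosted) wandb server before any wandb.init() call.
# Replace the placeholders with the values you pass via --wb_url / --wb_key.
wandb.login(
    host="http://localhost:28282",   # value of --wb_url (assumed self-hosted server)
    key="YOUR_WANDB_API_KEY",        # value of --wb_key, from your wandb account settings
)
```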

In my case, I have 6 GPUs on one machine, so master_addr is localhost. Here master_port is the port used for PyTorch DDP communication; you can change it to a different one, just make sure that port is not used by other processes.
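
A quick way to check that the chosen master_port is actually free on the machine; a minimal sketch using only the standard library:

```python
import socket

def port_is_free(port, host="localhost"):
    # If we can bind to the port, nothing else is currently listening on it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print(port_is_free(12355))  # the default --master_port in this repo
```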

If you use a different number of GPUs, you can specify --world_size explicitly in the command, e.g. --world_size 2. I think CUDA_VISIBLE_DEVICES=3,4 is also necessary.

Finally, you can refer to the scripts for the pretraining (pt), finetuning (ft), classification (cls), and part segmentation (partseg) commands. I have tested them in my setting.

@Lzyin
Author

Lzyin commented Oct 15, 2023

I have successfully logged into wandb:
wandb: Currently logged in as: lzyin (use wandb login --relogin to force relogin)
wandb: wandb version 0.15.12 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.1
wandb: Syncing run crosspoint_dgcnn_pt_seg
wandb: ⭐️ View project at http://localhost:28282/lzyin/CrossPoint
wandb: 🚀 View run at http://localhost:28282/lzyin/CrossPoint/runs/1ladqjt1

I tried changing --master_addr to the --wb_url value, i.e. http://localhost:28282, and it reports ValueError: host not found: Name or service not known. I also tried commenting out the wandb code, but the problem still exists.

@auniquesun
Owner

auniquesun commented Oct 20, 2023

First, --master_addr and --wb_url should use different ports, so changing --master_addr to --wb_url is wrong.

Second, you can run the code with a single GPU to verify whether the same error still occurs.

Third, your PyTorch version and other settings are not the same as mine, so maybe the environment does not work in your case.

Finally, I suggest you try to figure out the real reason behind the error.

For example, take the error you first reported, "RuntimeError: Timed out initializing process group in store based barrier on rank: 1". Make sure the error can be reproduced when you run the experiment again. If it can, you can google related topics to figure out the real reason behind it. I think it is related to initializing the PyTorch DDP process group.
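
To isolate the problem from the rest of the training code, you could run a minimal standalone DDP test like the one below; a sketch mirroring the settings discussed in this thread (the file name ddp_smoke_test.py is my own):

```python
# ddp_smoke_test.py -- spawn two processes, initialize the process group, then exit.
import os
import datetime

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    torch.cuda.set_device(rank)  # one GPU per process (relative to CUDA_VISIBLE_DEVICES)
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=5),
    )
    print(f"rank {rank}: process group initialized")
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

Run it with the same devices, e.g. CUDA_VISIBLE_DEVICES=3,4 python ddp_smoke_test.py. If this already times out, the problem is in the environment (port, NCCL, GPU visibility) rather than in the CrossPoint code.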
