
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00) #3

Open
Lzyin opened this issue Oct 12, 2023 · 5 comments

Comments

@Lzyin

Lzyin commented Oct 12, 2023

Excuse me, when I was running distributed training, the log kept printing "DEBUG SenderThread: 1236909 [sender.py: send(): 182] send: stats" and finally raised RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00). The parser settings are as follows:
parser.add_argument('--master_addr', type=str, default='localhost', help='ip of master node')
parser.add_argument('--master_port', type=str, default='12355', help='port of master node')
Do I need to change these parameters?

Thanks for your reply.

@auniquesun
Owner

Please provide a detailed description of your run, i.e. PyTorch version, CUDA version, the running script, and anything else you think is important for solving the error.

@Lzyin
Author

Lzyin commented Oct 12, 2023

> Please provide a detailed description of your run, i.e. PyTorch version, CUDA version, the running script, and anything else you think is important for solving the error.

My running environment: PyTorch 1.8.0, CUDA 11.4
Run command: CUDA_VISIBLE_DEVICES=3,4 python train_crosspoint.py --model dgcnn_seg --exp_name crosspoint_dgcnn_pt_seg --epochs 100 --lr 0.001 --batch_size 20 --print_freq 200 --k 15 --num_workers 1

I did not use pueue. Additionally, because the other devices were occupied, I only used the two graphics cards numbered 3 and 4.

The parser settings for distributed training are as follows:

distributed training on multiple GPUs

parser.add_argument('--rank', type=int, default=-1, help='the rank for current GPU or process, usually one process per GPU')
parser.add_argument('--backend', type=str, default='nccl', help='DDP communication backend')
parser.add_argument('--world_size', type=int, default=2, help='number of GPUs')
parser.add_argument('--master_addr', type=str, default='localhost', help='ip of master node')
parser.add_argument('--master_port', type=str, default='12355', help='port of master node')
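
For context, this is roughly how arguments like these end up in torch.distributed; a minimal sketch assuming the usual env:// initialization (the helper name setup_ddp and the exact wiring are my own, not necessarily what train_crosspoint.py does):

```python
import os
import datetime
import torch.distributed as dist

def setup_ddp(args, rank):
    # The master address/port are exported as environment variables so that
    # the default env:// init_method can find the rendezvous store.
    os.environ['MASTER_ADDR'] = args.master_addr   # e.g. 'localhost'
    os.environ['MASTER_PORT'] = args.master_port   # e.g. '12355'
    dist.init_process_group(
        backend=args.backend,                      # 'nccl' for GPU training
        rank=rank,                                 # unique id of this process (0..world_size-1)
        world_size=args.world_size,                # total number of processes expected
        timeout=datetime.timedelta(minutes=30),    # matches the 0:30:00 in the error message
    )
```

In PyTorch 1.8 the store-based barrier waits until worker_count equals world_size, so a worker_count of 8 with world_size=2 usually means extra or stale processes registered on the same master_addr:master_port.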

@auniquesun
Owner

auniquesun commented Oct 15, 2023

First of all, the RuntimeError may be caused by the wandb package; I use it in a docker environment. You need to specify your --wb_url and --wb_key to log in. If you run docker on the same machine as the experiment, localhost is fine for --wb_url, but for --wb_key you need to find yours in your wandb settings. Otherwise, the timeout error will be reported.
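
If it helps, logging in from code with those values would look roughly like this; a minimal sketch, assuming --wb_url and --wb_key are simply forwarded to wandb.login (which accepts host and key), with placeholder values:

```python
import wandb

# Log in to the (self-hosted) wandb server before any wandb.init() call.
# Replace the placeholders with the values you pass via --wb_url / --wb_key.
wandb.login(
    host="http://localhost:28282",   # value of --wb_url (assumed self-hosted server)
    key="YOUR_WANDB_API_KEY",        # value of --wb_key, from your wandb account settings
)
```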

In my case, I have 6 GPUs on one machine, so master_addr is localhost. Here master_port is the port used for PyTorch DDP communication; you can change it to a different one, just make sure that port is not used by other processes.
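
A quick way to check that the chosen master_port is actually free on the machine; a minimal sketch using only the standard library:

```python
import socket

def port_is_free(port, host="localhost"):
    # If we can bind to the port, nothing else is currently listening on it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print(port_is_free(12355))  # the default --master_port in this repo
```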

If you use a different number of GPUs, you can specify --world_size explicitly in the command, e.g. --world_size 2. I think CUDA_VISIBLE_DEVICES=3,4 is also necessary.

Finally, you can refer to the scripts for the pretraining (pt), finetuning (ft), classification (cls), and part segmentation (partseg) commands. I have tested them in my setting.

@Lzyin
Author

Lzyin commented Oct 15, 2023

I have successfully logged into wandb:
wandb: Currently logged in as: lzyin (use wandb login --relogin to force relogin)
wandb: wandb version 0.15.12 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.1
wandb: Syncing run crosspoint_dgcnn_pt_seg
wandb: ⭐️ View project at http://localhost:28282/lzyin/CrossPoint
wandb: 🚀 View run at http://localhost:28282/lzyin/CrossPoint/runs/1ladqjt1

I tried changing --master_addr to the --wb_url value, i.e. http://localhost:28282, and it reports ValueError: host not found: Name or service not known. I also tried commenting out the wandb code, but the problem still exists.

@auniquesun
Owner

auniquesun commented Oct 20, 2023

First, --master_addr and --wb_url should use different ports, so changing --master_addr to --wb_url is wrong.

Second, you can run the code with a single GPU to verify whether the same error still occurs.

Third, your PyTorch version and other settings are not the same as mine, so maybe the environment does not work in your case.

Finally, I suggest you try to figure out the real reason behind the error.

For example, take the error you first reported, "RuntimeError: Timed out initializing process group in store based barrier on rank: 1". Make sure the error can be reproduced when you run the experiment again. If it can, you can google related topics to figure out the real reason behind it. I think it is related to initializing the PyTorch DDP process group.
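
To isolate the problem from the rest of the training code, you could run a minimal standalone DDP test like the one below; a sketch mirroring the settings discussed in this thread (the file name ddp_smoke_test.py is my own):

```python
# ddp_smoke_test.py -- spawn two processes, initialize the process group, then exit.
import os
import datetime

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    torch.cuda.set_device(rank)  # one GPU per process (relative to CUDA_VISIBLE_DEVICES)
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=5),
    )
    print(f"rank {rank}: process group initialized")
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

Run it with the same devices, e.g. CUDA_VISIBLE_DEVICES=3,4 python ddp_smoke_test.py. If this already times out, the problem is in the environment (port, NCCL, GPU visibility) rather than in the CrossPoint code.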
