RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00) #3
Please provide a detailed description of your setup, i.e. PyTorch version, CUDA version, running script, and anything else you think is important for solving the error.
My running environment: PyTorch 1.8.0, CUDA 11.4. I did not use pueue. Additionally, because the other devices were occupied, I only used the two graphics cards numbered 3 and 4. The parser settings for distributed training are as follows:
# distributed training on multiple GPUs
parser.add_argument('--rank', type=int, default=-1, help='the rank for current GPU or process, '
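For reference, a minimal sketch (not this repo's actual code) of restricting training to cards 3 and 4: hiding the other devices before CUDA is initialized makes the two remaining GPUs appear as cuda:0 and cuda:1, so the per-process rank can be used directly as the local device index.

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"  # must be set before CUDA is initialized

import torch
# On a machine where cards 3 and 4 exist, this should now report 2 visible devices,
# remapped to cuda:0 and cuda:1.
print(torch.cuda.device_count())
# Inside each worker process: torch.cuda.set_device(rank)  # rank is 0 or 1
```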
First of all, the world size has to match the number of worker processes you actually launch. For my case, I have 6 GPUs on one machine, so the world size is 6. If you use a different number of GPUs, you can specify the corresponding value. Finally, you can refer to the provided running script for the exact arguments.
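As an illustration of that matching requirement (a sketch under assumed argument names, not the repository's exact launcher): the world_size passed to init_process_group must equal the total number of processes that call it. The reported world_size=2 with worker_count=8 means eight workers registered against a barrier expecting two, which usually points to a stale rendezvous (for example a reused master port with leftover processes) or a world size that does not match the launch configuration.

```python
import torch.distributed as dist

def init_ddp(rank, world_size):
    # world_size must equal the total number of processes that will call this,
    # and each process must pass a unique rank in the range 0 .. world_size-1.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://localhost:12355",  # or "env://" with MASTER_ADDR/MASTER_PORT set
        rank=rank,
        world_size=world_size,
    )
```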
I have successfully logged into wandb. I tried changing --master_addr to --wb_url, i.e. http://localhost:28282, and it reports ValueError: host not found: Name or service not known. I also tried commenting out the wandb code, but the problem still exists.
First, the --master_addr is the address used to initialize the distributed process group; it is unrelated to wandb, so setting it to the wandb URL will not work. Second, you can run the code using one GPU to verify whether the same error still occurs. Third, your PyTorch version and other settings are not the same as mine, so the environment may not work in your case. Finally, I suggest you try to figure out the real reason behind the error. For example, take your first reported error, "RuntimeError: Timed out initializing process group in store based barrier on rank: 1". You should make sure the error can be reproduced when you run the experiment again. If it can be reproduced, you can google related topics to figure out the real reason behind it. I think it is related to initializing the PyTorch DDP process group.
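One way to isolate the process-group initialization from the rest of the training code is a minimal standalone check like the following (a sketch; it uses the gloo backend so it runs without touching CUDA, and assumes a free local port).

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def check(rank, world_size):
    # Rendezvous settings read by the default env:// init method.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29501"  # pick any port that is not already in use
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    print(f"rank {rank}/{world_size} initialized")
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(check, args=(world_size,), nprocs=world_size, join=True)
```

If this small script also hangs or times out, the problem is in the environment (ports, leftover processes from an earlier run, mismatched world size) rather than in the training code itself.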
Excuse me, when I was running distributed training, the log kept outputting "DEBUG SenderThread: 1236909 [sender.py:send():182] send: stats", and it finally reported a RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00). The parser settings are as follows:
parser.add_argument('--master_addr', type=str, default='localhost', help='ip of master node')
parser.add_argument('--master_port', type=str, default='12355', help='port of master node')
Do I need to change these parameters?
Thanks for your reply.
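For context on the question above: --master_addr and --master_port identify the rendezvous point for torch.distributed, not the wandb server, so they should stay a plain hostname and port (e.g. localhost and 12355), not a URL with a scheme. A hedged sketch of how such arguments are commonly wired into env:// initialization (the exact wiring in this repository may differ):

```python
import os
import torch.distributed as dist

def setup(rank, world_size, master_addr="localhost", master_port="12355"):
    # MASTER_ADDR / MASTER_PORT tell every process where the rank-0 rendezvous lives.
    os.environ["MASTER_ADDR"] = master_addr  # a hostname or IP, not a URL with a scheme
    os.environ["MASTER_PORT"] = master_port  # any free TCP port on the master node
    dist.init_process_group("nccl", init_method="env://",
                            rank=rank, world_size=world_size)
```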