
'host not found' error occurs during PyTorch distributed learning #333

Open
JGoo1 opened this issue Apr 30, 2021 · 1 comment


JGoo1 commented Apr 30, 2021

During PyTorchJob distributed training, the 'Worker' sometimes cannot find the 'Master' and fails with the message below.

Traceback (most recent call last):
  File "/workspace/src/bert/benchmark.py", line 2248, in <module>
    main()
  File "/workspace/src/bert/benchmark.py", line 2212, in main
    torch.distributed.init_process_group(backend='nccl')
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
ValueError: host not found: Name or service not known

In a PyTorchJob, the 'worker' checks the connection to the 'master' using the 'nslookup' command as shown below, but the connection between 'master' and 'worker' might not be fully ready even when the nslookup command succeeds.

 command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']

So I'm using the 'netcat' command instead of 'nslookup'.

The following example shows that the netcat test can still fail even after the nslookup test succeeds.
In my environment, netcat succeeds 4~10 seconds after nslookup does.

master address: pytorch-bert-test-g16-master-0
default port: 23456
used commands:
 - nslookup pytorch-bert-test-g16-master-0
 - nc -w 1 -z pytorch-bert-test-g16-master-0 23456

nslookup: can't resolve 'pytorch-bert-test-g16-master-0': Name does not resolve  <-- nslookup failure
nc: bad address 'pytorch-bert-test-g16-master-0'    
netcat 1   <-- netcat failure


Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local  <-- nslookup success!
netcat 1 <-- netcat failure


Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local <-- nslookup success!
netcat 1 <-- netcat failure


(tried several times...)


Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local <-- nslookup success!
netcat 0 <-- netcat success!
 

I guess there is a slight delay in Kubernetes before the virtual IP and port become fully reachable after the Service is created and its Endpoint is assigned.
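
For illustration, what 'nc -z' tests is a real TCP connection, which is the condition TCPStore actually needs. A rough Go sketch of the same probe (the hostname and port are just the placeholder values from my example above, not anything the operator provides):

// readiness_probe.go: keep retrying a TCP connection until the master's
// store port actually accepts it; a successful DNS lookup alone is not enough.
package main

import (
	"fmt"
	"net"
	"time"
)

func waitForMaster(addr, port string, interval time.Duration) {
	target := net.JoinHostPort(addr, port)
	for {
		conn, err := net.DialTimeout("tcp", target, time.Second)
		if err == nil {
			conn.Close() // port is open, master is ready
			return
		}
		fmt.Println("waiting for master:", err)
		time.Sleep(interval)
	}
}

func main() {
	// example values from the test above
	waitForMaster("pytorch-bert-test-g16-master-0", "23456", 2*time.Second)
}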

Could you please look into this issue?

Also, are there any plans to modify the code below so that the master port, as well as the master address, is passed as a parameter when creating the init container?

# pytorch-operator/pkg/controller.v1/pytorch/pod.go 
...
	if !masterRole {
		masterAddr := jobcontroller.GenGeneralName(job.Name, strings.ToLower(string(pyv1.PyTorchReplicaTypeMaster)), strconv.Itoa(0))
		err := AddInitContainerForWorkerPod(podTemplate, InitContainerParam{
			MasterAddr:         masterAddr,
			InitContainerImage: pc.initContainerImage,
		})
		if err != nil {
			return err
		}
	}
...

Right now I have to hard-code the port in my 'netcat' command, because only 'MasterAddr' is passed as a parameter when the init container is created.
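
For illustration, a rough sketch of what that change could look like. The 'MasterPort' field and the 'GetPortFromPyTorchJob' helper are hypothetical names here, not the actual operator API:

# pytorch-operator/pkg/controller.v1/pytorch/pod.go (hypothetical change)
...
	if !masterRole {
		masterAddr := jobcontroller.GenGeneralName(job.Name, strings.ToLower(string(pyv1.PyTorchReplicaTypeMaster)), strconv.Itoa(0))
		// hypothetical helper: read the master port from the job spec instead
		// of leaving it hard-coded in the init container command
		masterPort, err := GetPortFromPyTorchJob(job, pyv1.PyTorchReplicaTypeMaster)
		if err != nil {
			return err
		}
		err = AddInitContainerForWorkerPod(podTemplate, InitContainerParam{
			MasterAddr:         masterAddr,
			MasterPort:         strconv.Itoa(int(masterPort)), // new field on InitContainerParam
			InitContainerImage: pc.initContainerImage,
		})
		if err != nil {
			return err
		}
	}
...

The init container template could then probe the port directly, e.g. something like:

 command: ['sh', '-c', 'until nc -z -w 1 {{.MasterAddr}} {{.MasterPort}}; do echo waiting for master; sleep 2; done;']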

Best regards!

@gaocegege (Member) commented:

> Also, are there any plans to modify the code below so that the master port, as well as the master address, is passed as a parameter when creating the init container?

I think we should have it, thanks for the issue.

/kind feature
