
PytorchJob DDP training will stop if I delete a worker pod #364

Open
Shuai-Xie opened this issue Nov 20, 2021 · 2 comments


@Shuai-Xie

Hi, everyone.

I want to test the fault tolerance of PyTorchJob.

I started a PyTorchJob with 1 master and 3 workers.
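For context, here is a minimal sketch of the kind of PyTorchJob manifest behind this setup; the job name mnist-ddp matches the pods below, but the image and other details are assumptions, not my exact spec:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-ddp
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                     # the operator expects a container named "pytorch"
              image: example/mnist-ddp:latest   # placeholder image, not the one I used
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example/mnist-ddp:latest   # placeholder image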

$ kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP           NODE
mnist-ddp-master-0   1/1     Running   0          2m55s   11.80.0.36   11.71.1.160
mnist-ddp-worker-0   1/1     Running   0          2m55s   11.80.0.37   11.71.1.160
mnist-ddp-worker-1   1/1     Running   0          2m55s   11.80.0.38   11.71.1.160
mnist-ddp-worker-2   1/1     Running   0          89s     11.80.0.46   11.71.1.160

It trains fine.

Then I deleted a worker.

$ kubectl delete pod mnist-ddp-worker-1

Since I set restartPolicy: OnFailure, the pod restarted quickly with the same name, mnist-ddp-worker-1.

But sadly, I don't see the restarted worker rejoin the DDP training.
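To see what the restarted worker is doing, its logs and events can be followed with plain kubectl:

$ kubectl logs -f mnist-ddp-worker-1
$ kubectl describe pod mnist-ddp-worker-1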

Thanks.

@gaocegege
Member

This repository will be deprecated soon; please open an issue at github.com/kubeflow/training-operator.

@Shuai-Xie
Author

Got it, thanks @gaocegege
