
container "pytorch" is waiting to start: PodInitializing #348

gogogwwb opened this issue Aug 15, 2021 · 20 comments

@gogogwwb

gogogwwb commented Aug 15, 2021

When the master has finished running, the worker is still initializing.

[screenshot: pod status]

worker log:
Error from server (BadRequest): container "pytorch" in pod "xxx-jxosi-worker-0" is waiting to start: PodInitializing

What is the reason for this?

@gaocegege
Member

Could you please run kubectl describe on the worker pod and post the result here?

@gaocegege
Member

Can you show more of the output, especially the Events section?

@gaocegege
Member

Seems that the init container is pending. Can you show its log?

@gogogwwb
Author

init-pytorch log:
[screenshot: init-pytorch container log]

@gogogwwb
Author

master svc:

[screenshot: master Service description]

@gaocegege
Member

Can you try kubectl debug to run an ephemeral container, then run ping xxx-master-0?

@gogogwwb
Author

ping master-0 from an ephemeral container:

[screenshot: ping output from the ephemeral container]

@gogogwwb
Author

gogogwwb commented Aug 16, 2021

I ran kubectl get ep -A and found that the endpoints appeared, but then disappeared again after a while.

[screenshots: kubectl get ep output before and after the endpoints disappear]

@gaocegege
Member

It's weird.

@gogogwwb
Author

I put the program to sleep for a while and found that the worker can then run. Is there any restriction on the creation order of the service and the pods in a PyTorchJob?
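
Roughly, the workaround looks like this (a minimal sketch; train() is a placeholder for the actual non-distributed training code, and the delay value is arbitrary):

```python
import time

def train():
    # placeholder for the actual (non-distributed) training code
    pass

if __name__ == "__main__":
    train()
    # Workaround: keep the master process alive after training finishes so
    # the master Service still has endpoints while the workers' init
    # containers are trying to resolve it.
    time.sleep(300)
```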

@gogogwwb
Author

It seems that the master finishes too quickly, so the endpoints of the master service end up empty and the worker cannot obtain the IP address of the master.

@gaocegege
Member

It seems that the master finishes too quickly, so the endpoints of the master service end up empty and the worker cannot obtain the IP address of the master.

Interesting. /cc @johnugeorge

@johnugeorge
Member

But in that case, the master should not start the job until the workers are up. Are you using the distributed setup in the code?

@gogogwwb
Author

I did not use the distributed APIs in the code. After the master runs, it goes to the Completed state; kubectl get ep -n test then shows that the endpoints are none.

@gogogwwb
Author

But in that case, the master should not start the job until the workers are up. Are you using the distributed setup in the code?

If the code does not sleep, the master is in the Running state only briefly, while the worker stays in the Init state.

@johnugeorge
Member

If you are not using distributed PyTorch in the code, this can happen: the master can start executing and complete before the worker starts. Can you confirm whether you are using the distributed APIs?

@gogogwwb
Author

If you are not using distributed PyTorch in the code, this can happen: the master can start executing and complete before the worker starts. Can you confirm whether you are using the distributed APIs?

I think I don't use the distributed APIs.
The code:

[screenshot: training code]

@johnugeorge
Member

That is the issue. Is there any reason for using pytorch-operator without using the distributed version?

Example:

https://github.com/kubeflow/tf-operator/blob/1aa44a68cd364ed6e30c0841e6daf1d93a29f146/examples/pytorch/mnist/mnist.py#L72
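
For reference, a rough sketch of the relevant part (assuming the gloo backend; the operator should inject MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE into every replica, and init_process_group blocks until all ranks have joined, so the master cannot complete before the workers come up):

```python
import torch.distributed as dist

def main():
    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
    # which the operator sets on every replica of the PyTorchJob.
    dist.init_process_group(backend="gloo", init_method="env://")
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} joined the process group")

    # ... build the model, wrap it in DistributedDataParallel, train ...

    # Keep all ranks in sync before tearing down the process group.
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```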

@gogogwwb
Author

How do I keep the pods created by a PyTorchJob from automatically disappearing after completion? Is cleanPodPolicy the right setting? How do I set it up? Thanks.
