
container "pytorch" is waiting to start: PodInitializing #348

gogogwwb opened this issue Aug 15, 2021 · 20 comments

@gogogwwb

gogogwwb commented Aug 15, 2021

When the master has finished running, the worker is still initializing.

[screenshot: pod status]

worker log:
Error from server (BadRequest): container "pytorch" in pod "xxx-jxosi-worker-0" is waiting to start: PodInitializing

What is the reason for this?

@gaocegege
Member

Could you please run kubectl describe on the worker pod and post the result here?

@gaocegege
Member

Can you show more of the output, especially the Events section?

@gaocegege
Member

Seems that the init container is pending. Can you show its log?

@gogogwwb
Author

init-pytorch log:
[screenshot: init-pytorch container log]

@gogogwwb
Author

master svc:

[screenshot: master Service description]

@gaocegege
Member

Can you try kubectl debug to run an ephemeral container, then run ping xxx-master-0?

@gogogwwb
Author

ping master-0 from an ephemeral container:

[screenshot: ping output from the ephemeral container]

@gogogwwb
Author

gogogwwb commented Aug 16, 2021

I ran kubectl get ep -A and found that the endpoints appeared, but then disappeared again after a while.

[screenshots: kubectl get ep output before and after the endpoints disappear]

@gaocegege
Member

It's weird.

@gogogwwb
Author

I put the program to sleep for a while and found that the worker can then run. Is there any restriction on the creation order of the service and the pods in a PyTorchJob?
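
Roughly, the workaround looks like this (a minimal sketch; train() is a placeholder for the actual non-distributed training code, and the delay value is arbitrary):

```python
import time

def train():
    # placeholder for the actual (non-distributed) training code
    pass

if __name__ == "__main__":
    train()
    # Workaround: keep the master process alive after training finishes so
    # the master Service still has endpoints while the workers' init
    # containers are trying to resolve it.
    time.sleep(300)
```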

@gogogwwb
Author

It seems that the master finishes too quickly, so the endpoints of the master service end up empty and the worker cannot obtain the IP address of the master.

@gaocegege
Member

It seems that the master finishes too quickly, so the endpoints of the master service end up empty and the worker cannot obtain the IP address of the master.

Interesting. /cc @johnugeorge

@johnugeorge
Member

But in that case, the master should not start the job until the workers are up. Are you using the distributed setup in the code?

@gogogwwb
Author

I did not use the distributed APIs in the code. After the master runs, it goes to the Completed state; kubectl get ep -n test then shows that the endpoints are none.

@gogogwwb
Author

But in that case, the master should not start the job until the workers are up. Are you using the distributed setup in the code?

If the code does not sleep, the master is in the Running state only briefly, while the worker stays in the Init state.

@johnugeorge
Member

If you are not using distributed PyTorch in the code, this can happen: the master can start executing and complete before the worker starts. Can you confirm whether you are using the distributed APIs?

@gogogwwb
Author

If you are not using distributed PyTorch in the code, this can happen: the master can start executing and complete before the worker starts. Can you confirm whether you are using the distributed APIs?

I think I don't use the distributed APIs.
The code:

[screenshot: training code]

@johnugeorge
Member

That is the issue. Is there any reason for using pytorch-operator without using the distributed version?

Example:

https://github.com/kubeflow/tf-operator/blob/1aa44a68cd364ed6e30c0841e6daf1d93a29f146/examples/pytorch/mnist/mnist.py#L72
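
For reference, a rough sketch of the relevant part (assuming the gloo backend; the operator should inject MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE into every replica, and init_process_group blocks until all ranks have joined, so the master cannot complete before the workers come up):

```python
import torch.distributed as dist

def main():
    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
    # which the operator sets on every replica of the PyTorchJob.
    dist.init_process_group(backend="gloo", init_method="env://")
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} joined the process group")

    # ... build the model, wrap it in DistributedDataParallel, train ...

    # Keep all ranks in sync before tearing down the process group.
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```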

@gogogwwb
Author

How do I keep the pods created by a PyTorchJob from automatically disappearing after completion? Is cleanPodPolicy the right setting? How do I set it up? Thanks.
