-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[k8s] Update waiting logic for init containers #3762
Conversation
…o init_container_fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @romilbhardwaj! Left several coments.
raise config_lib.KubernetesError( | ||
f'Failed to run init container for pod {pod.metadata.name}.' | ||
f' Error details: {msg}.') | ||
init_waiting = init_status.state.waiting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do we still need to check init_waiting
if init_terminated
is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, checking init_waiting
is not required when init_terminated
is not None. Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, should this be return
or continue
? Will there be multiple init_containers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops good catch - should be continue
if (init_waiting is not None and init_waiting.reason | ||
not in ['ContainerCreating', 'PodInitializing']): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, if init_waiting.reason
is not in these two states, what else states it could be in, and why do we directly fail for those cases?
It would be great if we can have a reference link to those states in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately there's no listing of state.reason available (it's marked as a str without a list of enum values). Asked an AI assistant and it listed these states:
ContainerCreating: The container is still in the process of being created.
CrashLoopBackOff: The container is repeatedly crashing and Kubernetes is backing off before restarting it again.
ErrImagePull: There was an error pulling the container image from the container registry.
ImagePullBackOff: Kubernetes is backing off from pulling the image after a series of failures.
CreateContainerConfigError: There was an error in creating the container configuration.
InvalidImageName: The provided image name is invalid.
PodInitializing: The pod is in the process of initializing.
RunContainerError: There was an error running the container.
ContainerCannotRun: The container cannot run, possibly due to a command or configuration error.
ErrImageNeverPull: The image was not pulled because the policy is set to never pull.
NetworkPluginNotReady: The network plugin is not ready for the container.
BackOff: Kubernetes is backing off from restarting the container.
I don't fully trust this list so I don't want to include it in comments, but given this looks like the only successful states are ContainerCreating
and PodInitializing
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a TODO here saying that there might be other states we need to include here when it occurs during usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea - added now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @romilbhardwaj! LGTM except for the two comments below : )
raise config_lib.KubernetesError( | ||
f'Failed to run init container for pod {pod.metadata.name}.' | ||
f' Error details: {msg}.') | ||
init_waiting = init_status.state.waiting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, should this be return
or continue
? Will there be multiple init_containers?
if (init_waiting is not None and init_waiting.reason | ||
not in ['ContainerCreating', 'PodInitializing']): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a TODO here saying that there might be other states we need to include here when it occurs during usage?
Thanks @Michaelvll - will re-run the manual tests then merge. |
Closes #3702.
Tested with:
sky launch -c test --cloud kubernetes --image-id romilb/fakeimage -- echo hi
sky launch -c test --cloud kubernetes
with this config.yaml:sky launch -c test --cloud kubernetes
with this config.yaml: