[k8s] Update waiting logic for init containers #3762

Merged (6 commits) on Jul 24, 2024
41 changes: 35 additions & 6 deletions sky/provision/kubernetes/instance.py
@@ -222,6 +222,32 @@ def _wait_for_pods_to_run(namespace, new_nodes):
     Pods may be pulling images or may be in the process of container
     creation.
     """

+    def _check_init_containers(pod):
+        # Check if any of the init containers failed
+        # to start. Could be because the init container
+        # command failed or failed to pull image etc.
+        for init_status in pod.status.init_container_statuses:
+            init_terminated = init_status.state.terminated
+            if init_terminated:
+                if init_terminated.exit_code != 0:
+                    msg = init_terminated.message if (
+                        init_terminated.message) else str(init_terminated)
+                    raise config_lib.KubernetesError(
+                        'Failed to run init container for pod '
+                        f'{pod.metadata.name}. Error details: {msg}.')
+                continue
+            init_waiting = init_status.state.waiting
Collaborator: nit: do we still need to check init_waiting if init_terminated is true?

Collaborator (author): Good point, checking init_waiting is not required when init_terminated is not None. Updated.

Collaborator: Oh, should this be return or continue? Will there be multiple init_containers?

Collaborator (author): Oops, good catch - should be continue.

+            if (init_waiting is not None and init_waiting.reason
+                    not in ['ContainerCreating', 'PodInitializing']):
Comment on lines +241 to +242
@Michaelvll (Jul 24, 2024): Curious: if init_waiting.reason is not in these two states, what other states could it be in, and why do we directly fail for those cases? It would be great if we could add a reference link to those states in the comment.

Collaborator (author): Unfortunately, there's no listing of state.reason available (it's typed as a plain str, with no enumerated values). An AI assistant listed these states:

ContainerCreating: The container is still in the process of being created.
CrashLoopBackOff: The container is repeatedly crashing and Kubernetes is backing off before restarting it again.
ErrImagePull: There was an error pulling the container image from the container registry.
ImagePullBackOff: Kubernetes is backing off from pulling the image after a series of failures.
CreateContainerConfigError: There was an error in creating the container configuration.
InvalidImageName: The provided image name is invalid.
PodInitializing: The pod is in the process of initializing.
RunContainerError: There was an error running the container.
ContainerCannotRun: The container cannot run, possibly due to a command or configuration error.
ErrImageNeverPull: The image was not pulled because the policy is set to never pull.
NetworkPluginNotReady: The network plugin is not ready for the container.
BackOff: Kubernetes is backing off from restarting the container.

I don't fully trust this list, so I don't want to include it in the comments, but going by it, the only states we should keep waiting on are ContainerCreating and PodInitializing.
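
For context, a minimal sketch of how these reasons surface through the official kubernetes Python client. The pod name, namespace, and use of load_kube_config are illustrative assumptions, not part of this PR:

    from kubernetes import client, config

    # Assumes a reachable cluster with a local kubeconfig.
    config.load_kube_config()
    core = client.CoreV1Api()

    # 'mypod' and 'default' are hypothetical names.
    pod = core.read_namespaced_pod('mypod', 'default')
    for init_status in (pod.status.init_container_statuses or []):
        waiting = init_status.state.waiting
        terminated = init_status.state.terminated
        if waiting is not None:
            # `reason` is a plain string ('PodInitializing',
            # 'ImagePullBackOff', ...); there is no enum to check against.
            print(f'{init_status.name}: waiting ({waiting.reason})')
        elif terminated is not None:
            print(f'{init_status.name}: terminated '
                  f'(exit code {terminated.exit_code})')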

Collaborator: Should we add a TODO here saying that there might be other states we need to include when they occur during usage?

Collaborator (author): Good idea - added now.

+                # TODO(romilb): There may be more states to check for. Add
+                # them as needed.
+                msg = init_waiting.message if (
+                    init_waiting.message) else str(init_waiting)
+                raise config_lib.KubernetesError(
+                    'Failed to create init container for pod '
+                    f'{pod.metadata.name}. Error details: {msg}.')

     while True:
         all_pods_running = True
         # Iterate over each pod to check their status
@@ -246,12 +272,15 @@ def _wait_for_pods_to_run(namespace, new_nodes):
                 # See list of possible reasons for waiting here:
                 # https://stackoverflow.com/a/57886025
                 waiting = container_status.state.waiting
-                if (waiting is not None and
-                        waiting.reason != 'ContainerCreating'):
-                    raise config_lib.KubernetesError(
-                        'Failed to create container while launching '
-                        'the node. Error details: '
-                        f'{container_status.state.waiting.message}.')
+                if waiting is not None:
+                    if waiting.reason == 'PodInitializing':
+                        _check_init_containers(pod)
+                    elif waiting.reason != 'ContainerCreating':
+                        msg = waiting.message if waiting.message else str(
+                            waiting)
+                        raise config_lib.KubernetesError(
+                            'Failed to create container while launching '
+                            f'the node. Error details: {msg}.')
             # Reaching this point means that one of the pods had an issue,
             # so break out of the loop, and wait until next second.
             break
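
One hedged way to exercise the new path end to end (a hypothetical repro, not part of this PR): launch a pod whose init container exits nonzero. The main container then reports waiting.reason == 'PodInitializing', and _check_init_containers should raise on the init container's nonzero exit code:

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    # Hypothetical pod spec: the init container fails immediately, so the
    # main container never leaves PodInitializing.
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name='init-fail-demo'),
        spec=client.V1PodSpec(
            restart_policy='Never',
            init_containers=[
                client.V1Container(name='bad-init', image='busybox',
                                   command=['sh', '-c', 'exit 1']),
            ],
            containers=[
                client.V1Container(name='main', image='busybox',
                                   command=['sleep', '3600']),
            ],
        ),
    )
    core.create_namespaced_pod(namespace='default', body=pod)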