
[k8s] Update waiting logic for init containers #3762

Merged
6 commits merged on Jul 24, 2024

Conversation

romilbhardwaj (Collaborator) commented:

Closes #3702.

Tested with:

1. Base case: bad image, no init container, returns a clear error message.

   ```
   sky launch -c test --cloud kubernetes --image-id romilb/fakeimage -- echo hi
   ```

   ```
   run_instances: Error occurred when creating pods: Failed to create container while launching the node. Error details: failed to pull and unpack image "docker.io/romilb/fakeimage:latest": failed to resolve reference "docker.io/romilb/fakeimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed.
   ```

2. Healthy init container: `sky launch -c test --cloud kubernetes` with this config.yaml:

   ```yaml
   kubernetes:
     pod_config:
       spec:
         initContainers:
           - name: init-myservice
             image: busybox
             command: ['sh', '-c', 'echo hi']
   ```

3. Bad init container: `sky launch -c test --cloud kubernetes` with this config.yaml:

   ```yaml
   kubernetes:
     pod_config:
       spec:
         initContainers:
           - name: init-myservice
             image: busybox
             command: ['sh', '-c', 'exit 1']
   ```

   ```
   W 07-18 11:55:11 instance.py:604] run_instances: Error occurred when creating pods: Failed to run init container for pod test-2ea4-head. Error details: {'container_id': 'containerd://88a2f346c4b2b5abc09493cb7204d1bcd91eca0fa35670d9142b5c6840e00610',
   W 07-18 11:55:11 instance.py:604]  'exit_code': 1,
   W 07-18 11:55:11 instance.py:604]  'finished_at': datetime.datetime(2024, 7, 18, 18, 55, 11, tzinfo=tzutc()),
   W 07-18 11:55:11 instance.py:604]  'message': None,
   W 07-18 11:55:11 instance.py:604]  'reason': 'Error',
   W 07-18 11:55:11 instance.py:604]  'signal': None,
   W 07-18 11:55:11 instance.py:604]  'started_at': datetime.datetime(2024, 7, 18, 18, 55, 11, tzinfo=tzutc())}.
   ```
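The behavior exercised by these tests can be sketched with a small, self-contained approximation of the init-container check. The helper, exception class, and SimpleNamespace fixtures below are hypothetical stand-ins for illustration only, not the actual code in sky/provision/kubernetes/instance.py:

```python
from types import SimpleNamespace

# Waiting reasons treated as "still starting normally" (assumption based on
# the discussion in this PR); anything else is treated as a failure.
_ALLOWED_WAITING_REASONS = ['ContainerCreating', 'PodInitializing']


class KubernetesError(Exception):
    """Stand-in for config_lib.KubernetesError."""


def check_init_containers(pod):
    """Raise KubernetesError if any init container of the pod has failed."""
    for init_status in pod.status.init_container_statuses or []:
        init_terminated = init_status.state.terminated
        if init_terminated is not None:
            if init_terminated.exit_code != 0:
                raise KubernetesError(
                    f'Failed to run init container for pod '
                    f'{pod.metadata.name}. Error details: {init_terminated}.')
            # Terminated successfully; check the next init container.
            continue
        init_waiting = init_status.state.waiting
        if (init_waiting is not None and
                init_waiting.reason not in _ALLOWED_WAITING_REASONS):
            raise KubernetesError(
                f'Failed to run init container for pod '
                f'{pod.metadata.name}. Error details: {init_waiting}.')


# Example: a pod whose init container exited with code 1 (test case 3 above).
failed_state = SimpleNamespace(
    terminated=SimpleNamespace(exit_code=1, reason='Error'), waiting=None)
pod = SimpleNamespace(
    metadata=SimpleNamespace(name='test-2ea4-head'),
    status=SimpleNamespace(
        init_container_statuses=[SimpleNamespace(state=failed_state)]))
try:
    check_init_containers(pod)
except KubernetesError as e:
    print('raised:', e)
```

The real check runs inside the provisioner's pod-waiting loop and formats the error from the kubernetes client's status objects; this sketch only mirrors the control flow.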

@romilbhardwaj romilbhardwaj requested a review from Michaelvll July 24, 2024 21:30
Michaelvll (Collaborator) left a comment:

Thanks @romilbhardwaj! Left several comments.

On sky/provision/kubernetes/instance.py (outdated):
```python
raise config_lib.KubernetesError(
    f'Failed to run init container for pod {pod.metadata.name}.'
    f' Error details: {msg}.')
init_waiting = init_status.state.waiting
```
Michaelvll (Collaborator):

nit: do we still need to check init_waiting if init_terminated is true?

romilbhardwaj (Collaborator, Author):

Good point: checking init_waiting is not required when init_terminated is not None. Updated.

Michaelvll (Collaborator):

Oh, should this be return or continue? Will there be multiple init_containers?

romilbhardwaj (Collaborator, Author):

Oops, good catch. It should be continue.
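To illustrate the return-vs-continue point with a toy example (the function names and string statuses below are purely illustrative, not SkyPilot code): with return, a healthy first init container ends the scan and can mask a failure in a later one, while continue keeps scanning.

```python
def scan_with_return(statuses):
    # Buggy variant: `return` after a healthy status ends the loop,
    # so later init containers are never inspected.
    for status in statuses:
        if status == 'ok':
            return None
        raise RuntimeError(f'init container failed: {status}')


def scan_with_continue(statuses):
    # Fixed variant: `continue` moves on to the next init container.
    for status in statuses:
        if status == 'ok':
            continue
        raise RuntimeError(f'init container failed: {status}')


# Two init containers: the first healthy, the second failed.
statuses = ['ok', 'Error']
print(scan_with_return(statuses))  # None: the failure is missed
try:
    scan_with_continue(statuses)
except RuntimeError as exc:
    print(exc)  # the failure is detected
```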

Comment on lines +239 to +240:

```python
if (init_waiting is not None and init_waiting.reason
        not in ['ContainerCreating', 'PodInitializing']):
```
Michaelvll (Collaborator), Jul 24, 2024:

Curious, if init_waiting.reason is not in these two states, what other states could it be in, and why do we directly fail in those cases?

It would be great if we could have a reference link to those states in the comment.

romilbhardwaj (Collaborator, Author):

Unfortunately there's no listing of state.reason values available (it's typed as a plain str, not an enum). Asked an AI assistant and it listed these states:

- ContainerCreating: The container is still in the process of being created.
- CrashLoopBackOff: The container is repeatedly crashing and Kubernetes is backing off before restarting it again.
- ErrImagePull: There was an error pulling the container image from the container registry.
- ImagePullBackOff: Kubernetes is backing off from pulling the image after a series of failures.
- CreateContainerConfigError: There was an error in creating the container configuration.
- InvalidImageName: The provided image name is invalid.
- PodInitializing: The pod is in the process of initializing.
- RunContainerError: There was an error running the container.
- ContainerCannotRun: The container cannot run, possibly due to a command or configuration error.
- ErrImageNeverPull: The image was not pulled because the policy is set to never pull.
- NetworkPluginNotReady: The network plugin is not ready for the container.
- BackOff: Kubernetes is backing off from restarting the container.

I don't fully trust this list, so I don't want to include it in comments, but given this it looks like the only healthy waiting states are ContainerCreating and PodInitializing.
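Treating only those two reasons as healthy, the check reduces to a small allowlist test. A sketch (names are hypothetical; state.waiting.reason is a free-form string in the Kubernetes API, so the allowlist is a judgment call and may need to grow):

```python
# Waiting reasons treated as "still starting normally"; any other reason
# (e.g. ErrImagePull, CrashLoopBackOff) is assumed to indicate a failure.
# TODO: there may be other benign reasons to add as they show up in usage.
ALLOWED_WAITING_REASONS = {'ContainerCreating', 'PodInitializing'}


def is_init_waiting_failure(reason):
    """Return True if a waiting init container should be treated as failed."""
    return reason is not None and reason not in ALLOWED_WAITING_REASONS


print(is_init_waiting_failure('PodInitializing'))  # False: still starting
print(is_init_waiting_failure('ErrImagePull'))     # True: fail fast
print(is_init_waiting_failure(None))               # False: not waiting at all
```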

Michaelvll (Collaborator):

Should we add a TODO here saying that there might be other states we need to include when they occur during usage?

romilbhardwaj (Collaborator, Author):

Good idea, added now.

@romilbhardwaj romilbhardwaj requested a review from Michaelvll July 24, 2024 22:08
Michaelvll (Collaborator) left a comment:

Thanks for the update @romilbhardwaj! LGTM except for the two comments below : )

```python
raise config_lib.KubernetesError(
    f'Failed to run init container for pod {pod.metadata.name}.'
    f' Error details: {msg}.')
init_waiting = init_status.state.waiting
```
Michaelvll (Collaborator):

Oh, should this be return or continue? Will there be multiple init_containers?

Comment on lines +239 to +240:

```python
if (init_waiting is not None and init_waiting.reason
        not in ['ContainerCreating', 'PodInitializing']):
```

Michaelvll (Collaborator):

Should we add a TODO here saying that there might be other states we need to include when they occur during usage?

romilbhardwaj (Collaborator, Author):

Thanks @Michaelvll, will re-run the manual tests then merge.

@romilbhardwaj romilbhardwaj added this pull request to the merge queue Jul 24, 2024
Merged via the queue into master with commit 754bf57 Jul 24, 2024
20 checks passed
@romilbhardwaj romilbhardwaj deleted the init_container_fix branch July 24, 2024 23:55
Successfully merging this pull request may close these issues:

- [k8s] unable to launch pod with init container

2 participants