[k8s] handle network error from pulling docker image #2551
@romilbhardwaj This is now ready for a look. Wondering if we should keep Update: Depending on either or not there's a slight delay,
@romilbhardwaj Did some refactoring on
Thanks @landscapepainter! Left some comments.
if waiting and (waiting.reason == 'ErrImagePull' or
                waiting.reason == 'ImagePullBackOff'):
    raise config.KubernetesError(
        'Failed to pull docker image while '
        'launching the node. Please check '
        'your network connection. Error details: '
        f'{container_status.state.waiting.message}.')
Shouldn't this check and raise be moved to after L265, where we already have similar checks in place? It seems to make more sense to have all waiting related errors handled at one place, and this method should be relegated to simply be a wait loop.
Thanks for catching this. I'm wondering now if the following check from _wait_for_pods_to_schedule should actually be done after the pods are scheduled and doesn't need to be checked from _wait_for_pods_to_schedule. waiting.reason can be set to 'ContainerCreating' only after the pods are scheduled, so checking if the pods reached ContainerCreating state should be placed in _wait_for_pods_to_run. And this can update the original waiting check at _wait_for_pods_to_run with the waiting check from _wait_for_pods_to_schedule.
for container_status in pod.status.container_statuses:
    # If the container wasn't in 'ContainerCreating'
    # state, then we know pod wasn't scheduled or
    # had some other error, such as image pull error.
    # See list of possible reasons for waiting here:
    # https://stackoverflow.com/a/57886025
    waiting = container_status.state.waiting
    if waiting is not None and waiting.reason != 'ContainerCreating':
        all_pods_scheduled = False
        break
I replaced
if waiting and (waiting.reason == 'ErrImagePull' or
                waiting.reason == 'ImagePullBackOff'):
in _wait_for_pods_to_run with
if waiting is not None and waiting.reason != 'ContainerCreating':
so that post-schedule errors can be handled in _wait_for_pods_to_run. Please correct me if I'm missing anything! Tested with a network error (post-schedule error) and an excessive resource request error (pre-schedule error), and both failed over correctly.
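For readers following the thread, here is a minimal sketch of the kind of post-schedule check being discussed. It is not the code from the PR: the function name post_schedule_error, the helper fake_status, and the generic fallback message are hypothetical, and the container statuses are plain stand-in objects rather than real Kubernetes client objects.

```python
from types import SimpleNamespace

# Waiting reasons quoted in the review thread for image pull failures.
IMAGE_PULL_REASONS = ('ErrImagePull', 'ImagePullBackOff')


def post_schedule_error(container_statuses):
    """Return an error message for the first failing container, else None.

    Hypothetical helper: any waiting reason other than 'ContainerCreating'
    is treated as a post-schedule error.
    """
    # Treat a missing status list as "nothing to report yet".
    for container_status in container_statuses or []:
        waiting = container_status.state.waiting
        if waiting is None or waiting.reason == 'ContainerCreating':
            continue
        if waiting.reason in IMAGE_PULL_REASONS:
            return ('Failed to pull docker image while launching the node. '
                    'Please check your network connection. '
                    f'Error details: {waiting.message}.')
        # Assumed generic fallback; the wording is illustrative only.
        return (f'Container failed to start ({waiting.reason}): '
                f'{waiting.message}.')
    return None


def fake_status(reason=None, message=''):
    """Build a stand-in container status (not a real client object)."""
    waiting = None
    if reason is not None:
        waiting = SimpleNamespace(reason=reason, message=message)
    return SimpleNamespace(state=SimpleNamespace(waiting=waiting))
```

Calling post_schedule_error([fake_status('ImagePullBackOff', 'back-off')]) returns the image pull message, while a status that is still in 'ContainerCreating' returns None, so the caller can keep waiting.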
Thanks @landscapepainter! Left some comments.
Co-authored-by: Romil Bhardwaj <[email protected]>
…apepainter/skypilot into handle_image_pull_failure
@romilbhardwaj This is ready for another look!
Thanks @landscapepainter - this is good to go!
This resolves #2523
Some users have network restrictions that prevent them from pulling the docker image from the registry while creating the node (pod). When this happens, sky launch silently stalls instead of failing early. This PR resolves the issue by failing early and releasing the scheduled pod.
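The fail-early behavior can be pictured with a short sketch. This is an illustration under stated assumptions, not the PR's implementation: wait_for_pod_to_run, PodLaunchError, fake_pod, and the get_pod / delete_pod callables are hypothetical names, and deleting the pod on timeout is an assumption of this sketch.

```python
import time
from types import SimpleNamespace


class PodLaunchError(Exception):
    """Hypothetical error type standing in for the project's own."""


def wait_for_pod_to_run(get_pod, delete_pod, timeout=5.0, poll_interval=0.0):
    """Poll until the pod is Running, failing early on a bad waiting reason.

    get_pod returns the current pod object; delete_pod releases the pod.
    Both are hypothetical callables supplied by the caller.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pod = get_pod()
        if pod.status.phase == 'Running':
            return pod
        for container_status in pod.status.container_statuses or []:
            waiting = container_status.state.waiting
            if waiting is not None and waiting.reason != 'ContainerCreating':
                # Release the scheduled pod before surfacing the error.
                delete_pod()
                raise PodLaunchError(f'{waiting.reason}: {waiting.message}')
        time.sleep(poll_interval)
    # Assumption of this sketch: also release the pod on timeout.
    delete_pod()
    raise PodLaunchError('Timed out waiting for pod to run.')


def fake_pod(phase, reason=None, message=''):
    """Build a stand-in pod object for trying out the loop."""
    statuses = None
    if reason is not None:
        waiting = SimpleNamespace(reason=reason, message=message)
        statuses = [SimpleNamespace(state=SimpleNamespace(waiting=waiting))]
    status = SimpleNamespace(phase=phase, container_statuses=statuses)
    return SimpleNamespace(status=status)
```

With a stand-in pod stuck in 'ErrImagePull', the loop calls delete_pod once and raises immediately instead of waiting for the full timeout.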
To reproduce the image pull failure with a networking error, it was necessary to block the registry domain, us-central1-docker.pkg.dev. This can be done by adding an entry for that domain to /etc/hosts.
To reproduce this in a kind setup, the entry has to be set in /etc/hosts inside the node container rather than directly on the machine where you are running kind. The node container appears under the name skypilot-control-plane when running docker ps, and an interactive session can be entered by running docker exec -it skypilot-control-plane /bin/bash. This needs to be handled differently because kind simulates a node with a docker container, which in our case is skypilot-control-plane, and the kubelet for kind shares its networking settings with the node container rather than with the machine running kind.
Tested (run the relevant ones):
bash format.sh