[k8s] handle network error from pulling docker image #2551
Conversation
@romilbhardwaj This is now ready for a look. Wondering if we should keep … Update: Depending on whether or not there's a slight delay, …
@romilbhardwaj Did some refactoring on …
Thanks @landscapepainter! Left some comments.
if waiting and (waiting.reason == 'ErrImagePull' or
                waiting.reason == 'ImagePullBackOff'):
    raise config.KubernetesError(
        'Failed to pull docker image while '
        'launching the node. Please check '
        'your network connection. Error details: '
        f'{container_status.state.waiting.message}.')
Shouldn't this check and raise be moved to after L265, where we already have similar checks in place? It seems to make more sense to have all waiting-related errors handled in one place, so that this method remains simply a wait loop.
Thanks for catching this. I'm wondering now if the following check from `_wait_for_pods_to_schedule` should actually be done after the pods are scheduled, and doesn't need to be checked from `_wait_for_pods_to_schedule`. `waiting.reason` can be set to `'ContainerCreating'` only after the pods are scheduled, so checking whether the pods reached the `ContainerCreating` state should be placed in `_wait_for_pods_to_run`. And this can update the original waiting check at `_wait_for_pods_to_run` with the waiting check from `_wait_for_pods_to_schedule`.
for container_status in pod.status.container_statuses:
    # If the container wasn't in 'ContainerCreating'
    # state, then we know pod wasn't scheduled or
    # had some other error, such as image pull error.
    # See list of possible reasons for waiting here:
    # https://stackoverflow.com/a/57886025
    waiting = container_status.state.waiting
    if waiting is not None and waiting.reason != 'ContainerCreating':
        all_pods_scheduled = False
        break
I updated

if waiting and (waiting.reason == 'ErrImagePull' or
                waiting.reason == 'ImagePullBackOff'):

from `_wait_for_pods_to_run` with

if waiting is not None and waiting.reason != 'ContainerCreating':

so that the post-schedule errors can be handled from `_wait_for_pods_to_run`. Please correct me if I'm missing anything! Tested for a network error (post-schedule error) and an excessive resource request error (pre-schedule error), and both failed over correctly.
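For clarity, a rough sketch of what the consolidated check inside `_wait_for_pods_to_run` could look like after this change (the surrounding loop, the `pod` variable, and `config.KubernetesError` are assumed from the snippets above; the exact message and structure may differ in the PR):

for container_status in pod.status.container_statuses:
    waiting = container_status.state.waiting
    # After the pod is scheduled, any waiting reason other than
    # 'ContainerCreating' (e.g. ErrImagePull, ImagePullBackOff) is an error.
    if waiting is not None and waiting.reason != 'ContainerCreating':
        raise config.KubernetesError(
            'Failed to create container while launching the node. '
            f'Error details: {waiting.message}.')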
@romilbhardwaj This is ready for another look!
Thanks @landscapepainter - this is good to go!
This resolves #2523
Some users have network restrictions that prevent them from pulling the docker image from the registry while creating the node (pod). When this happened, `sky launch` was silently stalling without failing early. This PR resolves the issue by failing early and releasing the scheduled pod.
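For reference, a minimal standalone sketch of detecting such image pull failures with the official kubernetes Python client (this is not the PR's code: the pod name and namespace are placeholders, and a plain RuntimeError stands in for the PR's config.KubernetesError):

from kubernetes import client, config


def check_image_pull_errors(pod_name: str, namespace: str = 'default') -> None:
    # Load credentials from the local kubeconfig and fetch the pod's status.
    config.load_kube_config()
    core_api = client.CoreV1Api()
    pod = core_api.read_namespaced_pod(name=pod_name, namespace=namespace)
    for container_status in (pod.status.container_statuses or []):
        waiting = container_status.state.waiting
        # 'ErrImagePull' / 'ImagePullBackOff' mean the image could not be
        # pulled, e.g. because the registry is unreachable.
        if waiting and waiting.reason in ('ErrImagePull', 'ImagePullBackOff'):
            raise RuntimeError(
                'Failed to pull docker image while launching the node. '
                'Please check your network connection. '
                f'Error details: {waiting.message}.')

In the PR, a check of this kind is wired into the provisioning wait loop so that `sky launch` fails over instead of stalling.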
To reproduce the image pull failure caused by a network error, it was necessary to block the registry domain, us-central1-docker.pkg.dev. This can be done by adding an entry for that domain (e.g. mapping it to an unreachable address) to `/etc/hosts`.

To reproduce this with a `kind` setup, the entry has to be set in `/etc/hosts` inside the node container rather than directly on the machine where you are running `kind`. The node container appears under the name `skypilot-control-plane` when running `docker ps`, and an interactive session can be entered with `docker exec -it skypilot-control-plane /bin/bash`. We need to handle this differently because `kind` simulates a node with a docker container, in our case `skypilot-control-plane`, and the kubelet for `kind` shares its networking settings with the node container rather than with the machine running `kind`.
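A sketch of these reproduction steps for a `kind` cluster, assuming the node container is named `skypilot-control-plane` and using 127.0.0.1 as an example unreachable mapping for the registry domain:

docker ps                                          # confirm the node container name
docker exec -it skypilot-control-plane /bin/bash   # enter the node container
# Inside the node container, block the registry domain (example mapping):
echo '127.0.0.1 us-central1-docker.pkg.dev' >> /etc/hosts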
Tested (run the relevant ones):
- `bash format.sh`