[K8s] Wait until endpoint to be ready for --endpoint call #3634

Conversation
start_time = time.time()
retry_cnt = 0
while ip is None and time.time() - start_time < timeout:
    service = core_api.read_namespaced_service(
        service_name, namespace, _request_timeout=kubernetes.API_TIMEOUT)
    if service.status.load_balancer.ingress is not None:
        ip = (service.status.load_balancer.ingress[0].ip or
              service.status.load_balancer.ingress[0].hostname)
    if ip is None:
        retry_cnt += 1
        if retry_cnt % 5 == 0:
            logger.debug('Waiting for load balancer IP to be assigned...')
        time.sleep(1)
return ip
Following the end-to-end argument, I'm thinking this retry + timeout functionality is better implemented higher up in the stack (perhaps in backend_utils.get_endpoints, when provision_lib.query_ports is called). E.g., it might also be required for other port query methods/clouds in the future. Wdyt?
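The higher-level version being proposed could look something like the sketch below. This is a hypothetical `poll_until` helper with illustrative names, not existing SkyPilot API; it just factors the retry-with-timeout pattern out of the Kubernetes-specific code above:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar('T')


def poll_until(query: Callable[[], Optional[T]],
               timeout: float,
               interval: float = 1.0) -> Optional[T]:
    """Call query() until it returns a non-None value or `timeout`
    seconds elapse. Returns None on timeout."""
    start = time.time()
    result = query()
    while result is None and time.time() - start < timeout:
        time.sleep(interval)
        result = query()
    return result


# Example: a query that succeeds on its third call, standing in for
# the LoadBalancer IP lookup in the diff above.
attempts = iter([None, None, '10.0.0.1'])
ip = poll_until(lambda: next(attempts), timeout=5, interval=0.01)
```

With this structure the query still runs once even when timeout is 0, so a caller that wants a single non-blocking probe gets one attempt rather than an immediate None.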
Does this issue happen in ingress mode as well? Adding the retry at a higher level makes sense to me when the issue is general. Otherwise, it may introduce unnecessary overhead from retrying before erroring out when the ports actually fail to expose. Wdyt?
I see - I'm okay with having this check here then :) Maybe we can make a note in the provision_lib.query_ports docstring that the underlying implementation is responsible for retries and timeout.
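The suggested docstring note could look something like the sketch below. The signature and wording are illustrative only, not the actual SkyPilot code:

```python
def query_ports(cluster_name: str, ports: list):
    """Query the endpoints exposed for the given ports of a cluster.

    Note: the underlying per-cloud implementation is responsible for
    retrying and timing out while an endpoint is still being
    provisioned (e.g. a Kubernetes LoadBalancer IP that has not been
    assigned yet); callers should not wrap this in their own retry
    loop.
    """
    raise NotImplementedError
```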
Good point! Updated.
Thanks @Michaelvll - tested on GKE and local clusters. Left a small comment about timeout=0 condition, otherwise LGTM.
Thanks @Michaelvll!
Tested:
* Wait until endpoint to be ready for k8s
* fix
* Less debug output
* ux
* fix
* address comments
* Add rich status for endpoint fetching
* Add rich status for waiting for the endpoint
This PR makes fetching the endpoint of a newly launched Kubernetes cluster wait until the endpoint is ready.
The following does not work on master:
sky launch -c test-port --ports 8889 --cloud kubernetes --cpus 2; sky status --endpoint 8889 test-port
Tested (run the relevant ones):

* bash format.sh
* sky launch -c test-port --ports 8889 --cloud kubernetes --cpus 2; sky status --endpoint 8889 test-port
* pytest tests/test_smoke.py
* pytest tests/test_smoke.py::test_fill_in_the_name
* conda deactivate; bash -i tests/backward_compatibility_tests.sh