[Core][RunPod] Fix fetching RunPod endpoint #3876

cblmemo · 2024-08-26T22:09:06Z

When opening a port takes longer than expected (as it can on RunPod), retrieving the endpoint fails. This PR adds a retry for that case.

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Michaelvll

Thanks for identifying this issue and the quick fix @cblmemo! Left a comment.

Michaelvll · 2024-08-26T22:30:58Z

sky/serve/core.py

+            def _get_endpoint() -> str:
+                start_time = time.time()
+                while True:
+                    lb_endpoint = backend_utils.get_endpoints(
+                        controller_handle.cluster_name,
+                        lb_port,
+                        skip_status_check=True).get(lb_port)
+                    if lb_endpoint is not None:
+                        return lb_endpoint
+                    if (time.time() - start_time > serve_constants.
+                            SERVICE_CONTROLLER_INGRESS_TIMEOUT_SECONDS):
+                        raise RuntimeError(
+                            'Failed to get ingress endpoint for controller.')
+                    time.sleep(1)
+
+            endpoint = _get_endpoint()


Should we instead add the retry in the underlying cloud implementation, since this is more of a cloud-specific issue? For example: https://github.com/skypilot-org/skypilot/blob/master/sky/provision/kubernetes/network_utils.py#L258-L280

Good point! Fixed. Thanks!
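
For illustration, here is a minimal sketch of what a retry at the provider level could look like. It is not the code merged in this PR: `wait_for_port_mapping`, `fetch_mapping`, and the default values are hypothetical names and numbers chosen for the example, and the real RunPod provisioner may be structured differently.

```python
import time
from typing import Callable, Dict, List


def wait_for_port_mapping(
    fetch_mapping: Callable[[], Dict[int, str]],
    ports: List[int],
    timeout_seconds: float = 30.0,
    poll_interval_seconds: float = 1.0,
) -> Dict[int, str]:
    """Poll until every requested port has an endpoint or the timeout elapses.

    `fetch_mapping` is a hypothetical callable standing in for the
    cloud-specific query; it returns {port: endpoint} for ports that are ready.
    """
    wanted = set(ports)
    deadline = time.monotonic() + timeout_seconds
    while True:
        mapping = fetch_mapping()
        # Keep only the ports the caller asked about.
        ready = {p: mapping[p] for p in wanted if p in mapping}
        if wanted.issubset(ready) or time.monotonic() >= deadline:
            # On timeout this may be a partial (or empty) result.
            return ready
        time.sleep(poll_interval_seconds)
```

Keeping the loop next to the cloud-specific query means callers such as the serve controller do not need to know which clouds open ports slowly.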


Michaelvll · 2024-08-29T06:20:05Z

sky/provision/runpod/instance.py

-    }
+    # RunPod ports sometimes take a while to be ready.
+    start_time = time.time()
+    while True:


Let's add a timeout in case there is an issue with the ports, for example when the Docker image does not expose the SSH port.

There is already a 30-second timeout?

Ahh, I missed that. Just wondering whether we should raise an error or print a warning in that case, instead of silently ignoring it?

Good point! Added. Thanks!
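
As a rough sketch of the "warn instead of silently ignoring" idea discussed above: once the wait has timed out, the ports that never became ready can be reported. The function name `warn_on_missing_ports` and the message wording are assumptions for this example, not the code added in the PR.

```python
import logging
from typing import Dict, Iterable, List

logger = logging.getLogger(__name__)


def warn_on_missing_ports(requested: Iterable[int],
                          found: Dict[int, str],
                          timeout_seconds: float) -> List[int]:
    """Log a warning for requested ports with no endpoint; return them sorted.

    Hypothetical helper: `found` maps each ready port to its endpoint.
    """
    missing = sorted(set(requested) - set(found))
    if missing:
        logger.warning(
            'Timed out after %s seconds waiting for ports %s to be ready. '
            'Check that the image exposes these ports.',
            timeout_seconds, missing)
    return missing
```

For example, `warn_on_missing_ports([22, 8080], {22: 'host:10022'}, 30)` logs one warning and returns `[8080]`. Raising an exception instead would be the stricter alternative mentioned in the review.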

fix

0c89714

Michaelvll reviewed Aug 26, 2024


cblmemo added 2 commits August 28, 2024 11:38

Revert "fix"

ba1a442

This reverts commit 0c89714.

fix

1a1030e

Michaelvll reviewed Aug 29, 2024


cblmemo added 2 commits September 3, 2024 14:30

add comment

05777b0

Merge remote-tracking branch 'origin/master' into fix-serve-get-endpoint

543d5d6

Michaelvll approved these changes Sep 6, 2024


cblmemo changed the title ~~[Serve] Fix fetching service ingress endpoint~~ [Core][RunPod] Fix fetching RunPod endpoint Sep 6, 2024

cblmemo added this pull request to the merge queue Sep 6, 2024

Merged via the queue into master with commit 90a840f Sep 6, 2024
20 checks passed

cblmemo deleted the fix-serve-get-endpoint branch September 6, 2024 23:47

cblmemo commented Aug 26, 2024

Michaelvll left a comment

Michaelvll Aug 26, 2024

cblmemo Aug 28, 2024

Michaelvll Aug 29, 2024

cblmemo Aug 29, 2024

Michaelvll Sep 2, 2024

cblmemo Sep 3, 2024
