[Core][RunPod] Fix fetching RunPod endpoint #3876

Merged: 5 commits, Sep 6, 2024

Collaborator

@cblmemo cblmemo commented Aug 26, 2024

Sometimes, when opening a port takes longer than expected (as on RunPod), retrieving the endpoint fails. This PR adds a retry for that case.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for identifying this issue and the quick fix @cblmemo! Left a comment.

Comment on lines 263 to 278
def _get_endpoint() -> str:
    start_time = time.time()
    while True:
        lb_endpoint = backend_utils.get_endpoints(
            controller_handle.cluster_name,
            lb_port,
            skip_status_check=True).get(lb_port)
        if lb_endpoint is not None:
            return lb_endpoint
        if (time.time() - start_time > serve_constants.
                SERVICE_CONTROLLER_INGRESS_TIMEOUT_SECONDS):
            raise RuntimeError(
                'Failed to get ingress endpoint for controller.')
        time.sleep(1)

endpoint = _get_endpoint()
Collaborator

Should we instead add the retry in the underlying cloud implementation, since this is more of a cloud-specific issue? For example: https://github.com/skypilot-org/skypilot/blob/master/sky/provision/kubernetes/network_utils.py#L258-L280

Collaborator Author

Good point! Fixed. Thanks!
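
For readers following the thread, here is a minimal sketch of the idea discussed above: the generic caller stays free of retry logic, and only the provider whose ports are slow to open wraps its port query in a polling loop. The names `with_port_retry` and `query_once`, and the default interval, are hypothetical and chosen for illustration; this is not the code merged in this PR or SkyPilot's API.

```python
import time
from typing import Callable, Dict, List

Endpoints = Dict[int, str]  # maps port -> "host:port"


def with_port_retry(
    query_once: Callable[[List[int]], Endpoints],
    timeout_seconds: float = 30.0,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[List[int]], Endpoints]:
    """Wrap a provider-specific, single-shot port query with polling.

    `query_once` is a hypothetical function that asks the cloud for the
    endpoints of the requested ports and returns only the ones that are
    ready. The wrapper keeps polling until every port is present or the
    timeout elapses, then returns whatever the last query found.
    """

    def query(ports: List[int]) -> Endpoints:
        deadline = time.monotonic() + timeout_seconds
        while True:
            found = query_once(ports)
            all_ready = all(port in found for port in ports)
            if all_ready or time.monotonic() >= deadline:
                return found
            sleep(interval_seconds)

    return query
```

A provider with slow port opening would register `with_port_retry(its_query)`, while other providers keep their single-shot query unchanged, so callers such as the serve controller do not need to know which clouds are slow.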

This reverts commit 0c89714.
}
# RunPod ports sometimes take a while to be ready.
start_time = time.time()
while True:
Collaborator

Let's add a timeout in case there is an issue with the ports, for example if the Docker image does not expose the SSH port.

Collaborator Author

There is already a 30-second timeout?

Collaborator

Ahh, I missed that. Just wondering if we should raise an error or print a warning for that, instead of silently ignoring it?

Collaborator Author

Good point! Added. Thanks!
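
The exchange above (bounded wait, then a warning instead of a silent fall-through) can be sketched roughly as follows. This is an illustrative sketch only: `wait_for_endpoint` and `lookup` are hypothetical names, the 30-second default mirrors the timeout mentioned in the thread, and the warning text is invented rather than taken from the merged change.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def wait_for_endpoint(
    lookup: Callable[[int], Optional[str]],
    port: int,
    timeout_seconds: float = 30.0,
    interval_seconds: float = 1.0,
) -> Optional[str]:
    """Poll `lookup(port)` until it returns an endpoint or time runs out.

    `lookup` is a hypothetical single-shot query that returns "host:port"
    once the port is ready, or None otherwise. On timeout the function
    logs a warning and returns None, so the caller can see why the
    endpoint is missing instead of the failure passing silently.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        endpoint = lookup(port)
        if endpoint is not None:
            return endpoint
        if time.monotonic() >= deadline:
            logger.warning(
                'Port %d was not ready after %.0f seconds; '
                'check that the image exposes it.', port, timeout_seconds)
            return None
        time.sleep(interval_seconds)
```

Returning None with a warning keeps the other ports usable; raising an exception at the timeout branch is the stricter alternative raised in the review.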

@cblmemo cblmemo changed the title from "[Serve] Fix fetching service ingress endpoint" to "[Core][RunPod] Fix fetching RunPod endpoint" Sep 6, 2024
@cblmemo cblmemo added this pull request to the merge queue Sep 6, 2024
Merged via the queue into master with commit 90a840f Sep 6, 2024
20 checks passed
@cblmemo cblmemo deleted the fix-serve-get-endpoint branch September 6, 2024 23:47
2 participants