[Core][RunPod] Fix fetching RunPod endpoint #3876
Thanks for identifying this issue and the quick fix @cblmemo! Left a comment.
sky/serve/core.py
Outdated
def _get_endpoint() -> str:
    start_time = time.time()
    while True:
        lb_endpoint = backend_utils.get_endpoints(
            controller_handle.cluster_name,
            lb_port,
            skip_status_check=True).get(lb_port)
        if lb_endpoint is not None:
            return lb_endpoint
        if (time.time() - start_time > serve_constants.
                SERVICE_CONTROLLER_INGRESS_TIMEOUT_SECONDS):
            raise RuntimeError(
                'Failed to get ingress endpoint for controller.')
        time.sleep(1)

endpoint = _get_endpoint()
Should we instead add the retry in the underlying cloud implementation, since this looks like a cloud-specific issue? For example, see https://github.com/skypilot-org/skypilot/blob/master/sky/provision/kubernetes/network_utils.py#L258-L280
Good point! Fixed. Thanks!
This reverts commit 0c89714.
}
# RunPod ports sometimes take a while to be ready.
start_time = time.time()
while True:
Let's add a timeout in case there is an issue with the ports, for example when the Docker image does not expose the SSH port.
There is already a 30 second timeout..?
Ahh, I missed that. Just wondering if we should raise an error or print a warning when the timeout is hit, instead of silently ignoring it?
Good point! Added. Thanks!
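For readers following the thread, here is a minimal sketch of the pattern discussed above: poll until the endpoint appears, give up after a timeout, and log a warning rather than failing silently. It is not the code merged in this PR. The names `wait_for_endpoint`, `query`, `timeout_seconds`, and `poll_interval_seconds` are hypothetical, and the 30-second default only mirrors the timeout mentioned in the review.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def wait_for_endpoint(
    query: Callable[[], Optional[str]],
    timeout_seconds: float = 30.0,
    poll_interval_seconds: float = 1.0,
) -> Optional[str]:
    """Poll `query` until it returns an endpoint or the timeout elapses.

    `query` is a hypothetical stand-in for the cloud-specific lookup; it
    should return None while the port is not ready yet.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        endpoint = query()
        if endpoint is not None:
            return endpoint
        if time.monotonic() >= deadline:
            # Surface the failure instead of silently returning nothing.
            logger.warning(
                'Endpoint not ready after %.0f seconds; the port may not '
                'be exposed.', timeout_seconds)
            return None
        time.sleep(poll_interval_seconds)
```

Whether to return `None` with a warning or raise an exception at the deadline is a design choice; the sketch warns so that callers that can tolerate a missing endpoint keep working.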
Sometimes port opening takes longer than expected (for example, on RunPod), and retrieving the endpoint fails as a result. This PR adds a retry for that case.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh