-
Notifications
You must be signed in to change notification settings - Fork 547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[K8s] Wait until endpoint to be ready for --endpoint
call
#3634
Conversation
--endpoint
call
start_time = time.time() | ||
retry_cnt = 0 | ||
while ip is None and time.time() - start_time < timeout: | ||
service = core_api.read_namespaced_service( | ||
service_name, namespace, _request_timeout=kubernetes.API_TIMEOUT) | ||
if service.status.load_balancer.ingress is not None: | ||
ip = (service.status.load_balancer.ingress[0].ip or | ||
service.status.load_balancer.ingress[0].hostname) | ||
if ip is None: | ||
retry_cnt += 1 | ||
if retry_cnt % 5 == 0: | ||
logger.debug('Waiting for load balancer IP to be assigned' | ||
'...') | ||
time.sleep(1) | ||
return ip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following the end-to-end argument, I'm thinking this retry + timeout functionality is better implemented higher up in the stack (perhaps in backend_utils.get_endpoints
when provision_lib.query_ports
is called). E.g., it might also be required for other port query methods/clouds in the future.
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this issue happen to ingress mode as well? Adding to the higher level makes sense to me, when the issue is general. Otherwise, it may introduce unnecessary overheads for retrying before erroring out, when the ports actually fail to expose. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see - I'm okay with having this check here then :) Maybe we can make a note in provision_lib.query_ports
doc str that the underlying implementation is responsible for retries and timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Updated.
…o wait-until-endpoint-ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll - tested on GKE and local clusters. Left a small comment about timeout=0 condition, otherwise LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll!
Tested:
|
* Wait until endpoint to be ready for k8s * fix * Less debug output * ux * fix * address comments * Add rich status for endpoint fetching * Add rich status for waiting for the endpoint
This PR adds waiting for fetching endpoint of a newly launched cluster on kubernetes.
The following does not work on master:
sky launch -c test-port --ports 8889 --cloud kubernetes --cpus 2; sky status --endpoint 8889 test-port
Tested (run the relevant ones):
bash format.sh
sky launch -c test-port --ports 8889 --cloud kubernetes --cpus 2; sky status --endpoint 8889 test-port
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh