-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[K8s] Zero config networking for Kubernetes #2500
Conversation
# Conflicts: # sky/backends/backend_utils.py # sky/backends/cloud_vm_ray_backend.py # sky/registry.py # sky/utils/ux_utils.py
# Conflicts: # sky/__init__.py # sky/authentication.py # sky/backends/backend_utils.py # sky/backends/cloud_vm_ray_backend.py # sky/clouds/__init__.py # sky/clouds/service_catalog/__init__.py # sky/setup_files/MANIFEST.in # sky/utils/ux_utils.py
# Conflicts: # sky/backends/cloud_vm_ray_backend.py
Blocked on #2556. This will likely need minor changes after it is merged. Rest can still be reviewed. |
* surface provision failure message * nit * nit * format * nit * CPU message fix * update Insufficient memory handling * nit * nit * Update sky/skylet/providers/kubernetes/node_provider.py Co-authored-by: Romil Bhardwaj <[email protected]> * Update sky/skylet/providers/kubernetes/node_provider.py Co-authored-by: Romil Bhardwaj <[email protected]> * Update sky/skylet/providers/kubernetes/node_provider.py Co-authored-by: Romil Bhardwaj <[email protected]> * Update sky/skylet/providers/kubernetes/node_provider.py Co-authored-by: Romil Bhardwaj <[email protected]> * format * update gpu failure message and condition * fix GPU handling cases * fix * comment * nit * add try except block with general error handling --------- Co-authored-by: Romil Bhardwaj <[email protected]>
…roconf_networking # Conflicts: # sky/clouds/kubernetes.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @romilbhardwaj @landscapepainter @aviweit and @hemildesai ! Just tested with a newly launched GKE cluster (1 t4, 2 n2-highmem-8) without any network configuration.
Tried the following commands and it works like magic:
sky launch -c test-k8s --memory 60+ echo hi
sky launch -c test-k8s-2 --memory 60+ echo hi
sky launch -c test-k8s-3 --gpus t4 nvidia-smi
ssh test-k8s-3; nvidia-smi
The code looks mostly good to me. One question I have is whether we would like to preserve the old NodePort way, as it seems we have removed some NodePort related code, not sure if it will still work. Also, for code simplicity, it would be nice if we can remove the old mode, if there is no strong need for it. ; )
svc_name = f'{self.cluster_name_on_cloud}-ray-head-ssh' | ||
retry_cnt = 0 | ||
while True: | ||
try: | ||
head_ssh_port = clouds.Kubernetes.get_port(svc_name) | ||
break | ||
except Exception: # pylint: disable=broad-except | ||
retry_cnt += 1 | ||
if retry_cnt >= max_attempts: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does removing this mean the NodePort
mode will not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, NodePort would still work - it's just that now everything goes through a SSH Jump Pod, so the SSH port remains fixed at 22 and we don't need to get port here. Note that the jump port is dynamic and is fetched in kubernetes_utils.get_ssh_proxy_command
at provisioning time.
Thanks for the reviews @Michaelvll! This is ready for another look. Running smoke tests on GKE now:
That's a good point. The NodePort method is preserved for now since the port-forward mode might be considered as a hack by some (since it relies on tunneling over the API server, and that tunnel is designed only for development work). I was thinking we could collect feedback from users and deprecate it in the future if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fix @romilbhardwaj! The code looks pretty good to me.
Thanks for the fast reviews @Michaelvll! Waiting on nodeport smoke tests to pass, will merge after that. |
…roconf_networking # Conflicts: # tests/kubernetes/README.md
This PR introduces new networking features for our Kubernetes support. In particular, we no longer need opening many ports on the Kubernetes cluster nodes. Now we support two modes of operation:
portforward
: Open no ports, and we usekubectl port-forward
under the hood to reach the pods. This requires zero configuration on the user's end, and is only marginally worse (~10%) in performance (see benchmarks). Given the significantly better UX, this will the default mode of operation.nodeport
: Open 1 port, and we run a ssh jump pod on that port to reach other pods. This requires opening one port on any one node in the Kubernetes cluster, and offers the highest performance while minimizing the number of open ports needed.Users who don't want to use
portforward
can switch tonodeport
by modifying their~/.sky/config
file:Note that we currently create one jump pod per user. Eventually, we want to share the jump pod across many users (See #2499)
This PR also has other bug fixes, including populating k8s envvars when the user runs SSH (#2287 and #2453 will also be closed by this PR).
Thanks to @landscapepainter, @aviweit and @hemildesai for their contributions.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials"