[K8s] Zero config networking for Kubernetes #2500

romilbhardwaj · 2023-08-31T17:22:30Z

This PR introduces new networking features for our Kubernetes support. In particular, we no longer need to open many ports on the Kubernetes cluster nodes. We now support two modes of operation:

portforward: Opens no ports; we use kubectl port-forward under the hood to reach the pods. This requires zero configuration on the user's end, and is only marginally worse (~10%) in performance (see benchmarks). Given the significantly better UX, this will be the default mode of operation.
nodeport: Opens one port, and we run an SSH jump pod on that port to reach other pods. This requires opening one port on any one node in the Kubernetes cluster, and offers the highest performance while minimizing the number of open ports needed.
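
To make the difference between the two modes concrete, here is a minimal sketch of the commands each mode could boil down to. The function name, the local port, and the way the target is addressed are hypothetical and not taken from this PR; the real logic lives in SkyPilot's Kubernetes utilities and may differ (for example, in which pod the port-forward targets).

```python
from typing import List


def plan_ssh_commands(mode: str,
                      user: str,
                      pod_name: str,
                      *,
                      local_port: int = 10022,
                      node_ip: str = '',
                      node_port: int = 0,
                      pod_ip: str = '') -> List[List[str]]:
    """Return the argv lists a client might run for each mode.

    Hypothetical helper for illustration only; not SkyPilot's API.
    """
    if mode == 'portforward':
        # No cluster port is opened: traffic is tunneled through the
        # Kubernetes API server to port 22 of the pod.
        return [
            ['kubectl', 'port-forward', f'pod/{pod_name}',
             f'{local_port}:22'],
            ['ssh', '-p', str(local_port), f'{user}@127.0.0.1'],
        ]
    if mode == 'nodeport':
        # One port is opened on a node; the jump pod behind it relays
        # SSH to the target pod, whose SSH port stays at 22.
        return [[
            'ssh', '-J', f'{user}@{node_ip}:{node_port}',
            f'{user}@{pod_ip}'
        ]]
    raise ValueError(f'Unknown networking mode: {mode!r}')
```

Returning argv lists rather than shell strings keeps the sketch free of quoting concerns; a caller could pass each list to subprocess.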

Users who don't want to use portforward can switch to nodeport by modifying their ~/.sky/config file:

kubernetes:
  networking: nodeport
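
As a rough illustration of how such a setting could be consumed, the sketch below resolves the mode from an already-parsed config dictionary, falling back to portforward when the key is absent. The function and constant names are assumptions made for this example and are not SkyPilot's actual config API.

```python
from typing import Any, Dict

# Mode names as described in this PR; portforward is the stated default.
VALID_MODES = ('portforward', 'nodeport')
DEFAULT_MODE = 'portforward'


def resolve_networking_mode(config: Dict[str, Any]) -> str:
    """Pick the networking mode from a parsed ~/.sky/config mapping.

    Hypothetical helper for illustration only.
    """
    kubernetes_section = config.get('kubernetes') or {}
    mode = kubernetes_section.get('networking', DEFAULT_MODE)
    if mode not in VALID_MODES:
        raise ValueError(
            f'Invalid kubernetes.networking value {mode!r}; '
            f'expected one of {VALID_MODES}.')
    return mode
```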

Note that we currently create one jump pod per user. Eventually, we want to share the jump pod across many users (see #2499).

This PR also has other bug fixes, including populating k8s envvars when the user runs SSH (#2287 and #2453 will also be closed by this PR).

Thanks to @landscapepainter, @aviweit and @hemildesai for their contributions.

Tested (run the relevant ones):

Code formatting: bash format.sh
pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials"
Ensure [k8s] CUDA envvars don't work in ssh #2453 is fixed

romilbhardwaj · 2023-09-14T06:41:58Z

Blocked on #2556. This will likely need minor changes after it is merged. Rest can still be reviewed.

Michaelvll

Thanks for the PR @romilbhardwaj @landscapepainter @aviweit and @hemildesai ! Just tested with a newly launched GKE cluster (1 t4, 2 n2-highmem-8) without any network configuration.
Tried the following commands and they work like magic:

sky launch -c test-k8s --memory 60+ echo hi
sky launch -c test-k8s-2 --memory 60+ echo hi
sky launch -c test-k8s-3 --gpus t4 nvidia-smi
ssh test-k8s-3; nvidia-smi

The code looks mostly good to me. One question I have is whether we would like to preserve the old NodePort way, as it seems we have removed some NodePort related code, not sure if it will still work. Also, for code simplicity, it would be nice if we can remove the old mode, if there is no strong need for it. ; )

sky/authentication.py

Michaelvll · 2023-09-15T04:01:22Z

sky/backends/cloud_vm_ray_backend.py

-            svc_name = f'{self.cluster_name_on_cloud}-ray-head-ssh'
-            retry_cnt = 0
-            while True:
-                try:
-                    head_ssh_port = clouds.Kubernetes.get_port(svc_name)
-                    break
-                except Exception:  # pylint: disable=broad-except
-                    retry_cnt += 1
-                    if retry_cnt >= max_attempts:
-                        raise


Does removing this mean the NodePort mode will not work?

No, NodePort would still work. Everything now goes through an SSH jump pod, so the SSH port remains fixed at 22 and we don't need to fetch the port here. Note that the jump port is dynamic and is fetched in kubernetes_utils.get_ssh_proxy_command at provisioning time.
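
For readers unfamiliar with how a dynamic NodePort is discovered, here is a minimal sketch that reads it from a Service manifest in the shape printed by `kubectl get service <name> -o json`. This is not the implementation of kubernetes_utils.get_ssh_proxy_command; the function name and the choice to match on service port 22 are assumptions for illustration.

```python
from typing import Any, Dict


def get_node_port(service: Dict[str, Any], service_port: int = 22) -> int:
    """Return the nodePort mapped to `service_port` in a Service manifest.

    Hypothetical helper: `service` is the parsed JSON of a Kubernetes
    Service of type NodePort.
    """
    for port in service.get('spec', {}).get('ports', []):
        # Each entry carries `port` (service port) and, for NodePort
        # services, the `nodePort` allocated by the cluster.
        if port.get('port') == service_port and 'nodePort' in port:
            return int(port['nodePort'])
    raise LookupError(
        f'No nodePort found for service port {service_port}.')
```

Because the cluster allocates the nodePort, the value has to be looked up after the Service exists, which matches the reply's point that it is fetched at provisioning time.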

sky/skylet/providers/kubernetes/node_provider.py

sky/templates/kubernetes-ray.yml.j2

sky/utils/command_runner.py

sky/utils/kubernetes/sshjump_lcm.py

sky/utils/kubernetes_utils.py

romilbhardwaj · 2023-09-15T22:46:51Z

Thanks for the reviews @Michaelvll! This is ready for another look.

Running smoke tests on GKE now:

pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials" with default port-forward mode
pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials" with nodeport mode set in ~/.sky/config
Tested jump pod lifecycle management by making sure ssh jump pod terminates after 10 min of no SkyPilot pods running in the cluster.
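
The idle-timeout behavior checked in the last item can be sketched as a small state machine. This is only an illustration of the rule "terminate after 10 minutes with no SkyPilot pods running"; the class name and method are hypothetical and do not mirror sky/utils/kubernetes/sshjump_lcm.py.

```python
from typing import Optional

IDLE_TIMEOUT_SECONDS = 10 * 60  # 10 minutes, as stated in the test above.


class JumpPodIdleTracker:
    """Tracks how long the cluster has had no SkyPilot pods (hypothetical)."""

    def __init__(self, timeout: float = IDLE_TIMEOUT_SECONDS) -> None:
        self.timeout = timeout
        self.idle_since: Optional[float] = None

    def should_terminate(self, running_pods: int, now: float) -> bool:
        """Return True once the idle period has reached the timeout."""
        if running_pods > 0:
            # Any running pod resets the idle clock.
            self.idle_since = None
            return False
        if self.idle_since is None:
            # First observation with zero pods starts the idle clock.
            self.idle_since = now
        return now - self.idle_since >= self.timeout
```

Passing the current time in as an argument, instead of calling a clock inside the method, makes the rule easy to test without waiting 10 minutes.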

One question I have is whether we would like to preserve the old NodePort way, as it seems we have removed some NodePort related code, not sure if it will still work. Also, for code simplicity, it would be nice if we can remove the old mode, if there is no strong need for it. ; )

That's a good point. The NodePort method is preserved for now since the port-forward mode might be considered a hack by some (it relies on tunneling over the API server, and that tunnel is designed only for development work). I was thinking we could collect feedback from users and deprecate NodePort in the future if port-forward works fine. In the meantime, users have an easy way to switch between methods if port-forward doesn't work for them. We are also not documenting the NodePort option for now, so that users do not use it unless they really need to.

Michaelvll

Thanks for the quick fix @romilbhardwaj! The code looks pretty good to me.

romilbhardwaj · 2023-09-16T16:13:44Z

Thanks for the fast reviews @Michaelvll! Waiting on nodeport smoke tests to pass, will merge after that.

romilbhardwaj added 30 commits February 3, 2023 16:47

Working Ray K8s node provider based on SSH

0431f96

Merge branch 'master' into k8s_cloud

5f715e8

wip

197acea

working provisioning with SkyPilot and ssh config

f06b22d

working provisioning with SkyPilot and ssh config

cf1ddec

Merge branch 'master' into k8s_cloud

0937cc3

# Conflicts: # sky/backends/backend_utils.py # sky/backends/cloud_vm_ray_backend.py # sky/registry.py # sky/utils/ux_utils.py

Updates to master

40aad6d

ray2.3

47d0953

Clean up docs

9f59467

multiarch build

07f9bcb

hacking around ray start

bd12014

more port fixes

4baf0b6

fix up default instance selection

7ed02eb

fix resource selection

898a851

Add provisioning timeout by checking if pods are ready

fcb51d1

Working mounting

13eb198

Remove catalog

428f143

fixes

ebf9d83

fixes

da570fc

Fix ssh-key auth to create unique secrets

1bea866

Fix for ContainerCreating timeout

9def756

Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud

8f9cafe

# Conflicts: # sky/backends/cloud_vm_ray_backend.py

Fix head node ssh port caching

65366eb

mypy

b984ead

lint

3bca8a9

fix ports

61df297

typo

036eaf9

cleanup

95e160c

cleanup

301a914

landscapepainter and others added 4 commits August 31, 2023 16:37

Update sky/skylet/providers/kubernetes/node_provider.py

08fd88d

fix test

693af6d

Merge remote-tracking branch 'origin/k8s_zeroconf_networking' into k8…

582b484

…s_zeroconf_networking

Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_ze…

33439e3

…roconf_networking

romilbhardwaj requested a review from concretevitamin September 8, 2023 18:49

romilbhardwaj added this to the 0.4 milestone Sep 11, 2023

This was linked to issues Sep 11, 2023

[k8s] CUDA envvars don't work in ssh #2453

Closed

[k8s] Kubernetes environment variables don't show up in SkyPilot tasks #2287

Closed

romilbhardwaj added the blocked PR blocked by other issues label Sep 14, 2023

romilbhardwaj mentioned this pull request Sep 14, 2023

[k8s] Kubernetes Docs #2324

Merged

1 task

romilbhardwaj removed the blocked PR blocked by other issues label Sep 15, 2023

landscapepainter and others added 4 commits September 14, 2023 19:30

cleanup

4e8b678

Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_ze…

21cee8b

…roconf_networking # Conflicts: # sky/clouds/kubernetes.py

lint

d8302f0

Michaelvll reviewed Sep 15, 2023

romilbhardwaj added 6 commits September 15, 2023 10:58

Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_ze…

c7e8429

…roconf_networking

fix for ssh jump image_id

fd2976a

comments

9827bbb

ssh jump refactor

f74c9df

lint

657cd6f

image build fixes

9c4e338

romilbhardwaj requested a review from Michaelvll September 15, 2023 22:46

Michaelvll approved these changes Sep 16, 2023

Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_ze…

add29dd

…roconf_networking # Conflicts: # tests/kubernetes/README.md

romilbhardwaj merged commit f0d3dfc into master Sep 16, 2023
18 checks passed

romilbhardwaj deleted the k8s_zeroconf_networking branch September 16, 2023 17:48
