[K8s] Zero config networking for Kubernetes #2500

Merged
merged 211 commits into from
Sep 16, 2023
Changes from 210 commits
0431f96
Working Ray K8s node provider based on SSH
romilbhardwaj Feb 4, 2023
5f715e8
Merge branch 'master' into k8s_cloud
romilbhardwaj Feb 4, 2023
197acea
wip
romilbhardwaj Feb 5, 2023
f06b22d
working provisioning with SkyPilot and ssh config
romilbhardwaj Feb 7, 2023
cf1ddec
working provisioning with SkyPilot and ssh config
romilbhardwaj Feb 8, 2023
0937cc3
Merge branch 'master' into k8s_cloud
romilbhardwaj Mar 16, 2023
40aad6d
Updates to master
romilbhardwaj Mar 16, 2023
47d0953
ray2.3
romilbhardwaj Mar 21, 2023
9f59467
Clean up docs
romilbhardwaj Mar 29, 2023
07f9bcb
multiarch build
romilbhardwaj Mar 31, 2023
bd12014
hacking around ray start
romilbhardwaj Mar 31, 2023
4baf0b6
more port fixes
romilbhardwaj Apr 3, 2023
b08eb1b
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jun 8, 2023
7ed02eb
fix up default instance selection
romilbhardwaj Jun 8, 2023
898a851
fix resource selection
romilbhardwaj Jun 8, 2023
fcb51d1
Add provisioning timeout by checking if pods are ready
romilbhardwaj Jun 9, 2023
13eb198
Working mounting
romilbhardwaj Jun 9, 2023
428f143
Remove catalog
romilbhardwaj Jun 13, 2023
ebf9d83
fixes
romilbhardwaj Jun 14, 2023
da570fc
fixes
romilbhardwaj Jun 15, 2023
1bea866
Fix ssh-key auth to create unique secrets
romilbhardwaj Jun 15, 2023
9def756
Fix for ContainerCreating timeout
romilbhardwaj Jun 15, 2023
8f9cafe
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jun 15, 2023
65366eb
Fix head node ssh port caching
romilbhardwaj Jun 15, 2023
b984ead
mypy
romilbhardwaj Jun 15, 2023
3bca8a9
lint
romilbhardwaj Jun 16, 2023
61df297
fix ports
romilbhardwaj Jun 16, 2023
036eaf9
typo
romilbhardwaj Jun 16, 2023
95e160c
cleanup
romilbhardwaj Jun 16, 2023
301a914
cleanup
romilbhardwaj Jun 16, 2023
2c88daf
wip
romilbhardwaj Jun 16, 2023
7ece7f7
Update setup
romilbhardwaj Jun 16, 2023
cc85f94
readme updates
romilbhardwaj Jun 16, 2023
0450cee
lint
romilbhardwaj Jun 16, 2023
f3f0578
Fix failover
romilbhardwaj Jun 16, 2023
574a9c6
Fix failover
romilbhardwaj Jun 16, 2023
0632b48
optimize setup
romilbhardwaj Jun 16, 2023
05508d3
Fix sync down logs for k8s
romilbhardwaj Jun 16, 2023
fb36a40
test wip
romilbhardwaj Jun 18, 2023
7db4027
instance name parsing wip
romilbhardwaj Jun 19, 2023
632ed30
Fix instance name parsing
romilbhardwaj Jun 20, 2023
d7bd766
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jun 20, 2023
1a444d1
Merge fixes for query_status
romilbhardwaj Jun 20, 2023
da9cba2
[k8s_cloud] Delete k8s service resources. (#2105)
aviweit Jun 20, 2023
81871ac
Status refresh WIP
romilbhardwaj Jun 20, 2023
0d1c4ac
refactor to kubernetes adaptor
romilbhardwaj Jun 20, 2023
8017020
tests wip
romilbhardwaj Jun 21, 2023
5d7f8e8
clean up auth
romilbhardwaj Jun 22, 2023
aa787f8
wip tests
romilbhardwaj Jun 22, 2023
c026559
cli
romilbhardwaj Jun 22, 2023
3dc80d2
cli
romilbhardwaj Jun 23, 2023
63ce29b
sky local up/down cli
romilbhardwaj Jun 23, 2023
f9d5b73
cli
romilbhardwaj Jun 23, 2023
b81647a
lint
romilbhardwaj Jun 23, 2023
050cfc2
lint
romilbhardwaj Jun 23, 2023
d64c394
lint
romilbhardwaj Jun 23, 2023
7367b4a
Speed up kind cluster creation
romilbhardwaj Jun 23, 2023
756c56c
tests
romilbhardwaj Jun 23, 2023
d4c0990
lint
romilbhardwaj Jun 23, 2023
b64dd19
tests
romilbhardwaj Jun 24, 2023
10333d7
handling for non-reachable clusters
romilbhardwaj Jun 25, 2023
b07fc58
Invalid kubeconfig handling
romilbhardwaj Jun 26, 2023
5af58aa
Timeout for sky check
romilbhardwaj Jun 26, 2023
4d6710f
code cleanup
romilbhardwaj Jun 27, 2023
c057c88
lint
romilbhardwaj Jun 27, 2023
b8e414e
Do not raise error if GPUs requested, return empty list
romilbhardwaj Jul 3, 2023
c2ebfe7
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jul 3, 2023
1fc857b
Address comments
romilbhardwaj Jul 5, 2023
0ae92eb
comments
romilbhardwaj Jul 5, 2023
10f302f
lint
romilbhardwaj Jul 5, 2023
2a4caac
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jul 13, 2023
54b2b28
Remove public key upload
romilbhardwaj Jul 13, 2023
fc362b7
GPU support init
romilbhardwaj Jul 14, 2023
36f9ebc
wip
romilbhardwaj Jul 15, 2023
5ee821d
add shebang
romilbhardwaj Jul 15, 2023
d6ca85a
comments
romilbhardwaj Jul 16, 2023
fbae4bf
change permissions
romilbhardwaj Jul 16, 2023
6e9e6ba
remove chmod
romilbhardwaj Jul 16, 2023
7fa9d7e
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jul 16, 2023
a3f827e
merge 2241
romilbhardwaj Jul 16, 2023
9687ea8
add todo
romilbhardwaj Jul 16, 2023
4b54555
Handle kube config management for sky local commands (#2253)
hemildesai Jul 19, 2023
f73f1b2
Switch context in create_cluster if cluster already exists.
romilbhardwaj Jul 19, 2023
0c45b9a
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jul 20, 2023
a69df01
fix typo
romilbhardwaj Jul 20, 2023
ff1d832
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jul 20, 2023
6a931e2
update sky check error msg after sky local down
romilbhardwaj Jul 20, 2023
662e4b9
lint
romilbhardwaj Jul 20, 2023
4046749
update timeout check
romilbhardwaj Jul 21, 2023
92d588d
fix import error
romilbhardwaj Jul 21, 2023
9ff1662
Fix kube API access from within cluster (load_incluster_auth)
romilbhardwaj Jul 21, 2023
364b03f
lint
romilbhardwaj Jul 21, 2023
691f6b7
lint
romilbhardwaj Jul 21, 2023
ed0741f
working autodown and sky status -r
romilbhardwaj Jul 21, 2023
3fe9bfb
lint
romilbhardwaj Jul 21, 2023
b98ced3
add test_kubernetes_autodown
romilbhardwaj Jul 21, 2023
07ea97d
lint
romilbhardwaj Jul 24, 2023
73ee737
address comments
romilbhardwaj Jul 24, 2023
7726850
address comments
romilbhardwaj Jul 24, 2023
2ee4833
lint
romilbhardwaj Jul 24, 2023
9e0f5b6
deletion timeouts wip
romilbhardwaj Jul 25, 2023
b36fba4
[k8s_cloud] Ray pod not created under current context namespace. (#2302)
aviweit Jul 26, 2023
c137360
Merge branch 'k8s_cloud' of github.com:skypilot-org/skypilot into k8s…
romilbhardwaj Jul 26, 2023
a806b39
head ssh port namespace fix
romilbhardwaj Jul 26, 2023
a9b9636
[k8s-cloud] Typo in sky local --help. (#2308)
aviweit Jul 26, 2023
7903339
[k8s-cloud] Set build_image.sh to be executable. (#2307)
aviweit Jul 26, 2023
4ab5329
remove ingress
romilbhardwaj Jul 26, 2023
4b49241
remove debug statements
romilbhardwaj Jul 26, 2023
83aecd3
UX and readme updates
romilbhardwaj Jul 26, 2023
bdeb7d5
lint
romilbhardwaj Jul 26, 2023
993f736
Merge branch 'k8s_cloud' of github.com:skypilot-org/skypilot into k8s…
romilbhardwaj Jul 26, 2023
4fb1d94
fix logging for 409 retry
romilbhardwaj Jul 26, 2023
02e3415
lint
romilbhardwaj Jul 26, 2023
c1b7438
lint
romilbhardwaj Jul 26, 2023
b9701ca
Merge branch 'k8s_cloud' of github.com:skypilot-org/skypilot into k8s…
romilbhardwaj Jul 31, 2023
4289462
Debug dockerfile
romilbhardwaj Jul 31, 2023
3d770bd
wip
romilbhardwaj Aug 1, 2023
25f84b1
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cl…
romilbhardwaj Aug 3, 2023
2875ff9
Fix GPU image
romilbhardwaj Aug 3, 2023
1202c34
Query cloud specific env vars in task setup (#2347)
hemildesai Aug 4, 2023
d8e5bd2
Merge branch 'k8s_cloud_beta1' of github.com:skypilot-org/skypilot in…
romilbhardwaj Aug 4, 2023
d1a6ef4
working GPU type selection for GKE and EKS. GFD needs work.
romilbhardwaj Aug 4, 2023
b3fcadc
TODO for auto-detection
romilbhardwaj Aug 4, 2023
4a7d5d7
Add image toggling for CPU/GPU
romilbhardwaj Aug 4, 2023
85ee1e1
Add image toggling for CPU/GPU
romilbhardwaj Aug 4, 2023
d95438b
Fix none acce_type
romilbhardwaj Aug 4, 2023
607ad85
remove memory from j2
romilbhardwaj Aug 7, 2023
6f702da
Make resnet examples run again
romilbhardwaj Aug 7, 2023
738ae19
lint
romilbhardwaj Aug 7, 2023
9cdbf86
Merge branch 'example_resnet_cudnn' of github.com:skypilot-org/skypil…
romilbhardwaj Aug 7, 2023
c3420a8
v100 readme
romilbhardwaj Aug 8, 2023
c87c64d
dockerfile and smoketest
romilbhardwaj Aug 9, 2023
85f2b9e
fractional cpu and mem
romilbhardwaj Aug 13, 2023
509fd96
nits
romilbhardwaj Aug 13, 2023
22b1d17
refactor utils
romilbhardwaj Aug 13, 2023
552481c
lint and cleanup
romilbhardwaj Aug 13, 2023
33b29b8
lint and cleanup
romilbhardwaj Aug 13, 2023
e65d3c1
lint and cleanup
romilbhardwaj Aug 13, 2023
82327fb
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cl…
romilbhardwaj Aug 13, 2023
22fc6ad
lint and cleanup
romilbhardwaj Aug 13, 2023
3e9656a
lint and cleanup
romilbhardwaj Aug 13, 2023
be3d905
lint and cleanup
romilbhardwaj Aug 13, 2023
277295a
lint
romilbhardwaj Aug 14, 2023
69168dd
lint
romilbhardwaj Aug 14, 2023
cfe2502
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cl…
romilbhardwaj Aug 14, 2023
3004951
manual lint
romilbhardwaj Aug 14, 2023
b76b3a6
manual isort
romilbhardwaj Aug 14, 2023
7207c34
test readme update
romilbhardwaj Aug 14, 2023
56ac60f
Remove EKS
romilbhardwaj Aug 14, 2023
d988307
lint
romilbhardwaj Aug 14, 2023
a208d91
add gpu labeler
romilbhardwaj Aug 15, 2023
c857a9d
updates
romilbhardwaj Aug 15, 2023
ee89f65
lint
romilbhardwaj Aug 15, 2023
8934b22
update script
romilbhardwaj Aug 15, 2023
9b5019b
ux
romilbhardwaj Aug 15, 2023
53e5d80
fix formatter
romilbhardwaj Aug 15, 2023
f806aed
test update
romilbhardwaj Aug 15, 2023
4bf43ee
test update
romilbhardwaj Aug 15, 2023
429eed4
fix test_optimizer_dryruns
romilbhardwaj Aug 15, 2023
8dd1a76
docs
romilbhardwaj Aug 15, 2023
df10bc6
cleanup
romilbhardwaj Aug 15, 2023
512d9fb
test readme update
romilbhardwaj Aug 15, 2023
858eb51
lint
romilbhardwaj Aug 16, 2023
96647bf
lint
romilbhardwaj Aug 16, 2023
fdff1a6
[k8s_cloud_beta1] Add sshjump host support. (#2369)
aviweit Aug 17, 2023
1b8385c
Merge branch 'k8s_cloud_beta1' of github.com:skypilot-org/skypilot in…
romilbhardwaj Aug 17, 2023
3ef135a
Update build image
romilbhardwaj Aug 17, 2023
7b638cc
fix image path
romilbhardwaj Aug 17, 2023
7da33e4
fix merge
romilbhardwaj Aug 17, 2023
5d4d27c
cleanup
romilbhardwaj Aug 17, 2023
e9d0ed1
lint
romilbhardwaj Aug 17, 2023
f21b50a
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cl…
romilbhardwaj Aug 17, 2023
f736236
fix utils ref
romilbhardwaj Aug 20, 2023
7b5d0b5
typo
romilbhardwaj Aug 20, 2023
8a3d5a7
refactor pod creation
romilbhardwaj Aug 20, 2023
58b8126
lint
romilbhardwaj Aug 20, 2023
f9b401e
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cl…
romilbhardwaj Aug 25, 2023
950de00
merge fixes
romilbhardwaj Aug 25, 2023
292a350
portfix
romilbhardwaj Aug 25, 2023
c17f854
merge fixes
romilbhardwaj Aug 25, 2023
330c3b4
[k8s_cloud_beta1] Sky down for a cluster deployed in Kubernetes to po…
aviweit Aug 27, 2023
d760676
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cl…
romilbhardwaj Aug 27, 2023
f2ea761
cleanup
romilbhardwaj Aug 28, 2023
9d34ff7
Add networking benchmarks
romilbhardwaj Aug 28, 2023
5b5aacd
comment
romilbhardwaj Aug 29, 2023
aae4676
comment
romilbhardwaj Aug 29, 2023
2eedca6
lint
romilbhardwaj Aug 29, 2023
b07748d
autodown fixes
romilbhardwaj Aug 29, 2023
e379291
lint
romilbhardwaj Aug 29, 2023
482a69d
fix label
romilbhardwaj Aug 30, 2023
fb09398
[k8s_cloud_beta1] Adding support for ssh using kubectl port-forward t…
landscapepainter Aug 31, 2023
a721f83
refactor
romilbhardwaj Aug 31, 2023
9a1cdbe
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_ze…
romilbhardwaj Aug 31, 2023
c620f94
fix
romilbhardwaj Aug 31, 2023
94bf1a9
updates
romilbhardwaj Aug 31, 2023
48d53a5
lint
romilbhardwaj Aug 31, 2023
08fd88d
Update sky/skylet/providers/kubernetes/node_provider.py
landscapepainter Aug 31, 2023
693af6d
fix test
romilbhardwaj Sep 5, 2023
582b484
Merge remote-tracking branch 'origin/k8s_zeroconf_networking' into k8…
romilbhardwaj Sep 5, 2023
33439e3
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_ze…
romilbhardwaj Sep 7, 2023
d214495
[k8s] Showing reasons for provisioning failure in K8s (#2422)
landscapepainter Sep 15, 2023
4e8b678
cleanup
romilbhardwaj Sep 15, 2023
21cee8b
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_ze…
romilbhardwaj Sep 15, 2023
d8302f0
lint
romilbhardwaj Sep 15, 2023
c7e8429
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_ze…
romilbhardwaj Sep 15, 2023
fd2976a
fix for ssh jump image_id
romilbhardwaj Sep 15, 2023
9827bbb
comments
romilbhardwaj Sep 15, 2023
f74c9df
ssh jump refactor
romilbhardwaj Sep 15, 2023
657cd6f
lint
romilbhardwaj Sep 15, 2023
9c4e338
image build fixes
romilbhardwaj Sep 15, 2023
add29dd
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_ze…
romilbhardwaj Sep 16, 2023
3 changes: 3 additions & 0 deletions Dockerfile_k8s
@@ -45,6 +45,9 @@ RUN cd /skypilot/ && \
     sudo mv -v sky/setup_files/* . && \
     pip install ".[aws]"

+# Set PYTHONUNBUFFERED=1 to have Python print to stdout/stderr immediately
+ENV PYTHONUNBUFFERED=1
+
 # Set WORKDIR and initialize conda for sky user
 WORKDIR /home/sky
 RUN conda init
3 changes: 3 additions & 0 deletions Dockerfile_k8s_gpu
@@ -53,6 +53,9 @@ RUN cd /skypilot/ && \
     sudo mv -v sky/setup_files/* . && \
     pip install ".[aws]"

+# Set PYTHONUNBUFFERED=1 to have Python print to stdout/stderr immediately
+ENV PYTHONUNBUFFERED=1
+
 # Set WORKDIR and initialize conda for sky user
 WORKDIR /home/sky
 RUN conda init
43 changes: 43 additions & 0 deletions sky/authentication.py
@@ -37,10 +37,12 @@

 from sky import clouds
 from sky import sky_logging
+from sky import skypilot_config
 from sky.adaptors import gcp
 from sky.adaptors import ibm
 from sky.skylet.providers.lambda_cloud import lambda_utils
 from sky.utils import common_utils
+from sky.utils import kubernetes_utils
 from sky.utils import subprocess_utils
 from sky.utils import ux_utils

@@ -377,6 +379,21 @@ def setup_scp_authentication(config: Dict[str, Any]) -> Dict[str, Any]:


 def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
+    # Default ssh session is established with kubectl port-forwarding with
+    # ClusterIP service.
+    nodeport_mode = kubernetes_utils.KubernetesNetworkingMode.NODEPORT
+    port_forward_mode = kubernetes_utils.KubernetesNetworkingMode.PORTFORWARD
+    network_mode_str = skypilot_config.get_nested(('kubernetes', 'networking'),
+                                                  port_forward_mode.value)
+    try:
+        network_mode = kubernetes_utils.KubernetesNetworkingMode.from_str(
+            network_mode_str)
+    except ValueError as e:
+        # Add message saying "Please check: ~/.sky/config.yaml" to the error
+        # message.
+        with ux_utils.print_exception_no_traceback():
+            raise ValueError(str(e) + ' Please check: ~/.sky/config.yaml.') \
+                from None
     get_or_generate_keys()

     # Run kubectl command to add the public key to the cluster.
@@ -403,4 +420,30 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
             logger.error(suffix)
             raise

+    ssh_jump_name = clouds.Kubernetes.SKY_SSH_JUMP_NAME
+    if network_mode == nodeport_mode:
+        service_type = kubernetes_utils.KubernetesServiceType.NODEPORT
+    elif network_mode == port_forward_mode:
+        kubernetes_utils.check_port_forward_mode_dependencies()
+        # Using `kubectl port-forward` creates a direct tunnel to jump pod and
+        # does not require opening any ports on Kubernetes nodes. As a result,
+        # the service can be a simple ClusterIP service which we access with
+        # `kubectl port-forward`.
+        service_type = kubernetes_utils.KubernetesServiceType.CLUSTERIP
+    else:
+        # This should never happen because we check for this in from_str above.
+        raise ValueError(f'Unsupported networking mode: {network_mode_str}')
+    # Setup service for SSH jump pod. We create the SSH jump service here
+    # because we need to know the service IP address and port to set the
+    # ssh_proxy_command in the autoscaler config.
+    namespace = kubernetes_utils.get_current_kube_config_context_namespace()
+    kubernetes_utils.setup_ssh_jump_svc(ssh_jump_name, namespace, service_type)
+
+    ssh_proxy_cmd = kubernetes_utils.get_ssh_proxy_command(
+        PRIVATE_SSH_KEY_PATH, ssh_jump_name, network_mode, namespace,
+        clouds.Kubernetes.PORT_FORWARD_PROXY_CMD_PATH,
+        clouds.Kubernetes.PORT_FORWARD_PROXY_CMD_TEMPLATE)
+
+    config['auth']['ssh_proxy_command'] = ssh_proxy_cmd
+
     return config
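
The mode selection in `setup_kubernetes_authentication` can be sketched in isolation as follows. This is a minimal illustration, not the code in `kubernetes_utils`: the class `NetworkingMode`, the helper `service_type_for`, and the string values `'nodeport'` and `'portforward'` are assumptions made for the example; only the two mode names and the NodePort/ClusterIP mapping come from the diff above.

```python
import enum


class NetworkingMode(enum.Enum):
    """Hypothetical stand-in for kubernetes_utils.KubernetesNetworkingMode."""
    NODEPORT = 'nodeport'  # assumed string value
    PORTFORWARD = 'portforward'  # assumed string value

    @classmethod
    def from_str(cls, mode: str) -> 'NetworkingMode':
        # Accept any casing; reject anything that is not a known mode.
        try:
            return cls(mode.lower())
        except ValueError:
            valid = ', '.join(m.value for m in cls)
            raise ValueError(f'Unsupported networking mode: {mode!r}. '
                             f'Valid modes: {valid}.') from None


def service_type_for(mode: NetworkingMode) -> str:
    """Maps a networking mode to the Service type used for the jump pod."""
    if mode is NetworkingMode.NODEPORT:
        return 'NodePort'
    # Port-forward mode tunnels through `kubectl port-forward`, so an
    # in-cluster ClusterIP service is sufficient.
    return 'ClusterIP'
```

Validating the string once, at the point where the config is read, lets the later `if`/`elif` chain treat an unknown mode as an internal error rather than a user error.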
6 changes: 5 additions & 1 deletion sky/backends/backend_utils.py
@@ -1353,7 +1353,7 @@ def wait_until_ray_cluster_ready(

 def ssh_credential_from_yaml(cluster_yaml: str,
                              docker_user: Optional[str] = None
-                            ) -> Dict[str, str]:
+                            ) -> Dict[str, Any]:
     """Returns ssh_user, ssh_private_key and ssh_control name."""
     config = common_utils.read_yaml(cluster_yaml)
     auth_section = config['auth']
@@ -1369,6 +1369,10 @@ def ssh_credential_from_yaml(cluster_yaml: str,
     }
     if docker_user is not None:
         credentials['docker_user'] = docker_user
+    ssh_provider_module = config['provider']['module']
+    # If we are running ssh command on kubernetes node.
+    if 'kubernetes' in ssh_provider_module:
+        credentials['disable_control_master'] = True
     return credentials
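
The diff does not show where `disable_control_master` is consumed. One plausible reading, sketched below under that assumption, is that the SSH command builder omits OpenSSH connection-multiplexing options when the flag is set. The helper `build_ssh_options` and the default control path are hypothetical; `ControlMaster`, `ControlPath` and `ControlPersist` are standard `ssh_config` options.

```python
from typing import Any, Dict, List


def build_ssh_options(credentials: Dict[str, Any],
                      control_path: str = '~/.ssh/sky-%C') -> List[str]:
    """Hypothetical helper: turns a credentials dict into ssh CLI options."""
    options = ['-i', credentials['ssh_private_key']]
    if not credentials.get('disable_control_master', False):
        # Reuse one TCP connection across repeated ssh invocations.
        options += [
            '-o', 'ControlMaster=auto',
            '-o', f'ControlPath={control_path}',
            '-o', 'ControlPersist=300s',
        ]
    return options
```

With the flag set, every invocation opens a fresh connection, which trades some latency for not depending on a shared control socket.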


49 changes: 6 additions & 43 deletions sky/backends/cloud_vm_ray_backend.py
@@ -2319,23 +2319,12 @@ def _update_cluster_region(self):
         self.launched_resources = self.launched_resources.copy(region=region)

     def update_ssh_ports(self, max_attempts: int = 1) -> None:
-        """Updates the cluster SSH ports cached in the handle."""
-        # TODO(romilb): Replace this with a call to the cloud class to get ports
-        # Use port 22 for everything except Kubernetes
-        if not isinstance(self.launched_resources.cloud, clouds.Kubernetes):
-            head_ssh_port = 22
-        else:
-            svc_name = f'{self.cluster_name_on_cloud}-ray-head-ssh'
-            retry_cnt = 0
-            while True:
-                try:
-                    head_ssh_port = clouds.Kubernetes.get_port(svc_name)
-                    break
-                except Exception:  # pylint: disable=broad-except
-                    retry_cnt += 1
-                    if retry_cnt >= max_attempts:
-                        raise
Review comment on lines -2328 to -2337

Collaborator:
Does removing this mean the NodePort mode will not work?

romilbhardwaj (Collaborator, Author), Sep 15, 2023:
No, NodePort would still work. Everything now goes through an SSH jump pod, so the SSH port stays fixed at 22 and we don't need to look up the port here. Note that the jump port is dynamic and is fetched in kubernetes_utils.get_ssh_proxy_command at provisioning time.
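
To illustrate why the target port can stay at 22 while only the jump port varies, here is a minimal sketch of an SSH `ProxyCommand` that tunnels through a jump host. It is not the output of `kubernetes_utils.get_ssh_proxy_command`, whose body is not shown in this diff; the function name, parameters, and the user, host and port values are hypothetical.

```python
import os
import shlex


def jump_proxy_command(private_key_path: str, jump_user: str, jump_host: str,
                       jump_port: int) -> str:
    """Builds an ssh ProxyCommand string that tunnels through a jump host."""
    key = shlex.quote(os.path.expanduser(private_key_path))
    # `-W %h:%p` asks the jump host to forward stdio to the final target;
    # ssh substitutes %h and %p with the target's host and port (22 here).
    return (f'ssh -i {key} -o StrictHostKeyChecking=no '
            f'-p {jump_port} -W %h:%p {jump_user}@{jump_host}')


# Only the jump port (hypothetical NodePort 30022) is dynamic.
proxy_cmd = jump_proxy_command('/home/sky/.ssh/sky-key', 'sky', '10.0.0.5',
                               30022)
```

The resulting string would be passed to the client as `-o ProxyCommand=...`, so the connection to the pod itself always targets port 22.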

-        # TODO(romilb): Multinode doesn't work with Kubernetes yet.
+        """Fetches and sets the SSH ports for the cluster nodes.
+
+        Use this method to use any cloud-specific port fetching logic.
+        """
+        del max_attempts  # Unused.
+        head_ssh_port = 22
         self.stable_ssh_ports = ([head_ssh_port] + [22] *
                                  (self.num_node_ips - 1))

@@ -3011,37 +3000,12 @@ def _sync_file_mounts(
         self._execute_file_mounts(handle, all_file_mounts)
         self._execute_storage_mounts(handle, storage_mounts)

-    def _update_envs_for_k8s(self, handle: CloudVmRayResourceHandle,
-                             task: task_lib.Task) -> None:
-        """Update envs with env vars from Kubernetes if cloud is Kubernetes.
-
-        Kubernetes automatically populates containers with critical environment
-        variables, such as those for discovering services running in the
-        cluster and CUDA/nvidia environment variables. We need to update task
-        environment variables with these env vars. This is needed for GPU
-        support and service discovery.
-
-        See https://github.com/skypilot-org/skypilot/issues/2287 for
-        more details.
-        """
-        if isinstance(handle.launched_resources.cloud, clouds.Kubernetes):
-            temp_envs = copy.deepcopy(task.envs)
-            cloud_env_vars = handle.launched_resources.cloud.query_env_vars(
-                handle.cluster_name_on_cloud)
-            task.update_envs(cloud_env_vars)
-
-            # Re update the envs with the original envs to give priority to
-            # the original envs.
-            task.update_envs(temp_envs)
-
     def _setup(self, handle: CloudVmRayResourceHandle, task: task_lib.Task,
                detach_setup: bool) -> None:
         start = time.time()
         style = colorama.Style
         fore = colorama.Fore

-        self._update_envs_for_k8s(handle, task)
-
         if task.setup is None:
             return

@@ -3350,7 +3314,6 @@ def _execute(
         # Check the task resources vs the cluster resources. Since `sky exec`
         # will not run the provision and _check_existing_cluster
         self.check_resources_fit_cluster(handle, task)
-        self._update_envs_for_k8s(handle, task)

         resources_str = backend_utils.get_task_resources_str(task)

51 changes: 13 additions & 38 deletions sky/clouds/kubernetes.py
@@ -20,15 +20,18 @@

 logger = sky_logging.init_logger(__name__)

-_CREDENTIAL_PATH = '~/.kube/config'
+CREDENTIAL_PATH = '~/.kube/config'


 @clouds.CLOUD_REGISTRY.register
 class Kubernetes(clouds.Cloud):
     """Kubernetes."""

     SKY_SSH_KEY_SECRET_NAME = f'sky-ssh-{common_utils.get_user_hash()}'
-
+    SKY_SSH_JUMP_NAME = f'sky-ssh-jump-{common_utils.get_user_hash()}'
+    PORT_FORWARD_PROXY_CMD_TEMPLATE = \
+        'kubernetes-port-forward-proxy-command.sh.j2'
+    PORT_FORWARD_PROXY_CMD_PATH = '~/.sky/port-forward-proxy-cmd.sh'
     # Timeout for resource provisioning. This timeout determines how long to
     # wait for pod to be in pending status before giving up.
     # Larger timeout may be required for autoscaling clusters, since autoscaler
@@ -209,6 +212,9 @@ def make_deploy_resources_variables(
         assert image_id.startswith('skypilot:')
         image_id = service_catalog.get_image_id_from_tag(image_id,
                                                          clouds='kubernetes')
+        # TODO(romilb): Create a lightweight image for SSH jump host
+        ssh_jump_image = service_catalog.get_image_id_from_tag(
+            self.IMAGE_CPU, clouds='kubernetes')

         k8s_acc_label_key = None
         k8s_acc_label_value = None
@@ -229,6 +235,8 @@ def make_deploy_resources_variables(
             'k8s_ssh_key_secret_name': self.SKY_SSH_KEY_SECRET_NAME,
             'k8s_acc_label_key': k8s_acc_label_key,
             'k8s_acc_label_value': k8s_acc_label_value,
+            'k8s_ssh_jump_name': self.SKY_SSH_JUMP_NAME,
+            'k8s_ssh_jump_image': ssh_jump_image,
             # TODO(romilb): Allow user to specify custom images
             'image_id': image_id,
         }
@@ -298,7 +306,7 @@ def _make(instance_list):

     @classmethod
     def check_credentials(cls) -> Tuple[bool, Optional[str]]:
-        if os.path.exists(os.path.expanduser(_CREDENTIAL_PATH)):
+        if os.path.exists(os.path.expanduser(CREDENTIAL_PATH)):
             # Test using python API
             try:
                 return kubernetes_utils.check_credentials()
@@ -307,10 +315,10 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]:
                         f'{common_utils.format_exception(e)}')
         else:
             return (False, 'Credentials not found - '
-                    f'check if {_CREDENTIAL_PATH} exists.')
+                    f'check if {CREDENTIAL_PATH} exists.')

     def get_credential_file_mounts(self) -> Dict[str, str]:
-        return {_CREDENTIAL_PATH: _CREDENTIAL_PATH}
+        return {CREDENTIAL_PATH: CREDENTIAL_PATH}

     def instance_type_exists(self, instance_type: str) -> bool:
         return kubernetes_utils.KubernetesInstanceType.is_valid_instance_type(
@@ -368,39 +376,6 @@ def query_status(cls, name: str, tag_filters: Dict[str, str],
         # If pods are not found, we don't add them to the return list
         return cluster_status

-    @classmethod
-    def query_env_vars(cls, name: str) -> Dict[str, str]:
-        namespace = kubernetes_utils.get_current_kube_config_context_namespace()
-        pod = kubernetes.core_api().list_namespaced_pod(
-            namespace,
-            label_selector=f'skypilot-cluster={name},ray-node-type=head'
-        ).items[0]
-        response = kubernetes.stream()(
-            kubernetes.core_api().connect_get_namespaced_pod_exec,
-            pod.metadata.name,
-            namespace,
-            command=['env'],
-            stderr=True,
-            stdin=False,
-            stdout=True,
-            tty=False,
-            _request_timeout=kubernetes.API_TIMEOUT)
-        # Split response by newline and filter lines containing '='
-        raw_lines = response.split('\n')
-        filtered_lines = [line for line in raw_lines if '=' in line]
-
-        # Split each line at the first '=' occurrence
-        lines = [line.split('=', 1) for line in filtered_lines]
-
-        # Construct the dictionary using only valid environment variable names
-        env_vars = {}
-        for line in lines:
-            key = line[0]
-            if common_utils.is_valid_env_var(key):
-                env_vars[key] = line[1]
-
-        return env_vars
-
     @classmethod
     def get_current_user_identity(cls) -> Optional[List[str]]:
         k8s = kubernetes.get_kubernetes()
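
For reference, the parsing step of the removed `query_env_vars` method, together with the precedence rule from the removed `_update_envs_for_k8s`, can be reproduced without a cluster. This is a sketch: `parse_env_output` and `merge_envs` are hypothetical names, and the regular expression is an assumed approximation of `common_utils.is_valid_env_var`, whose definition is not part of this diff.

```python
import re
from typing import Dict

# Assumed rule for a valid name: letters, digits and underscores, not
# starting with a digit.
_ENV_NAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')


def parse_env_output(output: str) -> Dict[str, str]:
    """Parses the stdout of `env` into a dict, skipping malformed lines."""
    env_vars = {}
    for line in output.split('\n'):
        if '=' not in line:
            continue
        # Split at the first '=' only, so values may contain '='.
        key, value = line.split('=', 1)
        if _ENV_NAME.match(key):
            env_vars[key] = value
    return env_vars


def merge_envs(cloud_envs: Dict[str, str],
               task_envs: Dict[str, str]) -> Dict[str, str]:
    """Merges env vars, giving priority to those the task defines itself."""
    return {**cloud_envs, **task_envs}
```

Unpacking `task_envs` last has the same effect as the removed code's second `task.update_envs(temp_envs)` call: values set by the user win over values discovered in the pod.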
36 changes: 36 additions & 0 deletions sky/skylet/providers/kubernetes/config.py
@@ -55,6 +55,8 @@ def bootstrap_kubernetes(config: Dict[str, Any]) -> Dict[str, Any]:

     _configure_services(namespace, config['provider'])

+    config = _configure_ssh_jump(namespace, config)
+
     if not config['provider'].get('_operator'):
         # These steps are unecessary when using the Operator.
         _configure_autoscaler_service_account(namespace, config['provider'])
@@ -257,6 +259,40 @@ def _configure_autoscaler_role_binding(namespace: str,
     logger.info(log_prefix + created_msg(binding_field, name))


+def _configure_ssh_jump(namespace, config):
+    """Creates a SSH jump pod to connect to the cluster.
+
+    Also updates config['auth']['ssh_proxy_command'] to use the newly created
+    jump pod.
+    """
+    pod_cfg = config['available_node_types']['ray_head_default']['node_config']
+
+    ssh_jump_name = pod_cfg['metadata']['labels']['skypilot-ssh-jump']
+    ssh_jump_image = config['provider']['ssh_jump_image']
+
+    volumes = pod_cfg['spec']['volumes']
+    # find 'secret-volume' and get the secret name
+    secret_volume = next(filter(lambda x: x['name'] == 'secret-volume',
+                                volumes))
+    ssh_key_secret_name = secret_volume['secret']['secretName']
+
+    # TODO(romilb): We currently split SSH jump pod and svc creation. Service
+    # is first created in authentication.py::setup_kubernetes_authentication
+    # and then SSH jump pod creation happens here. This is because we need to
+    # set the ssh_proxy_command in the ray YAML before we pass it to the
+    # autoscaler. If in the future if we can write the ssh_proxy_command to the
+    # cluster yaml through this method, then we should move the service
+    # creation here.
+
+    # TODO(romilb): We should add a check here to make sure the service is up
+    # and available before we create the SSH jump pod. If for any reason the
+    # service is missing, we should raise an error.
+
+    kubernetes_utils.setup_ssh_jump_pod(ssh_jump_name, ssh_jump_image,
+                                        ssh_key_secret_name, namespace)
+    return config
+
+
 def _configure_services(namespace: str, provider_config: Dict[str,
                                                               Any]) -> None:
     service_field = 'services'
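
The secret lookup inside `_configure_ssh_jump` can be tried against a plain dictionary. The sketch below is an illustrative variant, not the merged code: the helper name `ssh_key_secret_name` and the sample pod spec are hypothetical, and it swaps `next(filter(...))` for an explicit loop so that a missing volume produces a descriptive error.

```python
from typing import Any, Dict


def ssh_key_secret_name(pod_cfg: Dict[str, Any],
                        volume_name: str = 'secret-volume') -> str:
    """Returns the secretName of the named volume in a pod spec dict."""
    for volume in pod_cfg['spec']['volumes']:
        if volume['name'] == volume_name:
            return volume['secret']['secretName']
    # next(filter(...)) without a default would raise a bare StopIteration
    # here; a KeyError that names the volume is easier to debug.
    raise KeyError(f'No volume named {volume_name!r} in pod spec.')


# Hypothetical pod spec fragment with the structure the diff indexes into.
sample_pod_cfg = {
    'spec': {
        'volumes': [
            {'name': 'dshm', 'emptyDir': {'medium': 'Memory'}},
            {'name': 'secret-volume', 'secret': {'secretName': 'sky-ssh-abc'}},
        ]
    }
}
secret_name = ssh_key_secret_name(sample_pod_cfg)
```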