Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] SkyServe on Kubernetes #3377

Merged
merged 89 commits into from
May 8, 2024
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
89 commits
Select commit Hold shift + click to select a range
34598a2
playing around
romilbhardwaj Feb 1, 2024
95b1eb0
wip with hacks
romilbhardwaj Feb 2, 2024
f93e6dd
wip refactor get_endpoints
romilbhardwaj Feb 5, 2024
c8276bb
working get_endpoints
romilbhardwaj Feb 6, 2024
c713293
wip
romilbhardwaj Feb 6, 2024
aa286e8
fixed circular import
romilbhardwaj Feb 6, 2024
9f768a2
Working for ingress and loadbalancer svc
romilbhardwaj Feb 6, 2024
1848b3a
lint
romilbhardwaj Feb 7, 2024
ba0ff60
add purging from #3094
romilbhardwaj Feb 15, 2024
13f7241
Use local catalog on the controller too
romilbhardwaj Feb 15, 2024
996742a
use externalip if available
romilbhardwaj Feb 19, 2024
7c8b5cd
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
romilbhardwaj Feb 20, 2024
b222186
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
romilbhardwaj Feb 23, 2024
d852c61
add dshm_size_limit
romilbhardwaj Feb 23, 2024
b9014df
optimize dependency installation
romilbhardwaj Feb 23, 2024
282bcef
Add todo
romilbhardwaj Feb 24, 2024
d10fb24
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
romilbhardwaj Feb 28, 2024
3cb948f
optimize ingress
romilbhardwaj Mar 2, 2024
4a7e10a
fix
romilbhardwaj Mar 2, 2024
9e0d910
fix
romilbhardwaj Mar 2, 2024
9e51793
Merge branch 'k8s_ingress_optimization' of https://github.com/skypilo…
romilbhardwaj Mar 3, 2024
68bb6e8
remove autostop timing
romilbhardwaj Mar 4, 2024
b72a814
Fix URLs for raw IP:ports
romilbhardwaj Mar 5, 2024
a22b6d3
fixes
romilbhardwaj Mar 7, 2024
4637f91
wip
romilbhardwaj Mar 7, 2024
d27ae5f
SA wip
romilbhardwaj Mar 7, 2024
cdc2e3f
Allow use of service accounts through remote_identity field
romilbhardwaj Mar 8, 2024
643fb77
Make purge work for no clusters in kubeconfig
romilbhardwaj Mar 8, 2024
7649fcb
Handle ingress namespace not present
romilbhardwaj Mar 8, 2024
8c76eb6
setup optimizations and critical SA key fix
romilbhardwaj Mar 8, 2024
2e20600
fix docs
romilbhardwaj Mar 16, 2024
de6a99c
fix docs
romilbhardwaj Mar 16, 2024
03e44eb
Add support for skypilot.co/external-ip annotation for ingress
romilbhardwaj Mar 24, 2024
7436a7e
Remove dshm_size_limit
romilbhardwaj Mar 25, 2024
3cc5352
Undo kind changes
romilbhardwaj Mar 25, 2024
d836f17
Update service account docs
romilbhardwaj Mar 25, 2024
1b559d2
minor docs
romilbhardwaj Mar 25, 2024
cda27b2
update comment
romilbhardwaj Mar 26, 2024
23d730b
is_same_cloud to cloud_in_list
romilbhardwaj Mar 27, 2024
cba92e3
refactor query_ports to use head_ip
romilbhardwaj Mar 27, 2024
789cefe
autodown + http prefixing in callers
romilbhardwaj Mar 27, 2024
55b63de
fix ssh key issues when user hash is reused
romilbhardwaj Mar 27, 2024
73ad2e6
linting
romilbhardwaj Mar 27, 2024
31418d2
lint
romilbhardwaj Mar 27, 2024
0684f2b
lint, HOST_CONTROLLERS
romilbhardwaj Mar 27, 2024
e7f232e
add serve smoke tests for k8s
romilbhardwaj Mar 28, 2024
2a1b916
disallow file_mounts and workdir if no storage cloud is enabled
romilbhardwaj Mar 28, 2024
f242037
minor
romilbhardwaj Mar 28, 2024
9b12cf1
lint
romilbhardwaj Mar 28, 2024
87330bd
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
romilbhardwaj Mar 28, 2024
3ca5d47
update fastchat to use --host 127.0.0.1
romilbhardwaj Mar 29, 2024
02ef50a
extend timeout
romilbhardwaj Mar 29, 2024
259917f
docs comments
romilbhardwaj Apr 9, 2024
8066a2a
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
romilbhardwaj Apr 9, 2024
665a12e
rename to port
romilbhardwaj Apr 10, 2024
fcc63fe
add to core.py
romilbhardwaj Apr 10, 2024
df21a0b
docstrs
romilbhardwaj Apr 10, 2024
5d8c9c7
add docs on exec based auth
romilbhardwaj Apr 10, 2024
937feb6
expand elif
romilbhardwaj Apr 10, 2024
33dfb01
add lb comment
romilbhardwaj Apr 10, 2024
87c77df
refactor
romilbhardwaj Apr 11, 2024
99547db
refactor
romilbhardwaj Apr 11, 2024
3d24b41
fix docs build
romilbhardwaj Apr 11, 2024
750d4d4
add PODIP mode support
romilbhardwaj Apr 8, 2024
87d4d25
make ssh services optional
romilbhardwaj Apr 12, 2024
a31a3bc
nits
romilbhardwaj Apr 29, 2024
e3bb4d7
Revert "make ssh services optional"
romilbhardwaj Apr 29, 2024
12096fd
Revert "add PODIP mode support"
romilbhardwaj Apr 29, 2024
c33572e
nits
romilbhardwaj Apr 29, 2024
803df4a
use 0.0.0.0 when on k8s; use common impl for other clouds
romilbhardwaj Apr 29, 2024
a0ba0d1
return dict instead of raising errors in core.endpoints()
romilbhardwaj Apr 29, 2024
fb88218
lint
romilbhardwaj Apr 29, 2024
1d968a6
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
romilbhardwaj Apr 29, 2024
e22908a
merge fixes
romilbhardwaj Apr 29, 2024
34a719c
merge fixes
romilbhardwaj Apr 29, 2024
017dcb5
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
romilbhardwaj May 7, 2024
cd08d0f
merge fixes
romilbhardwaj May 7, 2024
05f1996
lint
romilbhardwaj May 7, 2024
97b1d56
fix smoke tests
romilbhardwaj May 7, 2024
9a8ccd1
fix smoke tests
romilbhardwaj May 7, 2024
3b7d33f
comment
romilbhardwaj May 7, 2024
e029ebc
add enum for remote identity
romilbhardwaj May 7, 2024
7b201f2
lint
romilbhardwaj May 7, 2024
8a54faf
add skip_status_check
romilbhardwaj May 8, 2024
5fb4232
remove zone requirement
romilbhardwaj May 8, 2024
2e8afe9
fix timings for test
romilbhardwaj May 8, 2024
736e087
silence curl download
romilbhardwaj May 8, 2024
e3768b4
move jq from yaml to test_minimal
romilbhardwaj May 8, 2024
26cc380
move jq from yaml to test_minimal
romilbhardwaj May 8, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 11 additions & 3 deletions sky/backends/backend_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -810,7 +810,7 @@ def write_cluster_config(
if fnmatch.fnmatchcase(cluster_name, list(profile.keys())[0]):
remote_identity = list(profile.values())[0]
break
if remote_identity != 'LOCAL_CREDENTIALS':
if remote_identity != schemas.RemoteIdentityOptions.LOCAL_CREDENTIALS.value:
if not cloud.supports_service_account_on_remote():
raise exceptions.InvalidCloudConfigs(
'remote_identity: SERVICE_ACCOUNT is specified in '
Expand Down Expand Up @@ -2721,13 +2721,20 @@ def check_stale_runtime_on_remote(returncode: int, stderr: str,


def get_endpoints(cluster: str,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have this functionality in core, e.g. being part of core.status, to keep the core functionality align with cli?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - I've added def endpoints() to sky.core. However, I'm letting serve/core.py directly call backend_utils.get_endpoints because from sky import core in serve/core.py causes a circular import.

port: Optional[Union[int, str]] = None) -> Dict[int, str]:
port: Optional[Union[int, str]] = None,
skip_status_check: bool = False) -> Dict[int, str]:
"""Gets the endpoint for a given cluster and port number (endpoint).

Args:
cluster: The name of the cluster.
port: The port number to get the endpoint for. If None, endpoints
for all ports are returned.
skip_status_check: Whether to skip the status check for the cluster.
This is useful when the cluster is known to be in a INIT state
and the caller wants to query the endpoints. Used by serve
controller to query endpoints during cluster launch when multiple
services may be getting launched in parallel (and as a result,
the controller may be in INIT status due to a concurrent launch).

Returns: A dictionary of port numbers to endpoints. If endpoint is None,
the dictionary will contain all ports:endpoints exposed on the cluster.
Expand All @@ -2751,7 +2758,8 @@ def get_endpoints(cluster: str,
refresh=False,
cluster_names=[cluster])
cluster_record = cluster_records[0]
if cluster_record['status'] != status_lib.ClusterStatus.UP:
if (not skip_status_check and
cluster_record['status'] != status_lib.ClusterStatus.UP):
with ux_utils.print_exception_no_traceback():
raise exceptions.ClusterNotUpError(
f'Cluster {cluster_record["name"]!r} '
Expand Down
1 change: 0 additions & 1 deletion sky/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,6 @@
from sky import exceptions
from sky import global_user_state
from sky import jobs as managed_jobs
from sky import provision as provision_lib
from sky import serve as serve_lib
from sky import sky_logging
from sky import status_lib
Expand Down
6 changes: 4 additions & 2 deletions sky/clouds/kubernetes.py
Original file line number Diff line number Diff line change
Expand Up @@ -271,11 +271,13 @@ def make_deploy_resources_variables(

remote_identity = skypilot_config.get_nested(
('kubernetes', 'remote_identity'), schemas.REMOTE_IDENTITY_DEFAULT)
if remote_identity == 'LOCAL_CREDENTIALS':
if (remote_identity ==
schemas.RemoteIdentityOptions.LOCAL_CREDENTIALS.value):
# SA name doesn't matter since automounting credentials is disabled
k8s_service_account_name = 'default'
k8s_automount_sa_token = 'false'
elif remote_identity == 'SERVICE_ACCOUNT':
elif (remote_identity ==
schemas.RemoteIdentityOptions.SERVICE_ACCOUNT.value):
# Use the default service account
k8s_service_account_name = self.SKY_DEFAULT_SERVICE_ACCOUNT_NAME
k8s_automount_sa_token = 'true'
Expand Down
4 changes: 2 additions & 2 deletions sky/provision/kubernetes/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -635,8 +635,8 @@ def is_kubeconfig_exec_auth() -> Tuple[bool, Optional[str]]:

remote_identity = skypilot_config.get_nested(
('kubernetes', 'remote_identity'), schemas.REMOTE_IDENTITY_DEFAULT)
if ('exec' in user_details.get('user', {}) and
remote_identity == 'LOCAL_CREDENTIALS'):
if ('exec' in user_details.get('user', {}) and remote_identity
== schemas.RemoteIdentityOptions.LOCAL_CREDENTIALS.value):
ctx_name = current_context['name']
exec_msg = ('exec-based authentication is used for '
f'Kubernetes context {ctx_name!r}.'
Expand Down
4 changes: 3 additions & 1 deletion sky/serve/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -255,7 +255,9 @@ def up(
lb_port = serve_utils.load_service_initialization_result(
lb_port_payload)
endpoint = backend_utils.get_endpoints(
controller_handle.cluster_name, lb_port)[lb_port]
controller_handle.cluster_name, lb_port,
skip_status_check=True).get(lb_port)
assert endpoint is not None, 'Did not get endpoint for controller.'

sky_logging.print(
f'{fore.CYAN}Service name: '
Expand Down
4 changes: 3 additions & 1 deletion sky/utils/common_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,9 @@ def _is_valid_user_hash(user_hash: Optional[str]) -> bool:
os.makedirs(os.path.dirname(_USER_HASH_FILE), exist_ok=True)
if not force_fresh_hash:
# Do not cache to file if force_fresh_hash is True since the file may
# be intentionally using a different hash.
# be intentionally using a different hash, e.g. we want to keep the
# user_hash for usage collection the same on the jobs/serve controller
# as users' local client.
with open(_USER_HASH_FILE, 'w', encoding='utf-8') as f:
f.write(user_hash)
return user_hash
Expand Down
5 changes: 3 additions & 2 deletions sky/utils/controller_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -242,8 +242,9 @@ def _get_cloud_dependencies_installation_commands(
'fi" && '
# Install kubectl
'(command -v kubectl &>/dev/null || '
'(curl -LO "https://dl.k8s.io/release/$(curl -L -s '
'https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && '
'(curl -LO "https://dl.k8s.io/release/'
'$(curl -L -s https://dl.k8s.io/release/stable.txt)'
'/bin/linux/amd64/kubectl" && '
'sudo install -o root -g root -m 0755 '
'kubectl /usr/local/bin/kubectl))')
if controller == Controllers.JOBS_CONTROLLER:
Expand Down
21 changes: 18 additions & 3 deletions sky/utils/schemas.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
Schemas conform to the JSON Schema specification as defined at
https://json-schema.org/
"""
import enum


def _check_not_both_fields_present(field1: str, field2: str):
Expand Down Expand Up @@ -522,10 +523,26 @@ def get_cluster_schema():
}
}


class RemoteIdentityOptions(enum.Enum):
"""Enum for remote identity types.

Some clouds (e.g., AWS, Kubernetes) also allow string values for remote
identity, which map to the service account/role to use. Those are not
included in this enum.
"""
LOCAL_CREDENTIALS = 'LOCAL_CREDENTIALS'
SERVICE_ACCOUNT = 'SERVICE_ACCOUNT'


REMOTE_IDENTITY_DEFAULT = RemoteIdentityOptions.LOCAL_CREDENTIALS.value

_REMOTE_IDENTITY_SCHEMA = {
'remote_identity': {
'type': 'string',
'case_insensitive_enum': ['LOCAL_CREDENTIALS', 'SERVICE_ACCOUNT']
'case_insensitive_enum': [
option.value for option in RemoteIdentityOptions
]
}
}

Expand Down Expand Up @@ -562,8 +579,6 @@ def get_cluster_schema():
},
}

REMOTE_IDENTITY_DEFAULT = 'LOCAL_CREDENTIALS'


def get_config_schema():
# pylint: disable=import-outside-toplevel
Expand Down
1 change: 0 additions & 1 deletion tests/skyserve/auto_restart.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,6 @@ service:
resources:
ports: 8080
cloud: gcp
zone: us-central1-a
cpus: 2+

workdir: examples/serve/http_server
Expand Down
7 changes: 4 additions & 3 deletions tests/test_smoke.py
Original file line number Diff line number Diff line change
Expand Up @@ -3244,7 +3244,7 @@ def test_skyserve_azure_http():


@pytest.mark.kubernetes
@pytest.mark.sky_serve
@pytest.mark.serve
def test_skyserve_kubernetes_http():
"""Test skyserve on Kubernetes"""
name = _get_service_name()
Expand Down Expand Up @@ -3407,6 +3407,7 @@ def test_skyserve_user_bug_restart(generic_cloud: str):


@pytest.mark.serve
@pytest.mark.no_kubernetes # Replicas on k8s may be running on the same node and have the same public IP
def test_skyserve_load_balancer(generic_cloud: str):
"""Test skyserve load balancer round-robin policy"""
name = _get_service_name()
Expand Down Expand Up @@ -3573,7 +3574,7 @@ def test_skyserve_fast_update(generic_cloud: str):
f'{_SERVE_ENDPOINT_WAIT.format(name=name)}; curl -L http://$endpoint | grep "Hi, SkyPilot here"',
f'sky serve update {name} --cloud {generic_cloud} --mode blue_green -y tests/skyserve/update/bump_version_after.yaml',
# sleep to wait for update to be registered.
'sleep 120',
'sleep 30',
# 2 on-deamnd (ready) + 1 on-demand (provisioning).
(
_check_replica_in_status(
Expand All @@ -3587,7 +3588,7 @@ def test_skyserve_fast_update(generic_cloud: str):
# Test rolling update
f'sky serve update {name} --cloud {generic_cloud} -y tests/skyserve/update/bump_version_before.yaml',
# sleep to wait for update to be registered.
'sleep 30',
'sleep 15',
# 2 on-deamnd (ready) + 1 on-demand (shutting down).
_check_replica_in_status(name, [(2, False, 'READY'),
(1, False, 'SHUTTING_DOWN')]),
Expand Down
1 change: 1 addition & 0 deletions tests/test_yamls/minimal.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ name: min

setup: |
echo "running setup"
sudo apt-get install -y jq
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to move this to test_minimal in test_smoke.py only. : )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, since this minimal.yaml is also used in a bunch of other tests. moved jq installation to test_minimal.yaml.


run: |
conda env list
Loading