Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] SkyServe on Kubernetes #3377

Merged
merged 89 commits into from
May 8, 2024
Merged

[k8s] SkyServe on Kubernetes #3377

merged 89 commits into from
May 8, 2024

Conversation

romilbhardwaj
Copy link
Collaborator

@romilbhardwaj romilbhardwaj commented Mar 28, 2024

Adds SkyServe support on Kubernetes. Replicas and controller can now be run on Kubernetes. Supersedes #3109.

Changes in this PR:

Tested:

  • Code formatting: bash format.sh
  • HTTP Serve Example on GKE with ingress
  • HTTP Serve Example on GKE with LoadBalancer
  • LLM Example on GKE with ingress
  • HTTP Serve Example on EKS with ingress
  • HTTP Serve Example on EKS with LoadBalancer
  • Onprem testing through design partner
  • All smoke tests, including serve tests now enabled for k8s (in progress)

@romilbhardwaj
Copy link
Collaborator Author

Thanks for the reviews @Michaelvll! Ready for another look. If it looks ok at a high level, I'll start running smoke tests.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the update @romilbhardwaj and sorry for the delay! LGTM. Could you help resolve the conflict? If the tests work fine, I think it is good to go.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/utils/common_utils.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
@@ -308,11 +308,12 @@ Let's bring up a real LLM chat service with FastChat + Vicuna. We'll use the `Vi
conda activate chatbot

echo 'Starting controller...'
python -u -m fastchat.serve.controller > ~/controller.log 2>&1 &
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we also mention serving on kubernetes a bit in the serving doc and refer to the cloud-permissions for the setup?
We may also want to explicitly list out why people should use skyserve on kubernetes than just using deployments, such as:

  1. easier customization for load balancer and autoscaling policy
  2. simple interface and easy to setup
  3. spillover to clouds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea - filed #3518, will submit a separate PR for docs.

…o serve_k8s_final

# Conflicts:
#	sky/backends/backend_utils.py
#	sky/cli.py
#	sky/utils/controller_utils.py
#	sky/utils/schemas.py
@romilbhardwaj
Copy link
Collaborator Author

romilbhardwaj commented May 7, 2024

Thanks @Michaelvll! Addressed comments. Running smoke and backcompat tests now:

  • Cloud smoke tests, including serve tests - pytest -v tests/test_smoke.py
  • Kubernetes smoke tests, including serve tests - pytest -v tests/test_smoke.py --kubernetes
  • Backward compatibility - run a service from master, switch branch and launch a new svc. Try to access both services.

@romilbhardwaj
Copy link
Collaborator Author

Update - had to add skip_status_check arg to backend_utils.get_endpoints() to disable checks for cluster status (i.e., do not fail early if cluster is not UP). This is similar to the previous behavior, where head_ip:port was used without checking cluster status. This is required when multiple sky serve up are run concurrently (e.g., in smoke tests), since each call would put the cluster in INIT state and get_endpoints call at the end of serve.core.up would fail:

  File "/home/gcpuser/.sky/file_mounts/sky_repo/sky/serve/core.py", line 257, in up
    endpoint = backend_utils.get_endpoints(
  File "/home/gcpuser/.sky/file_mounts/sky_repo/sky/backends/backend_utils.py", line 2756, in get_endpoints
    raise exceptions.ClusterNotUpError(
sky.exceptions.ClusterNotUpError: Cluster 'sky-serve-controller-c3f7de47' is not in UP status.

Smoke tests pass now, this is ready to be merged.

@@ -2,6 +2,7 @@ name: min

setup: |
echo "running setup"
sudo apt-get install -y jq
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to move this to test_minimal in test_smoke.py only. : )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, since this minimal.yaml is also used in a bunch of other tests. moved jq installation to test_minimal.yaml.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants