[Serve/Spot] Allow spot queue/cancel/logs during controller INIT state #3288

Michaelvll · 2024-03-08T09:49:42Z

Fixes #1592, #3285

TODO:

Handle the case, when the controller is actually abnormal (autostopped or manually terminated)

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
- for i in `seq 1 100`; do sky spot launch -n test-$i --cloud gcp --cpus 2 -y -d "echo hi; sleep 120"; done; during the submission
- sky spot queue
- sky spot logs 10
- sky spot cancel 4 5 6
All smoke tests: pytest tests/test_smoke.py --managed-spot
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

concretevitamin · 2024-03-08T19:41:51Z

Thanks, this is great @Michaelvll!

Q: I have existing controllers. After upgrading to this branch and running a spot launch, it shows

Controller's latest status is INIT; jobs will not be shown until it becomes UP.

Is this expected for an old controller?

Michaelvll · 2024-03-08T20:20:25Z

Thanks, this is great @Michaelvll!

Q: I have existing controllers. After upgrading to this branch and running a spot launch, it shows
Controller's latest status is INIT; jobs will not be shown until it becomes UP.
Is this expected for an old controller?

No, this is not expected. Do you see this after the job is submitted, i.e. skypilot is trying to tail the log of the job?

I tried to start a spot controller on master and switch to this PR, but failed to reproduce the issue. Could you share what the commit you were using for the spot controller before upgrading?

concretevitamin

Thanks, some questions.

sky/backends/backend_utils.py

concretevitamin · 2024-03-08T21:34:51Z

sky/cli.py

        controller_type=controller_utils.Controllers.SPOT_CONTROLLER,
        stopped_message='All managed spot jobs should have finished.')
-    if handle is None:
+    if controller_status in [status_lib.ClusterStatus.STOPPED, None]:


Add a comment: Allow INIT to proceed.

(same below?)

What if there's some genuine problems? E.g., ssh into an UP controller, kill Ray, local status refresh to make ctrl go into INIT, then call sky cancel?

Added the comment here.

Will test the those cases more : )

When the controller is actually abnormal, we will still try to ssh into the controller and try to run the same command and retrieve the results.

sky/backends/backend_utils.py

concretevitamin

Thanks, some comments.

sky/backends/backend_utils.py

sky/core.py

sky/backends/backend_utils.py

sky/core.py

sky/utils/controller_utils.py

sky/utils/command_runner.py

concretevitamin · 2024-03-10T17:23:51Z

A potential thing to investigate:

UP spot controller
SSH in, pkill ray
status -r shows INIT
sky start sky-spot-controller-8a3968f2 succeeded
status -r still shows INIT

Refreshing status for 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--D 03-10 10:23:25 backend_utils.py:1586] Querying GCP cluster 'sky-spot-controller-8a3968f2' status:
D 03-10 10:23:25 backend_utils.py:1586] {'sky-spot-controller-8a3968f2-8a39-head-en3sz25n-compute': <ClusterStatus.UP: 'UP'>}
Refreshing status for 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--D 03-10 10:23:31 backend_utils.py:1803] Refreshing status ('sky-spot-controller-8a3968f2'): ray status not showing all nodes (0/1); output: <sky-payload>{"ray_port": 6380}</sky-payload>
D 03-10 10:23:31 backend_utils.py:1803] No cluster status. It may take a few seconds for the Ray internal services to start up.
D 03-10 10:23:31 backend_utils.py:1803] ; stderr:
D 03-10 10:23:31 backend_utils.py:1888] The cluster is abnormal. Setting to INIT status. node_statuses: [<ClusterStatus.UP: 'UP'>]
NAME                          LAUNCHED    RESOURCES                            STATUS  AUTOSTOP  COMMAND
sky-spot-controller-8a3968f2  2 mins ago  1x GCP(n2-standard-2, disk_size=50)  INIT    -         sky start sky-spot-contro...

Michaelvll · 2024-03-10T23:54:10Z

A potential thing to investigate:

UP spot controller
SSH in, pkill ray
status -r shows INIT
sky start sky-spot-controller-8a3968f2 succeeded
status -r still shows INIT

Refreshing status for 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--D 03-10 10:23:25 backend_utils.py:1586] Querying GCP cluster 'sky-spot-controller-8a3968f2' status:
D 03-10 10:23:25 backend_utils.py:1586] {'sky-spot-controller-8a3968f2-8a39-head-en3sz25n-compute': <ClusterStatus.UP: 'UP'>}
Refreshing status for 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--D 03-10 10:23:31 backend_utils.py:1803] Refreshing status ('sky-spot-controller-8a3968f2'): ray status not showing all nodes (0/1); output: <sky-payload>{"ray_port": 6380}</sky-payload>
D 03-10 10:23:31 backend_utils.py:1803] No cluster status. It may take a few seconds for the Ray internal services to start up.
D 03-10 10:23:31 backend_utils.py:1803] ; stderr:
D 03-10 10:23:31 backend_utils.py:1888] The cluster is abnormal. Setting to INIT status. node_statuses: [<ClusterStatus.UP: 'UP'>]
NAME                          LAUNCHED    RESOURCES                            STATUS  AUTOSTOP  COMMAND
sky-spot-controller-8a3968f2  2 mins ago  1x GCP(n2-standard-2, disk_size=50)  INIT    -         sky start sky-spot-contro...

This seems to be an unrelated issue to this PR. Could you help raise an issue for this?

sky/backends/backend_utils.py

sky/serve/core.py

sky/utils/command_runner.py

sky/backends/backend_utils.py

concretevitamin · 2024-03-11T03:28:58Z

Still going through the PR. But just realized a UX issue.

No controller
spot launch in another terminal
watch spot queue
Before, it was "No in-progress managed spot jobs" but when controller's being set up, it changes to "No in progress tasks."

concretevitamin

Sending comments first, haven't read cli/core.

Minor UX issue (missing space):

Services
No in-progress services.(See: sky serve -h)

sky/backends/backend_utils.py

sky/utils/controller_utils.py

concretevitamin

LGTM, thanks for the major UX improvement @Michaelvll!

Reminder #3288 (review) which seems to happen on the latest commit.

sky/serve/core.py

sky/backends/backend_utils.py

sky/core.py

concretevitamin · 2024-03-11T04:41:47Z

sky/core.py

-    if (refresh and controller_status in [
-            status_lib.ClusterStatus.STOPPED, status_lib.ClusterStatus.INIT
-    ]):
+    if refresh and handle is None:


Is this right: When refresh and handle is not None, we have performed a refresh in is_controller_accessible() and don't refresh again here.

If so, worth a comment.

No, when handle is not None we directly try to ssh into the controller to get the spot queue, but if refresh and handle is None here, we will try to restart the controller.

concretevitamin · 2024-03-11T05:12:35Z

Problem: After waiting for ~10 mins, the spot controller got autostopped (verified in console). However, local sky spot queue and sky status prints:

Fetching managed spot job statuses...
ssh: connect to host 35.188.201.107 port 22: Operation timed out
Managed spot jobs
Failed to connect to spot controller, please try again later.

and

...
Managed spot jobs
ssh: connect to host 35.188.201.107 port 22: Operation timed out
Failed to connect to spot controller, please try again later.

Services
No live services.(See: sky serve -h)

Its status became INIT. Running status refresh shows extra unexpected logging too:

» sst -r                                                       1 ↵
Clusters
NAME                          LAUNCHED     RESOURCES                            STATUS   AUTOSTOP  COMMAND
smoke                         4 weeks ago  1x GCP(n2-standard-8)                STOPPED  0m        sky start -i0 smoke -y
sky-spot-controller-8a3968f2  17 mins ago  1x GCP(n2-standard-2, disk_size=50)  STOPPED  -         sky spot launch -n test-2...

Managed spot jobs
⠸ Checking spot jobsssh: connect to host 35.188.201.107 port 22: Operation timed out
Failed to connect to spot controller, please try again later.

Services
No live services.(See: sky serve -h)

Co-authored-by: Zongheng Yang <[email protected]>

…into faster-spot-queue

Co-authored-by: Zongheng Yang <[email protected]>

…into faster-spot-queue

Michaelvll · 2024-03-11T06:38:28Z

Problem: After waiting for ~10 mins, the spot controller got autostopped (verified in console). However, local sky spot queue and sky status prints:

Fetching managed spot job statuses...
ssh: connect to host 35.188.201.107 port 22: Operation timed out
Managed spot jobs
Failed to connect to spot controller, please try again later.

and

...
Managed spot jobs
ssh: connect to host 35.188.201.107 port 22: Operation timed out
Failed to connect to spot controller, please try again later.

Services
No live services.(See: sky serve -h)

Its status became INIT. Running status refresh shows extra unexpected logging too:

» sst -r                                                       1 ↵
Clusters
NAME                          LAUNCHED     RESOURCES                            STATUS   AUTOSTOP  COMMAND
smoke                         4 weeks ago  1x GCP(n2-standard-8)                STOPPED  0m        sky start -i0 smoke -y
sky-spot-controller-8a3968f2  17 mins ago  1x GCP(n2-standard-2, disk_size=50)  STOPPED  -         sky spot launch -n test-2...

Managed spot jobs
⠸ Checking spot jobsssh: connect to host 35.188.201.107 port 22: Operation timed out
Failed to connect to spot controller, please try again later.

Services
No live services.(See: sky serve -h)

Good catch! Just to confirm, the unexpected logging you mentioned is the ssh: connect to host 35.188.201.107 port 22: Operation timed out. I have turned it off : )

concretevitamin · 2024-03-11T22:14:41Z

~~It seems like we still need to fix the issue of autostop -> auto update status into INIT (commit 36e4865) without refresh. Which is a bit surprising?~~

EDIT: false alert (I probably messed up the branch!).

Michaelvll · 2024-03-11T23:36:33Z

Fixed the UX:

$ sky status
Clusters
NAME                           LAUNCHED     RESOURCES                                                                  STATUS   AUTOSTOP  COMMAND                          
sky-spot-controller-084e3d6c   16 mins ago  1x GCP(n2-standard-8, disk_size=50)                                        STOPPED  10m       sky spot launch -n test echo...     

Managed spot jobs
No in-progress spot jobs.

Services
No existing services.

* To see detailed service status: sky serve status -a

$ sky spot queue
Fetching managed spot job statuses...
Managed spot jobs
No in-progress spot jobs. (See finished jobs: sky spot queue --refresh)

Michaelvll · 2024-03-12T01:09:29Z

Tested again with pytest tests/test_smoke.py --managed-spot and it works.

`sky jobs queue` used to output a temporary "waiting" message while the managed jobs controller was still being provisioned/starting. Since skypilot-org#3288 this is not shown, and instead the queued jobs themselves will show PENDING/STARTING. This also requires some changes to tests to permit the PENDING and STARTING states for managed jobs.

* [test] don't wait for old pending jobs controller messages `sky jobs queue` used to output a temporary "waiting" message while the managed jobs controller was still being provisioned/starting. Since #3288 this is not shown, and instead the queued jobs themselves will show PENDING/STARTING. This also requires some changes to tests to permit the PENDING and STARTING states for managed jobs. * fix default aws region * [test] wait for RECOVERING more quickly Smoke tests were failing because some managed jobs were fulling recovering back to the RUNNING state before the smoke test could catch the RECOVERING case (see e.g. #4192 `test_managed_jobs_cancellation_gcp`). Change tests that manually terminate a managed job instance, so that they will wait for the managed job to change away from the RUNNING state, checking every 10s. * address PR comments * fix

* [test] don't wait for old pending jobs controller messages `sky jobs queue` used to output a temporary "waiting" message while the managed jobs controller was still being provisioned/starting. Since skypilot-org#3288 this is not shown, and instead the queued jobs themselves will show PENDING/STARTING. This also requires some changes to tests to permit the PENDING and STARTING states for managed jobs. * fix default aws region * [test] wait for RECOVERING more quickly Smoke tests were failing because some managed jobs were fulling recovering back to the RUNNING state before the smoke test could catch the RECOVERING case (see e.g. skypilot-org#4192 `test_managed_jobs_cancellation_gcp`). Change tests that manually terminate a managed job instance, so that they will wait for the managed job to change away from the RUNNING state, checking every 10s. * address PR comments * fix

…f options (#4061) * user can select load balancing policies * some fixes * linting * Fixes according to comments * Linting * Linting * Fixed according to comments * fix * removed line from examples * Reverted changes * Reverted changes * Fixed according to comments * Linting * Update sky/serve/load_balancer.py Co-authored-by: Tian Xia <[email protected]> * [Catalog] Silently ignore TPU price not found. (#4134) * [Catalog] Silently ignore TPU price not found. * assert for non tpu v6e * format * [docs] Update GPUs used in docs (#4138) * Change V100 to H100 * updates * update * [k8s] Fix GPU labeling for EKS (#4146) Fix GPU labelling * [k8s] Handle @ in context name (#4147) Handle @ in context name * [Docs] Typo in distributed jobs docs (#4149) minor typo * [Performance] Refactor Azure SDK usage (#4139) * [Performance] Refactor Azure SDK usage * lazy import and address comments * address comments * fixes * fixes * nits * fixes * Fix OCI import issue (#4178) * Fix OCI import issue * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu <[email protected]> * edit comments --------- Co-authored-by: Zhanghao Wu <[email protected]> * [k8s] Add retry for apparmor failures (#4176) * Add retry for apparmor failures * add comment * [Docs] Update Managed Jobs page. (#4177) * [Docs] Update Managed Jobs page. * Lint * Updates * Minor: Jobs docs fix. (#4183) * [Docs] Update Managed Jobs page. * Lint * Updates * reword * [UX] remove all uses of deprecated `sky jobs` (#4173) * [UX] remove all uses of deprecated `sky jobs` * Apply suggestions from code review Co-authored-by: Romil Bhardwaj <[email protected]> * fix other mentions of "spot jobs" --------- Co-authored-by: Romil Bhardwaj <[email protected]> * [Azure] Support fractional A10 instance types (#3877) * fix * change catalog to float gpu num * support print float point gpu in sky launch. TODO: test if the ray deployment group works for fractional one * fix unittest * format * patch ray resources to ceil value * support launch from --gpus A10 * only allow strictly match fractional gpu counts * address comment * change back condition * fix * apply suggestions from code review * fix * Update sky/backends/cloud_vm_ray_backend.py Co-authored-by: Zhanghao Wu <[email protected]> * format * fix display of fuzzy candidates * fix precision issue * fix num gpu required * refactor in check_resources_fit_cluster * change type annotation of acc_count * enable fuzzy fp acc count * fix k8s * Update sky/clouds/service_catalog/common.py Co-authored-by: Zhanghao Wu <[email protected]> * fix integer gpus * format --------- Co-authored-by: Zhanghao Wu <[email protected]> * [Jobs] Refactor: Extract task failure state update helper (#4185) refactor: a unified exception handling utility * [Core] Remove backward compatibility code for 0.6.0 & 0.7.0 (#4175) * [Core] Remove backward compatibility code for 0.6.0 * remove backwards compatibility for 0.7.0 release * Update sky/serve/serve_state.py Co-authored-by: Romil Bhardwaj <[email protected]> * remove more * Revert "remove more" This reverts commit 34c28e9. * remove more but not instance tags --------- Co-authored-by: Christopher Cooper <[email protected]> Co-authored-by: Romil Bhardwaj <[email protected]> * Remove outdated pylint disabling comments (#4196) Update cloud_vm_ray_backend.py * [test] update default clouds for smoke tests (#4182) * [k8s] Show all kubernetes clusters in optimizer table (#4013) * Show all kubernetes clusters in optimizer table * format * Add comment * [Azure] Allow resource group specifiation for Azure instance provisioning (#3764) * Allow resource group specifiation for Azure instance provisioning * Add 'use_external_resource_group' under provider config * nit * attached resources deletion * support deployment removal when terminating * nit * delete RoleAssignment when terminating * update ARM config template * nit * nit * delete role assignment with guid * update role assignment removal logic * Separate resource group region and VM, attached resources * nit * nit * nit * nit * add error handling for deletion * format * deployment naming update * test * nit * update deployment constant names * update open_ports to wait for the nsg creation corresponding to the VM being provisioned * format * nit * format * update docstring * add back deleted snippet * format * delete nic with retries * error handle update * [dev] restrict pylint to changed files (#4184) * [dev] restrict pylint to changed files * fix glob * avoid use of xargs -d * Update packer scripts (#4203) * Update custom image packer script to exclude .sky and include python sys packages * add comments * Upgrade Azure SDK version requirement (#4204) * [Jobs] Add option to specify `max_restarts_on_errors` (#4169) * Add option to specify `max_retry_on_failure` * fix recover counts * fix log streaming * fix docs * fix * fix * fix * fix * fix default value * Fix spinner * Add unit test for default strategy * fix test * format * Update docs/source/examples/managed-jobs.rst Co-authored-by: Zongheng Yang <[email protected]> * rename to restarts * Update docs/source/examples/managed-jobs.rst Co-authored-by: Zongheng Yang <[email protected]> * update docs * warning instead of error out * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <[email protected]> * rename * add comment * fix * rename * Update sky/execution.py Co-authored-by: Romil Bhardwaj <[email protected]> * Update sky/execution.py Co-authored-by: Romil Bhardwaj <[email protected]> * address comments * format * commit changes for docs * Format --------- Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: Romil Bhardwaj <[email protected]> * [Core] Fix job race condition. (#4193) * [Core] Fix job race condition. * fix * simplify url * change to list_jobs * upd ray comments * only store jobs in ray_id_set * [Core] Fix issue with the wrong path of setup logs (#4209) * fix issue with a getting setup logs * More conservative * print error * comment * [Jobs] Fix jobs name (#4213) * fix issue with a getting setup logs * More conservative * print error * comment * Fix job name * [Performance] Speed up Azure A10 instance creation (#4205) * Use date instead of timestamp in skypilot image names * Speed up Azure A10 VM creation * disable nouveau and use smaller instance * address comments * address comments * add todo * [Tests] Fix public bucket tests (#4216) fix * [Catalog] Add TPU V6e. (#4218) * [Catalog] Add TPU V6e. * swap if else branch * [test] smoke test fixes for managed jobs (#4217) * [test] don't wait for old pending jobs controller messages `sky jobs queue` used to output a temporary "waiting" message while the managed jobs controller was still being provisioned/starting. Since #3288 this is not shown, and instead the queued jobs themselves will show PENDING/STARTING. This also requires some changes to tests to permit the PENDING and STARTING states for managed jobs. * fix default aws region * [test] wait for RECOVERING more quickly Smoke tests were failing because some managed jobs were fulling recovering back to the RUNNING state before the smoke test could catch the RECOVERING case (see e.g. #4192 `test_managed_jobs_cancellation_gcp`). Change tests that manually terminate a managed job instance, so that they will wait for the managed job to change away from the RUNNING state, checking every 10s. * address PR comments * fix * Add user toolkits to all sky custom images and fix PyTorch issue on A10 (#4219) * Add user toolkits to all sky custom images * address comments * [Core] Support TPU v6 (#4220) * init * fix * nit * format * add readme * add inference example * nit * add multi-host training * rephrase catalog doc * Update examples/tpu/v6e/README.md Co-authored-by: Zhanghao Wu <[email protected]> --------- Co-authored-by: Zhanghao Wu <[email protected]> * [Core] Make home address replacement more robust (#4227) * Make home address replacement more robust * format * [UX] sky launch --fast (#4159) * [UX] skip provisioning stages if cluster is already available * add new --skip-setup flag and further limit stages to match sky exec * rename flag to --fast * add smoke test for sky launch --fast * changes stages for --fast * fix --fast help message * add api test for fast param (outside CLI) * lint * explicitly specify stages * [Docs] Tpu v6 docs (#4221) * Update TPU v6 docs * tpu v6 docs * add TPU v6 * update * Fix tpu docs * fix indents * restructure TPU doc * Fix * Fix * fix * Fix TPU * fix docs * Update docs/source/reference/tpu.rst Co-authored-by: Tian Xia <[email protected]> --------- Co-authored-by: Tian Xia <[email protected]> * [ux] add sky jobs launch --fast (#4231) * [ux] add sky jobs launch --fast This flag will make the jobs controller launch use `sky launch --fast`. There are a few known situations where this can cause misbehavior in the jobs controller: - The SkyPilot wheel is outdated (due to changes in the SkyPilot code or a version upgrade). - The user's cloud credentials have changed. In this case the new credentials will not be synced, and if there are new clouds available in `sky check`, the cloud depedencies may not be correctly installed. However, this does speed up `jobs launch` _significantly_, so provide it as a dangerous option. Soon we will add robustness checks to `sky launch --fast` that will fix the above caveats, and we can remove this flag and just enable the behavior by default. * Apply suggestions from code review Co-authored-by: Romil Bhardwaj <[email protected]> * fix lint --------- Co-authored-by: Romil Bhardwaj <[email protected]> * [UX] Show 0.25 on controller queue (#4230) * Show 0.25 on controller queue * format * [Storage] Avoid opt-in regions for S3 (#4239) * S3 fix + timeout * S3 fix + timeout * lint * Update K8s docker image build and the source artifact registry (#4224) * Attempt at improving performance of k8s cluster launch * remove conda env creation * add multiple regions * K8s sky launch pulls the new docker images * Move k8s script * use us region only * typo * Remove --system-site-packages when setup sky cluster (#4168) * Remove --system-site-packages when setup sky cluster * add comments * [AWS/Azure] Avoid error out during image size check (#4244) * Avoid error out during image size check * Avoid error for azure * lint * [AWS] Disable additional auto update services for ubuntu image with cloud-init (#4252) * Disable additional auto update services for ubuntu image * simplify the commands * [Dashboard] Add a simple status filter. (#4253) * Disable more potential unattended upgrade sources for AWS (#4246) * Fix AWS unattended upgrade issue * more commands * add retry and disable all unattended * remove retry * disable unattended upgrades and add retry in aws default image * [docs]: OCI key_file path clarrification (#4262) * [docs]: OCI key_file path clarrification * Update installation.rst * [k8s] Parallelize setup for faster multi-node provisioning (#4240) * parallelize setup * lint * Add retries * lint * retry for get_remote_home_dir * optimize privilege check * parallelize termination * increase num threads * comments * lint * do not redirect stderr to /dev/null when submitting job (#4247) * do not redirect stderr to /dev/null when submitting job Should fix #4199. * remove grep, add worker_maximum_startup_concurrency override * [tests] Exclude runpod from smoke tests unless specified (#4238) Add runpod * Update comments pointing to Lambda's docs (#4272) * [Core] Avoid PENDING job to be set to FAILED and speed up job scheduling (#4264) * fix race condition for setting job status to FAILED during INIT * Fix * fix * format * Add smoke tests * revert pending submit * remove update entirely for the job schedule step * wait for job 32 to finish * fix smoke * move and rename * Add comment * minor * Set minimum port number a Ray worker can listen on to 11002 (#4278) Set worker minimum port number * [docs] use k8s instead of kubernetes in the CLI (#4164) * [docs] use k8s instead of kubernetes in the CLI * fix docs build script for linux * Update docs/source/reference/kubernetes/kubernetes-getting-started.rst Co-authored-by: Romil Bhardwaj <[email protected]> --------- Co-authored-by: Romil Bhardwaj <[email protected]> * [jobs] autodown managed job clusters (#4267) * [jobs] autodown managed job clusters If all goes correctly, the managed job controller should tear down a managed job cluster once the managed job completes. However, if the controller fails somehow (e.g. crashes, is terminated, etc), we don't want to leak resources. As a failsafe, set autodown on the job cluster. This is not foolproof, since the skylet on the cluster can also crash, but it's likely to catch many cases. * add comment about autodown duration * add leading _ * [UX] Improve Formatting of Post Job Creation Logs (#4198) * Update cloud_vm_ray_backend.py * Update cloud_vm_ray_backend.py * format * Fix `stream_logs` Duplicate Job Handling and TypeError (#4274) fix: multiple `job_id` * Update sky/serve/load_balancer.py Co-authored-by: Tian Xia <[email protected]> * feat(serve): Improve load balancing policy error message and display 1. Add available policies to schema validation 2. Show available policies in error message when invalid policy is specified 3. Display load balancing policy in service spec repr when explicitly set * fix(serve): Update load balancing policy schema to match implemented policies Only 'round_robin' is currently implemented in LoadBalancingPolicy class * linting * refactor(serve): Remove policy enum from schema Move policy validation to code to avoid duplication and make it easier to maintain when adding new policies * fix * linting * Update sky/serve/service_spec.py Co-authored-by: Tian Xia <[email protected]> * Fix circular import in schemas.py by moving load_balancing_policies import inside function * linting --------- Co-authored-by: Tian Xia <[email protected]> Co-authored-by: Romil Bhardwaj <[email protected]> Co-authored-by: Yika <[email protected]> Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: Christopher Cooper <[email protected]> Co-authored-by: Romil Bhardwaj <[email protected]> Co-authored-by: Andy Lee <[email protected]> Co-authored-by: landscapepainter <[email protected]> Co-authored-by: Hysun He <[email protected]> Co-authored-by: Cody Brownstein <[email protected]>

Michaelvll added 4 commits March 8, 2024 08:25

Make spot queue faster

96c1be3

fix

ef05184

remove INIT message

b23475a

Avoid showing INIT

2937738

Michaelvll requested review from cblmemo and concretevitamin March 8, 2024 18:28

concretevitamin reviewed Mar 8, 2024

View reviewed changes

Michaelvll changed the title ~~[Spot] Allow spot queue/cancel/logs during controller INIT state~~ [Serve/Spot] Allow spot queue/cancel/logs during controller INIT state Mar 8, 2024

Michaelvll added 2 commits March 9, 2024 00:55

address comments

0499b70

Refactor and fix the case when controller is abnormal

230fb22

Michaelvll requested a review from concretevitamin March 10, 2024 07:50

concretevitamin reviewed Mar 10, 2024

View reviewed changes

Michaelvll added 8 commits March 10, 2024 19:51

address comments

c2a1783

fix

2e62c37

fix caller

6c82e03

format

556f014

Add tests

ac0d59e

fix test

cb07f9f

format

699e00b

fix name

51c6641

Michaelvll requested a review from concretevitamin March 11, 2024 02:49

concretevitamin reviewed Mar 11, 2024

View reviewed changes

address comments

b3ff5d9

concretevitamin approved these changes Mar 11, 2024

View reviewed changes

concretevitamin mentioned this pull request Mar 11, 2024

[Spot/Serve] Messed up controller does not get reset to UP after sky start #3295

Closed

Michaelvll and others added 7 commits March 11, 2024 05:42

remove output

dd56f94

Update sky/serve/core.py

61d05b8

Co-authored-by: Zongheng Yang <[email protected]>

fix

cfb5e61

Merge branch 'faster-spot-queue' of github.com:skypilot-org/skypilot …

8a4fac0

…into faster-spot-queue

Update sky/backends/backend_utils.py

c1e224d

Co-authored-by: Zongheng Yang <[email protected]>

format

6e34249

Merge branch 'faster-spot-queue' of github.com:skypilot-org/skypilot …

36e4865

…into faster-spot-queue

fix ux

cac7093

Michaelvll merged commit 1c32bbb into master Mar 12, 2024
19 checks passed

Michaelvll deleted the faster-spot-queue branch March 12, 2024 01:09

Michaelvll mentioned this pull request Mar 12, 2024

[Serve] Allow termination of serve controllers in INIT state #3272

Closed

cg505 mentioned this pull request Oct 30, 2024

[test] smoke test fixes for managed jobs #4217

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Serve/Spot] Allow spot queue/cancel/logs during controller INIT state #3288

[Serve/Spot] Allow spot queue/cancel/logs during controller INIT state #3288

Michaelvll commented Mar 8, 2024 •

edited

Loading

concretevitamin commented Mar 8, 2024

Michaelvll commented Mar 8, 2024

concretevitamin left a comment

concretevitamin Mar 8, 2024

Michaelvll Mar 9, 2024

Michaelvll Mar 10, 2024

concretevitamin left a comment

concretevitamin commented Mar 10, 2024

Michaelvll commented Mar 10, 2024

concretevitamin commented Mar 11, 2024

concretevitamin left a comment

concretevitamin left a comment

concretevitamin Mar 11, 2024

Michaelvll Mar 11, 2024

concretevitamin commented Mar 11, 2024 •

edited

Loading

Michaelvll commented Mar 11, 2024

concretevitamin commented Mar 11, 2024 •

edited

Loading

Michaelvll commented Mar 11, 2024

Michaelvll commented Mar 12, 2024

[Serve/Spot] Allow spot queue/cancel/logs during controller INIT state #3288

[Serve/Spot] Allow spot queue/cancel/logs during controller INIT state #3288

Conversation

Michaelvll commented Mar 8, 2024 • edited Loading

concretevitamin commented Mar 8, 2024

Michaelvll commented Mar 8, 2024

concretevitamin left a comment

Choose a reason for hiding this comment

concretevitamin Mar 8, 2024

Choose a reason for hiding this comment

Michaelvll Mar 9, 2024

Choose a reason for hiding this comment

Michaelvll Mar 10, 2024

Choose a reason for hiding this comment

concretevitamin left a comment

Choose a reason for hiding this comment

concretevitamin commented Mar 10, 2024

Michaelvll commented Mar 10, 2024

concretevitamin commented Mar 11, 2024

concretevitamin left a comment

Choose a reason for hiding this comment

concretevitamin left a comment

Choose a reason for hiding this comment

concretevitamin Mar 11, 2024

Choose a reason for hiding this comment

Michaelvll Mar 11, 2024

Choose a reason for hiding this comment

concretevitamin commented Mar 11, 2024 • edited Loading

Michaelvll commented Mar 11, 2024

concretevitamin commented Mar 11, 2024 • edited Loading

Michaelvll commented Mar 11, 2024

Michaelvll commented Mar 12, 2024

Michaelvll commented Mar 8, 2024 •

edited

Loading

concretevitamin commented Mar 11, 2024 •

edited

Loading

concretevitamin commented Mar 11, 2024 •

edited

Loading