[SkyServe][Do Not Merge] Spot Policy #2783

Closed
wants to merge 250 commits into from

MaoZiming
Collaborator

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

infwinston and others added 30 commits July 13, 2023 22:51
* Add service schema

* use new serve YAML

* change to qpm

* change to fix node

* refactor init of SkyServiceSpec

* change http example to new yaml format

* update default value of from_yaml_config and handle service in task

* Launching successfully

* use argument in controller & redirector

* resolve comments

* use qps instead

* raise when multiple task found

* change to qps

* introduce constants

* introduce constants & fix bugs

* add sky down

* add Services
No existing services. without STATUS (but with #healthy replica

* format

* add llama2 example

* add fields to service db

* status with replica information

* fix policy parsing bug

* add auth todo

* add replica status todo

* change cluster name prefix and order of the column

* minor fixes

* reorder status

* change name: controller --> control plane

* change name: middleware --> controller

* clean code

* rename default service name

* env vars

* add purge and skip identity check on serve controller

* upload filemounts and workdir to storage & enhance --purge
… `sky serve logs` prototype (#2311)

* introducing multiprocessing prototype

* add run env to controller & redirector

* refactor and format

* add control-plane and redirector logs

* minor

* minor

* Refactor: move  to infra provider

* Refactor: move load balancer to redirector

* refactor, add more logging

* add replica status

* resolve some TODOs

* add post data feature

* rename, format

* add error message handling

* bug fix & logging

* fix a bug in continuous unhealthy

* add error when user port is same with control plane

* fix None post_data bug

* add stable diffusion example

* remove response body when code == 200

* add some TODOs and change RUNNING to READY

* add failed status

* add TODO for return failed replica info

* fix sky serve status --help error

* add console help messages

* remove redundant stable diffusion setup files

* rename healthy_replica --> ready_replica

* adopt advice from code review

* rename to service_name

* adopt advice from comment
… logs` for replica info (#2353)

* introducing multiprocessing prototype

* add run env to controller & redirector

* refactor and format

* add control-plane and redirector logs

* minor

* minor

* Refactor: move  to infra provider

* Refactor: move load balancer to redirector

* refactor, add more logging

* add replica status

* resolve some TODOs

* add post data feature

* rename, format

* add error message handling

* bug fix & logging

* fix a bug in continuous unhealthy

* add error when user port is same with control plane

* fix None post_data bug

* add stable diffusion example

* remove response body when code == 200

* add some TODOs and change RUNNING to READY

* add failed status

* add TODO for return failed replica info

* fix sky serve status --help error

* add console help messages

* remove redundant stable diffusion setup files

* rename healthy_replica --> ready_replica

* finish replica info & num

* finish

* adopt advice from code review

* rename to service_name

* finish state machine; TODO property based implementation

* adopt advice from comment

* adopt comments in #2311

* finish new replica status

* make http example more reasonable

* UX details & set default controller resources to VCPU=4

* add spinner for launching control plane & redirector process

* add sky serve logs CLI for replicas

* add uptime section for service table

* relaunch replicas that were terminated for exceeding the consecutive failure threshold

* UX details

* code style

* move serve dependency to controller yaml setup section

* add launch log for replica

* add resources preview

* stop jupyter service to avoid port conflict

* Apply suggestions from code review

Co-authored-by: Wei-Lin Chiang <[email protected]>

* fix replicas not being terminated when the user job or launch fails; rename replica status FAILED --> CLEANUP_FAILED since all FAILED replicas are now terminated immediately; remove --purge in termination

* ux nits

* 0.0.0.0 -> localhost

* new log logic: use cluster status == UP instead of waiting 10s; early quit for replica not exist; skip all detailed file sync log

* ux nits

* change readiness timeout to initial delay seconds

* disable some logging when SKYPILOT_DEBUG is not set

* restore debug yaml

* remove debug message

* sync down log before teardown

* rename failed status name (replica)

* change controller resources vcpu to 4+ to avoid no 4 vcpu cloud

* disable -c, -r, -i in sky serve logs CLI

* add REPLICA column in service status

* add CONTROLLER_FAILED status; wait until control plane & redirector job to be running.

* add color for CONTROLLER_FAILED and a prompt to cleanup first if re-up a failed service

* change uptime to first time ready

* format

* add comment for replica/service status in sky serve status -h

* simplify yaml design

* remove controller resources cloud=gcp

* remove controller resources cloud=gcpsome comment

* redirect setup logs to devnull

* redirector listen on 0.0.0.0 & add app_port to controller resources

* ux

* fix readiness suffix

* fix

* fix

* remove cloud=gcp

* ux: remove redundant str

* disable launch & down & stop with reserved prefix controller-

* support sky serve down service-*

* ux

* cleanup cloud storage when terminate

* enable customized controller resources

* abort if ports specified in resources

* reorder service status column

* new sky serve status: show replica all the time; refresh in parallel; check network first

* remove name since we have service name column

* at least one replica is ready -> service ready

* Update sky/cli.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/backends/backend_utils.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/status_lib.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/backends/backend_utils.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/backends/backend_utils.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/serve/redirector.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/serve/redirector.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* add vllm example

* upd http example

* change uptime to None and merge get_uptime and get_replica_info

* restore debug comment out code

* add comment for DEFAULT_INITIAL_DELAY_SECONDS

* min_replica -> min_replicas

* format

* Apply suggestions from code review

Co-authored-by: Wei-Lin Chiang <[email protected]>

* upd tgi example

* upd examples

* format, remove unnecessary refresh in sky serve logs, raise valueerror instead of click.secho red

* add minimal http example

* Apply suggestions from code review

Co-authored-by: Wei-Lin Chiang <[email protected]>

* fix typo

* Apply suggestions from code review

---------

Co-authored-by: Wei-Lin Chiang <[email protected]>
* add vicuna v1.5 example

* add replica ip in table; rename some vars

* warning if sky launch a service yaml

* format

* start progress after error log

* fix type name

* log format

* logger with skylogging format

* dump user app fail to control plane log

* ux

* add launched_at and service_yaml to local DB; delete cloud storage locally

* rapid bootstrapping

* format

* move skyserve controller to separate section in sky status

* add hint to see detailed sky serve status

* restore example

* rename control plane to controller

* rename to hello_skyserve

* rename to hello_skyserve

* change port to align doc

* inline controller failed checking

* override user resources parameter

* format

* add some todos

* remove redundant return

* use handle to store information

* fix error const name

* simplify resources representation

* check cluster status earlier

* minor

* minor

* add back service section since we still need it in controller

* restore vicuna example

* print all info when use sky serve status -a

* better handling of unknown status

* add warning for status that cannot be sky.down

* minor comment fixes

* remove Tip: to reuse an existing cluster

* enable extra port on controller

* more detailed info when acc is None

* Apply suggestions from code review

Co-authored-by: Wei-Lin Chiang <[email protected]>

* add doc string

---------

Co-authored-by: Wei-Lin Chiang <[email protected]>
* add msg

* shorten

* fix

* add msg
* add gcp tests

* add azure and aws test

* fix cloud dependencies

* use larger disk size to enable azure controller

* mixed cloud test & install gcloud cli

* format

* fix

* add prehook

* minor & add smoke test function
* add cancel and gorilla example

* update yaml & add readme

* add CLI request cancel

* Update sky/serve/examples/gorilla/gorilla.yaml

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/serve/examples/misc/cancel/service.yaml

Co-authored-by: Wei-Lin Chiang <[email protected]>

* advice from code review

* upd fschat installation

---------

Co-authored-by: Wei-Lin Chiang <[email protected]>
* fix

* format

* update skyserve prompt

* resolve comments

* fix vllm
Base automatically changed from serve-dev to master November 15, 2023 18:14
@MaoZiming MaoZiming closed this Nov 16, 2023
4 participants