[SkyServe][Do Not Merge] Spot Policy #2783
Closed
add http server example
* Add service schema * use new serve YAML * change to qpm * change to fix node * refactor init of SkyServiceSpec * change http example to new yaml format * update default value of from_yaml_config and handle service in task * Launching successfully * use argument in controller & redirector * resolve comments * use qps instead * raise when multiple task found * change to qps * introduce constants * introduce constants & fix bugs * add sky down * add Services No existing services. without STATUS (but with #healthy replica * format * add llama2 example * add fields to service db * status with replica information * fix policy parsing bug * add auth todo * add replica status todo * change cluster name prefix and order of the column * minor fixes * reorder status * change name: controller --> control plane * change name: middleware --> controller * clean code * rename default service name * env vars * add purge and skip identity check on serve controller * upload filemounts and workdir to storage & enhance --purge
… `sky serve logs` prototype (#2311) * introducing multiprocessing prototype * add run env to controller & redirector * reefactor and format * add control-plane and redirector logs * minor * minor * Refactor: move to infra provider * Refactor: move load balancer to redirector * refactor, add more logging * add replica status * resolve some TODOs * add post data feature * rename, format * add error message handling * bug fix & logging * fix a bug in continuous unhealthy * add error when user port is same with control plane * fix None post_data bug * add stable diffusion example * remove response body when code == 200 * add some TODOs and change RUNNING to READY * add failed status * add TODO for return failed replica info * fix sky serve status --help error * add console help messages * remove redundant stable diffusion setup files * rename healthy_replica --> ready_replica * adopt advice from code review * rename to service_name * adopt advice from comment
… logs` for replica info (#2353) * introducing multiprocessing prototype * add run env to controller & redirector * reefactor and format * add control-plane and redirector logs * minor * minor * Refactor: move to infra provider * Refactor: move load balancer to redirector * refactor, add more logging * add replica status * resolve some TODOs * add post data feature * rename, format * add error message handling * bug fix & logging * fix a bug in continuous unhealthy * add error when user port is same with control plane * fix None post_data bug * add stable diffusion example * remove response body when code == 200 * add some TODOs and change RUNNING to READY * add failed status * add TODO for return failed replica info * fix sky serve status --help error * add console help messages * remove redundant stable diffusion setup files * rename healthy_replica --> ready_replica * finish replica info & num * finish * adopt advice from code review * rename to service_name * finish state machine; TODO property based implementation * adopt advice from comment * adopt comments in #2311 * finish new replica status * modify http example more resonable * UX details & set default controller resources to VCPU=4 * add spinner for launching contorl plane & redirector process * add sky serve logs CLI for replicas * add uptime section for service table * relaunch replicas which terminated by exceeding consecutive failure threshold * UX details * code style * move serve dependency to controller yaml setup section * add launch log for replica * add resources preview * stop jupyter service to avoid port conflict * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * fix userjob failed and launch failed not terminate replica; replica status FAILED --> CLEANUP_FAILED since we terminate all FAILED replica immediately now; remove --purge in termination * ux nits * 0.0.0.0 -> localhost * new log logic: use cluster status == UP instead of waiting 10s; early quit 
for replica not exist; skip all detailed file sync log * ux nits * change readiness timeout to initial delay seconds * disable some logging when SKYPILOT_DEBUG is not set * restore debug yaml * remove debug message * sync down log before teardown * rename failed status name (replica) * change controller resources vcpu to 4+ to avoid no 4 vcpu cloud * disable -c, -r, -i in sky serve logs CLI * add REPLICA column in service status * add CONTROLLER_FAILED status; wait until control plane & redirector job to be running. * add color for CONTROLLER_FAILED and a prompt to cleanup first if re-up a failed service * change uptime to first time ready * format * add comment for replica/service status in sky serve status -h * simplify yaml design * remove controller resources cloud=gcp * remove controller resources cloud=gcpsome comment * redirect setup logs to devnull * redirector listen on 0.0.0.0 & add app_port to controller resources * ux * fix readiness suffix * fix * fix * remove cloud=gcp * ux: remove reduncant str * disable launch & down & stop with reserved prefix controller- * support sky serve down service-* * ux * cleanup cloud storage when terminate * enable customized controller resources * abort if ports specified in resources * reorder service status column * new sky serve status: show replica all the time; refresh in parallel; check network first * remove name since we have service name column * at least one replica is ready -> service ready * Update sky/cli.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/backends/backend_utils.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/status_lib.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/backends/backend_utils.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/backends/backend_utils.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/serve/redirector.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update 
sky/serve/redirector.py Co-authored-by: Wei-Lin Chiang <[email protected]> * add vllm example * upd http example * change uptime to None and merge get_uptime and get_replica_info * restore debug comment out code * add comment for DEFAULT_INITIAL_DELAY_SECONDS * min_replica -> min_replcias * format * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * upd tgi example * upd examples * format, remove unnecessary refresh in sky serve logs, raise valueerror instead of click.secho red * add minimal http example * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * fix typo * Apply suggestions from code review --------- Co-authored-by: Wei-Lin Chiang <[email protected]>
* add vicuna v1.5 example * add replica ip in table; rename some vars * warning if sky launch a service yaml * format * start progress after error log * fix type name * log format * logger with skylogging format * dump user app fail to control plane log * ux * add launched_at and service_yaml to local DB; delete cloud storage locally * rapid bootstraping * format * move skyserve controller to separate section in sky status * add hint to see detailed sky serve status * restore example * rename control plane to controller * rename to hello_skyserve * rename to hello_skyserve * change port to align doc * inline controller failed checking * override user resources parameter * format * add some todos * remove redundant return * use handle to store information * fix error const name * simplify resources representation * check cluster status earlier * minor * minor * add back service section since we still need it in controller * restore vicuna example * print all info when use sky serve status -a * better handling of unknown status * add warning for status that cannot be sky.down * minor comment fixes * remove Tip: to reuse an existing cluster * enable extra port on controller * more detailed info when acc is None * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * add doc string --------- Co-authored-by: Wei-Lin Chiang <[email protected]>
* add msg * shorten * fix * add msg
* add gcp tests * add azure and aws test * fix cloud dependencies * use larger disk size to enable azure controller * mixed cloud test & install gcloud cli * format * fix * add prehook * minor & add smoke test function
* add cancel and gorilla example * update yaml & add readme * add CLI request cancel * Update sky/serve/examples/gorilla/gorilla.yaml Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/serve/examples/misc/cancel/service.yaml Co-authored-by: Wei-Lin Chiang <[email protected]> * advice from code review * upd fschat installation --------- Co-authored-by: Wei-Lin Chiang <[email protected]>
* fix * format * update skyserve prompt * resolve comments * fix vllm
…into serve-policy
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh