diff --git a/docs/source/images/sky-serve-architecture.png b/docs/source/images/sky-serve-architecture.png index 9312ccad25c..6f690d05552 100644 Binary files a/docs/source/images/sky-serve-architecture.png and b/docs/source/images/sky-serve-architecture.png differ diff --git a/docs/source/images/sky-serve-status-full.png b/docs/source/images/sky-serve-status-full.png new file mode 100644 index 00000000000..c0f575b9b65 Binary files /dev/null and b/docs/source/images/sky-serve-status-full.png differ diff --git a/docs/source/images/sky-serve-status-output-provisioning.png b/docs/source/images/sky-serve-status-output-provisioning.png new file mode 100644 index 00000000000..62ee21a4fd9 Binary files /dev/null and b/docs/source/images/sky-serve-status-output-provisioning.png differ diff --git a/docs/source/images/sky-serve-status-tgi.png b/docs/source/images/sky-serve-status-tgi.png new file mode 100644 index 00000000000..23d5bc7bb25 Binary files /dev/null and b/docs/source/images/sky-serve-status-tgi.png differ diff --git a/docs/source/images/sky-serve-status-vicuna-ready.png b/docs/source/images/sky-serve-status-vicuna-ready.png new file mode 100644 index 00000000000..9af1b807649 Binary files /dev/null and b/docs/source/images/sky-serve-status-vicuna-ready.png differ diff --git a/docs/source/index.rst b/docs/source/index.rst index a4cfcdb6319..e2ecd18d2ed 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -110,6 +110,13 @@ Documentation reference/kubernetes/index running-jobs/index +.. toctree:: + :maxdepth: 1 + :caption: SkyServe: Model Serving + + serving/sky-serve + serving/service-yaml-spec + .. toctree:: :maxdepth: 1 :caption: Cutting Cloud Costs diff --git a/docs/source/reference/cli.rst b/docs/source/reference/cli.rst index 3a40a3d920a..412cf04bb1a 100644 --- a/docs/source/reference/cli.rst +++ b/docs/source/reference/cli.rst @@ -59,6 +59,24 @@ Job Queue CLI :prog: sky cancel :nested: full +Sky Serve CLI +------------- + +.. 
click:: sky.cli:serve_up
   :prog: sky serve up
   :nested: full

.. click:: sky.cli:serve_down
   :prog: sky serve down
   :nested: full

.. click:: sky.cli:serve_status
   :prog: sky serve status
   :nested: full

.. click:: sky.cli:serve_logs
   :prog: sky serve logs
   :nested: full

Managed Spot Jobs CLI
---------------------------

diff --git a/docs/source/serving/service-yaml-spec.rst b/docs/source/serving/service-yaml-spec.rst
new file mode 100644
index 00000000000..43757e086cf
--- /dev/null
+++ b/docs/source/serving/service-yaml-spec.rst
@@ -0,0 +1,76 @@
.. _service-yaml-spec:

Service YAML Specification
==========================

SkyServe provides an intuitive YAML interface for specifying a service. It is highly similar to the :ref:`SkyPilot task YAML <yaml-spec>`: by adding a ``service`` section to your original task YAML, you turn it into a service YAML.

Available fields:

.. code-block:: yaml

   # Additional section that turns your SkyPilot task.yaml into a service.
   service:

     # Readiness probe (required). This describes how SkyServe determines whether
     # your service is ready to accept traffic. If the readiness probe gets a 200
     # response, SkyServe starts routing traffic to your service.
     readiness_probe:
       # Path to probe (required).
       path: /v1/models
       # Post data (optional). If specified, the readiness probe uses POST
       # instead of GET, and the post data is sent as the request body.
       post_data: {'model_name': 'model'}
       # Initial delay in seconds (optional). Defaults to 1200 seconds (20 minutes).
       # Any readiness probe failures during this period are ignored. This is
       # highly service-specific, so it is recommended to set this value based
       # on your service's startup time.
       initial_delay_seconds: 1200

     # There is also a simplified version of the readiness probe that only
     # contains the readiness probe path. To use the GET method and the
     # default initial delay, the following syntax suffices:
     readiness_probe: /v1/models

     # One of the two following fields (replica_policy or replicas) is required.

     # Replica autoscaling policy. This describes how SkyServe autoscales
     # your service based on the QPS (queries per second) of your service.
     replica_policy:
       # Minimum number of replicas (required).
       min_replicas: 1
       # Maximum number of replicas (optional). If not specified, SkyServe uses
       # a fixed number of replicas equal to min_replicas and ignores any QPS
       # thresholds specified below.
       max_replicas: 3
       # The following thresholds describe when to scale up or down.
       # QPS threshold for scaling up (optional). If the QPS of your service
       # exceeds this threshold, SkyServe scales your service up by one
       # replica. If not specified, SkyServe will **NOT** scale up your service.
       qps_upper_threshold: 10
       # QPS threshold for scaling down (optional). If the QPS of your service
       # falls below this threshold, SkyServe scales your service down by one
       # replica. If not specified, SkyServe will **NOT** scale down your service.
       qps_lower_threshold: 2

     # For convenience, there is also a simplified version of the replica policy
     # that uses a fixed number of replicas. Just use the following syntax:
     replicas: 2

     # Controller resources (optional). This describes the resources used for
     # the controller. Defaults to a 4+ vCPU instance with a 100 GB disk.
     controller_resources:
       cloud: aws
       region: us-east-1
       instance_type: p3.2xlarge
       disk_size: 256

   resources:
     # Port on which your service runs (required). This port is automatically
     # exposed by SkyServe; you can access your service at http://<endpoint>:<port>.
     ports: 8080
     # Other resources config...

   # Then comes your SkyPilot task YAML...
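To make the ``replica_policy`` semantics above concrete, here is a minimal sketch of the scale-by-one decision rule described by the comments (a hypothetical helper for illustration; this is not SkyServe's actual autoscaler implementation):

```python
def desired_replicas(current, qps,
                     min_replicas, max_replicas=None,
                     qps_upper_threshold=None, qps_lower_threshold=None):
    """Return the next replica count under the policy sketched above.

    Mirrors the spec: without max_replicas the size is fixed at min_replicas
    and QPS thresholds are ignored; otherwise QPS above/below the thresholds
    scales the service up/down by one replica, clamped to [min, max].
    """
    if max_replicas is None:
        # `replicas: N` / no max_replicas: fixed number of replicas.
        return min_replicas
    if qps_upper_threshold is not None and qps > qps_upper_threshold:
        return min(current + 1, max_replicas)   # scale up by one replica
    if qps_lower_threshold is not None and qps < qps_lower_threshold:
        return max(current - 1, min_replicas)   # scale down by one replica
    return current                              # within thresholds: no change
```

For example, with ``min_replicas: 1``, ``max_replicas: 3``, ``qps_upper_threshold: 10`` and ``qps_lower_threshold: 2``, a single replica seeing 12 QPS would scale to 2 replicas, while 3 replicas at 12 QPS stay capped at 3.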
+ diff --git a/docs/source/serving/sky-serve.rst b/docs/source/serving/sky-serve.rst new file mode 100644 index 00000000000..7a9034c85be --- /dev/null +++ b/docs/source/serving/sky-serve.rst @@ -0,0 +1,447 @@ +.. _sky-serve: + +Quickstart: Serving Models +========================== + +SkyServe is SkyPilot's model serving library. SkyServe (short for SkyPilot Serving) takes an existing serving +framework and deploys it across one or more regions or clouds. + +.. * Serve on scarce resources (e.g., A100; spot) with **reduced costs and increased availability** + +Why SkyServe? + +* **Bring any serving framework** (vLLM, TGI, FastAPI, ...) and scale it across regions/clouds +* **Reduce costs and increase availability** of service replicas by leveraging multiple/cheaper locations and hardware (spot instances) +* **Out-of-the-box load-balancing and autoscaling** of service replicas +* Manage multi-cloud, multi-region deployments with a single control plane +* **Privacy**: Everything is launched inside your cloud accounts and VPCs + +.. * Allocate scarce resources (e.g., A100) **across regions and clouds** +.. * Autoscale your endpoint deployment with load balancing +.. * Manage your multi-cloud resources with a single control plane + +How it works: + +- Each service gets an endpoint that automatically redirects requests to its underlying replicas. +- The replicas of the same service can run in different regions and clouds — reducing cloud costs and increasing availability. +- SkyServe transparently handles the load balancing, recovery, and autoscaling of the replicas. + +.. GPU availability has become a critical bottleneck for many AI services. With Sky +.. Serve, we offer a lightweight control plane that simplifies deployment across +.. many cloud providers. By consolidating availability and pricing data across +.. clouds, we ensure **timely execution at optimal costs**, addressing the +.. complexities of managing resources in a multi-cloud environment. + + +.. 
SkyServe provides a simple CLI interface to deploy and manage your services. It +.. features a simple YAML spec to describe your services (referred to as a *service +.. YAML* in the following) and a centralized controller to manage the deployments. + + +.. tip:: + + To get started with SkyServe, use the nightly build of SkyPilot: ``pip install -U skypilot-nightly`` + +Quick tour: LLM serving +------------------------------- + +Here is a simple example of serving a Vicuna-13B LLM model on TGI with SkyServe: + +.. code-block:: yaml + + service: + readiness_probe: /health + replicas: 2 + + # Fields below describe each replica. + resources: + ports: 8080 + accelerators: A100:1 + + run: | + docker run --gpus all --shm-size 1g -p 8080:80 -v ~/data:/data \ + ghcr.io/huggingface/text-generation-inference \ + --model-id lmsys/vicuna-13b-v1.5 + +Use :code:`sky serve status` to check the status of the service: + +.. image:: ../images/sky-serve-status-tgi.png + :width: 800 + :align: center + :alt: sky-serve-status-tgi + +.. raw:: html + +
Once the :code:`STATUS` column shows :code:`READY`, the service is ready to accept traffic!

Simply ``curl`` the service endpoint, which automatically load-balances across the two replicas --- for the above example, the endpoint is ``44.211.131.51:30001``:

.. code-block:: console

   $ curl -L <endpoint>/generate \
       -X POST \
       -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
       -H 'Content-Type: application/json'
   # Example output:
   {"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to model and solve"}

Tutorial: Hello, SkyServe!
---------------------------

Here we will go through an example of deploying a simple HTTP server with SkyServe. To spin up a service, you can simply reuse your task YAML, with two requirements:

#. An HTTP endpoint (launched in the ``run`` commands) and the port on which it listens;
#. An extra :code:`service` section in your task YAML that describes the service configuration.

It is recommended to test the task with :code:`sky launch` first. For example, the following task YAML works with :code:`sky launch`:

.. code-block:: yaml

   resources:
     ports: 8080
     cpus: 2

   workdir: .

   run: python -m http.server 8080

And in the same directory, we have an :code:`index.html`:

.. code-block:: html

   <html>
   <head>
     <title>My First SkyServe Service</title>
   </head>
   <body>
     <p>Hello, SkyServe!</p>
   </body>
   </html>
.. note::

   :ref:`workdir ` and :ref:`file mounts with local files ` will be automatically uploaded to
   :ref:`SkyPilot Storage `. A cloud bucket will be created for the upload, and cleaned up after the service is terminated.

Notice that the task YAML already has a running HTTP endpoint at 8080, exposed
through the :code:`ports` section under :code:`resources`. Suppose we want to
scale it to multiple replicas across multiple regions/clouds with SkyServe. We
can simply add a :code:`service` section to the YAML:

.. code-block:: yaml
   :emphasize-lines: 2-4

   # hello-sky-serve.yaml
   service:
     readiness_probe: /
     replicas: 2

   resources:
     ports: 8080
     cpus: 2

   workdir: .

   run: python -m http.server 8080

This example will spin up two replicas of the service, each listening on port
8080. A replica is considered ready when it responds to :code:`GET /` with a
200 status code. You can customize the readiness probe by specifying a
different path in the :code:`readiness_probe` field. You can find more
configurations in the :ref:`Service YAML Specification <service-yaml-spec>`.

Use ``sky serve up`` to spin up the service:

.. code-block:: console

   $ sky serve up hello-sky-serve.yaml

SkyServe will start (or reuse) a centralized controller/load balancer and
deploy the service replicas to the cloud location(s) with the best price and
availability. SkyServe will also monitor the service status and launch a new
replica if one of them fails.

Under the hood, :code:`sky serve up`:

#. Launches a controller which handles autoscaling, monitoring, and load balancing;
#. Returns a Service Endpoint which will be used to accept traffic;
#. Meanwhile, the controller provisions replica VMs which later run the service;
#. Once any replica is ready, requests sent to the Service Endpoint are **HTTP-redirected** to one of the replica endpoints.

After the controller is provisioned, you'll see the following in the :code:`sky serve status` output:

.. image:: ../images/sky-serve-status-output-provisioning.png
   :width: 800
   :align: center
   :alt: sky-serve-status-output-provisioning
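The **HTTP-redirect** step above can be illustrated with a minimal, self-contained sketch: a tiny HTTP server that answers every request with a ``307`` redirect to the next replica endpoint, round-robin. This is an illustration under assumed names (``make_redirect_handler`` and the replica addresses are hypothetical), not SkyServe's actual load balancer:

```python
import itertools
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_redirect_handler(replicas):
    """Build a request handler that 307-redirects round-robin to `replicas`."""
    chooser = itertools.cycle(replicas)  # round-robin over replica endpoints

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # 307 preserves the HTTP method and body on redirect, so POST
            # requests (e.g., /generate) survive the hop as well.
            self.send_response(307)
            self.send_header('Location', f'http://{next(chooser)}{self.path}')
            self.end_headers()

        do_POST = do_GET

        def log_message(self, *args):  # keep the sketch quiet
            pass

    return RedirectHandler

if __name__ == '__main__':
    # Hypothetical replica endpoints; in SkyServe these are the replica VMs.
    handler = make_redirect_handler(['10.0.0.1:8080', '10.0.0.2:8080'])
    server = HTTPServer(('127.0.0.1', 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    print(f'Load balancer sketch listening on port {server.server_port}')
```

Because the balancer replies with a redirect rather than proxying the request, clients must follow redirects --- which is exactly why the ``curl`` examples in this page use ``-L``.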
You can use ``watch`` to monitor the service status:

.. code-block:: console

   $ watch -n10 sky serve status

Once any of the replicas becomes ready to serve (``READY``), you can start
sending requests to :code:`<endpoint>` (e.g., ``44.201.119.3:30001``):

.. code-block:: console

   $ curl -L <endpoint>

   <html>
   <head>
     <title>My First SkyServe Service</title>
   </head>
   <body>
     <p>Hello, SkyServe!</p>
   </body>
   </html>
.. note::

   Since we are using HTTP redirects, we need to use :code:`curl -L
   <endpoint>`. The :code:`curl` command by default does not follow
   redirects.

Tutorial: Serve a Chatbot LLM!
------------------------------

Let's bring up a real LLM chat service with FastChat + Vicuna. We'll use the `Vicuna OpenAI API Endpoint YAML `_ as an example:

.. code-block:: yaml

   resources:
     ports: 8080
     accelerators: A100:1
     disk_size: 1024
     disk_tier: high

   setup: |
     conda activate chatbot
     if [ $? -ne 0 ]; then
       conda create -n chatbot python=3.9 -y
       conda activate chatbot
     fi

     # Install dependencies
     pip install "fschat[model_worker,webui]==0.2.24"
     pip install protobuf

   run: |
     conda activate chatbot

     echo 'Starting controller...'
     python -u -m fastchat.serve.controller > ~/controller.log 2>&1 &
     sleep 10
     echo 'Starting model worker...'
     python -u -m fastchat.serve.model_worker \
       --model-path lmsys/vicuna-${MODEL_SIZE}b-v1.3 2>&1 \
       | tee model_worker.log &

     echo 'Waiting for model worker to start...'
     while ! grep -q 'Uvicorn running on' model_worker.log; do sleep 1; done

     echo 'Starting openai api server...'
     python -u -m fastchat.serve.openai_api_server \
       --host 0.0.0.0 --port 8080 | tee ~/openai_api_server.log

   envs:
     MODEL_SIZE: 7

The above SkyPilot task YAML launches an OpenAI API endpoint serving a Vicuna 7B
model. This YAML can be used with a regular :code:`sky launch` to launch a single
instance of the service.

However, by adding a :code:`service` section to the YAML, we can scale it up to
multiple replicas across multiple regions/clouds:

.. code-block:: yaml
   :emphasize-lines: 2-4

   # vicuna.yaml
   service:
     readiness_probe: /v1/models
     replicas: 3

   resources:
     ports: 8080
     # Here goes other resources config

   # Here goes other task config

Now we have a Service YAML that can be used with SkyServe! Simply run

.. code-block:: console

   $ sky serve up vicuna.yaml -n vicuna

to deploy the service (use :code:`-n` to give your service a name!). After a while, an OpenAI-compatible API endpoint will be ready to accept traffic (:code:`44.201.113.28:30001` in the following example):

.. image:: ../images/sky-serve-status-vicuna-ready.png
   :width: 800
   :align: center
   :alt: sky-serve-status-vicuna-ready
Send a request using the following cURL command:

.. code-block:: console

   $ curl -L http://<endpoint>/v1/chat/completions \
       -X POST \
       -d '{"model":"vicuna-7b-v1.3","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who are you?"}],"temperature":0}' \
       -H 'Content-Type: application/json'
   # Example output:
   {"id":"chatcmpl-gZ8SfgUwcm9Xjbuv4xfefq","object":"chat.completion","created":1702082533,"model":"vicuna-7b-v1.3","choices":[{"index":0,"message":{"role":"assistant","content":"I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."},"finish_reason":"stop"}],"usage":{"prompt_tokens":19,"total_tokens":43,"completion_tokens":24}}

You can also use a simple chatbot Python script to send requests:

.. code-block:: python

   import openai

   stream = True
   model = 'vicuna-7b-v1.3'  # This is aligned with the MODEL_SIZE env in the YAML.
   init_prompt = 'You are a helpful assistant.'
   history = [{'role': 'system', 'content': init_prompt}]
   endpoint = input('Endpoint: ')
   openai.api_base = f'http://{endpoint}/v1'
   openai.api_key = 'placeholder'

   try:
       while True:
           user_input = input('[User] ')
           history.append({'role': 'user', 'content': user_input})
           resp = openai.ChatCompletion.create(model=model,
                                               messages=history,
                                               stream=stream)
           print('[Chatbot]', end='', flush=True)
           tot = ''
           for i in resp:
               dlt = i['choices'][0]['delta']
               if 'content' not in dlt:
                   continue
               print(dlt['content'], end='', flush=True)
               tot += dlt['content']
           print()
           history.append({'role': 'assistant', 'content': tot})
   except KeyboardInterrupt:
       print('\nBye!')

Useful CLIs
-----------

Here are some useful commands for SkyServe. Check :code:`sky serve --help` for more details.

See all running services:

.. code-block:: console

   $ sky serve status

.. image:: ../images/sky-serve-status-full.png
   :width: 800
   :align: center
   :alt: sky-serve-status-full
Stream the logs of a service:

.. code-block:: console

   $ sky serve logs vicuna 1 # tail the logs of replica 1, including provisioning and running logs
   $ sky serve logs vicuna --controller # tail the controller logs
   $ sky serve logs vicuna --load-balancer --no-follow # print the load balancer logs so far, and exit

Terminate services:

.. code-block:: console

   $ sky serve down http-server # terminate the http-server service
   $ sky serve down --all # terminate all services

SkyServe Architecture
---------------------

.. image:: ../images/sky-serve-architecture.png
   :width: 800
   :align: center
   :alt: SkyServe Architecture

SkyServe has a centralized controller VM that manages the deployment of your services. Each service has a process group that manages its replicas and routes traffic to them.

Each process group is composed of the following components:

#. **Controller**: The controller monitors the status of the replicas and launches a new replica if one of them fails. It also autoscales the number of replicas if an autoscaling config is set (see the :ref:`Service YAML spec <service-yaml-spec>` for more information).
#. **Load Balancer**: The load balancer routes traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and **HTTP-redirects** requests to one of the replicas.

All of these process groups share a single controller VM. The controller VM is launched in the cloud with the best price/performance ratio. You can also :ref:`customize the controller resources <customizing-sky-serve-controller-resources>` based on your needs.

SkyServe controller
-------------------

The SkyServe controller is a small on-demand CPU VM running in the cloud that:

#. Manages the deployment of your services;
#. Monitors the status of your services;
#. Routes traffic to your service replicas.

It is automatically launched when the first service is deployed, and it is autostopped after it has been idle for 10 minutes (i.e., after all services are terminated).
Thus, **no user action is needed** to manage its lifecycle.

You can see the controller with :code:`sky status` and refresh its status with the :code:`-r/--refresh` flag.

.. _customizing-sky-serve-controller-resources:

Customizing SkyServe controller resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may want to customize the resources of the SkyServe controller for several reasons:

1. Using a lower-cost controller (if you only have a few services running).
2. Enforcing the controller to run in a specific location. This is particularly useful when you want the service endpoint within a specific geographical region. (Default: cheapest location)
3. Changing the maximum number of services that can run concurrently, which is the minimum of 4x the controller's vCPUs and its memory in GiB. (Default: 16)
4. Changing the disk_size of the controller to store more logs. (Default: 200 GB)

To achieve the above, you can specify custom configs in :code:`~/.sky/config.yaml` with the following fields:

.. code-block:: yaml

   serve:
     # NOTE: these settings only take effect for a new SkyServe controller, not if
     # you have an existing one.
     controller:
       resources:
         # All configs below are optional.
         # Specify the location of the SkyServe controller.
         cloud: gcp
         region: us-central1
         # Specify the maximum number of services that can be run concurrently.
         cpus: 2+ # number of vCPUs, max concurrent services = min(4 * cpus, memory in GiB)
         # Specify the disk_size in GB of the SkyServe controller.
         disk_size: 1024

The :code:`resources` field has the same spec as a normal SkyPilot job; see `here `__.

.. note::
   These settings will not take effect if you have an existing controller (either
   stopped or live). For them to take effect, tear down the existing controller
   first, which requires all services to be terminated.
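As a quick sanity check on the concurrency formula above, the maximum number of concurrent services can be written as a tiny helper (``max_concurrent_services`` is a hypothetical name for illustration, not a SkyPilot API):

```python
def max_concurrent_services(vcpus, memory_gib):
    """min(4 * vCPUs, memory in GiB), per the concurrency rule above."""
    return min(4 * vcpus, memory_gib)

# For example, a 4-vCPU controller with 16 GiB of memory supports
# min(4 * 4, 16) = 16 concurrent services, matching the default of 16,
# while a 2-vCPU, 16 GiB controller supports min(8, 16) = 8.
```

This also explains why the ``cpus: 2+`` example in the config snippet lowers the service cap: with 2 vCPUs, the vCPU term of the minimum dominates.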