[Serve] Proxy w/ retry (#3395)
* init

* support streaming

* max retry num

* upd comments

* remove -L in documentation

* streaming smoke test. TODO: debug and make sure it works

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* comments and expose exceptions in smoke test

* upd smoke test and passed

* timeout

* yield error

* remove -L

* apply suggestions from code review

* add threading lock

* apply suggestions from code review

* comments for limit on client

* Update sky/serve/load_balancer.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/serve/load_balancer.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/serve/load_balancer.py

Co-authored-by: Zhanghao Wu <[email protected]>

* format

* retry for no replicas as well

* check disconnect if no replicas

* format

* minor

* async probe controller; close clients in the background

* async

* comments

* Update sky/serve/load_balancer.py

Co-authored-by: Zhanghao Wu <[email protected]>

* format

* fix

---------

Co-authored-by: Zhanghao Wu <[email protected]>
cblmemo and Michaelvll authored May 14, 2024
1 parent 8a0a34d commit 5a2f1b8
Showing 26 changed files with 348 additions and 131 deletions.
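At its core, this change turns the load balancer from an HTTP redirector into a proxy: each request is forwarded to a replica and the response is streamed back to the client, which also makes server-side retries possible. Below is a minimal sketch of that pattern with FastAPI and httpx; it is not the actual `sky/serve/load_balancer.py`, and `pick_replica_url` and the addresses are illustrative placeholders.

```python
# Minimal proxy-with-streaming sketch (NOT the real sky/serve/load_balancer.py).
import fastapi
import httpx
import uvicorn
from fastapi.responses import StreamingResponse
from starlette.background import BackgroundTask

app = fastapi.FastAPI()
client = httpx.AsyncClient(timeout=None)


def pick_replica_url(path: str) -> str:
    # Placeholder for the load-balancing policy: a real implementation would
    # choose among the ready replicas reported by the controller.
    return 'http://127.0.0.1:8001' + path


@app.api_route('/{path:path}', methods=['GET', 'POST', 'PUT', 'DELETE'])
async def proxy(request: fastapi.Request, path: str):
    url = pick_replica_url('/' + path)
    body = await request.body()
    # Drop the client's Host header; httpx sets the correct one for the replica.
    headers = {k: v for k, v in request.headers.items() if k.lower() != 'host'}
    # Forward the request to the chosen replica and stream the response back,
    # so token-by-token LLM output reaches the client as it is produced
    # (an HTTP redirect cannot do this, and cannot be retried by the server).
    replica_request = client.build_request(request.method,
                                           url,
                                           headers=headers,
                                           content=body)
    replica_response = await client.send(replica_request, stream=True)
    return StreamingResponse(replica_response.aiter_raw(),
                             status_code=replica_response.status_code,
                             headers=dict(replica_response.headers),
                             background=BackgroundTask(replica_response.aclose))


if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=30001)
```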
22 changes: 8 additions & 14 deletions docs/source/serving/sky-serve.rst
@@ -22,7 +22,7 @@ Why SkyServe?
How it works:

- Each service gets an endpoint that automatically redirects requests to its replicas.
- Each service gets an endpoint that automatically distributes requests to its replicas.
- Replicas of the same service can run in different regions and clouds — reducing cloud costs and increasing availability.
- SkyServe handles the load balancing, recovery, and autoscaling of the replicas.

@@ -127,7 +127,7 @@ Run :code:`sky serve up service.yaml` to deploy the service with automatic price

If you see the :code:`STATUS` column becomes :code:`READY`, then the service is ready to accept traffic!

Simply ``curl -L`` the service endpoint, which automatically load-balances across the two replicas:
Simply ``curl`` the service endpoint, which automatically load-balances across the two replicas:

.. tab-set::

@@ -136,7 +136,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros

.. code-block:: console
$ curl -L 3.84.15.251:30001/v1/chat/completions \
$ curl 3.84.15.251:30001/v1/chat/completions \
-X POST \
-d '{"model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "messages": [{"role": "user", "content": "Who are you?"}]}' \
-H 'Content-Type: application/json'
@@ -149,7 +149,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros

.. code-block:: console
$ curl -L 44.211.131.51:30001/generate \
$ curl 44.211.131.51:30001/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
@@ -240,7 +240,7 @@ Under the hood, :code:`sky serve up`:
#. Launches a controller which handles autoscaling, monitoring and load balancing;
#. Returns a Service Endpoint which will be used to accept traffic;
#. Meanwhile, the controller provisions replica VMs which later run the services;
#. Once any replica is ready, the requests sent to the Service Endpoint will be **HTTP-redirect** to one of the endpoint replicas.
#. Once any replica is ready, the requests sent to the Service Endpoint will be distributed to one of the endpoint replicas.

After the controller is provisioned, you'll see the following in :code:`sky serve status` output:

@@ -264,7 +264,7 @@ sending requests to :code:`<endpoint-url>` (e.g., ``44.201.119.3:30001``):

.. code-block:: console
$ curl -L <endpoint-url>
$ curl <endpoint-url>
<html>
<head>
<title>My First SkyServe Service</title>
@@ -274,12 +274,6 @@ sending requests to :code:`<endpoint-url>` (e.g., ``44.201.119.3:30001``):
</body>
</html>
.. note::

Since we are using HTTP-redirect, we need to use :code:`curl -L
<endpoint-url>`. The :code:`curl` command by default won't follow the
redirect.

Tutorial: Serve a Chatbot LLM!
------------------------------

@@ -368,7 +362,7 @@ Send a request using the following cURL command:

.. code-block:: console
$ curl -L http://<endpoint-url>/v1/chat/completions \
$ curl http://<endpoint-url>/v1/chat/completions \
-X POST \
-d '{"model":"vicuna-7b-v1.3","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who are you?"}],"temperature":0}' \
-H 'Content-Type: application/json'
@@ -468,7 +462,7 @@ SkyServe has a centralized controller VM that manages the deployment of your ser
It is composed of the following components:

#. **Controller**: The controller will monitor the status of the replicas and re-launch a new replica if one of them fails. It also autoscales the number of replicas if autoscaling config is set (see :ref:`Service YAML spec <service-yaml-spec>` for more information).
#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and **HTTP-redirects** the requests to one of the replicas.
#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and distributes the requests to one of the replicas.

All of the process group shares a single controller VM. The controller VM will be launched in the cloud with the best price/performance ratio. You can also :ref:`customize the controller resources <customizing-sky-serve-controller-resources>` based on your needs.

2 changes: 1 addition & 1 deletion examples/cog/README.md
@@ -28,7 +28,7 @@ After the service is launched, access the deployment with the following:
```console
ENDPOINT=$(sky serve status --endpoint cog)

curl -L http://$ENDPOINT/predictions -X POST \
curl http://$ENDPOINT/predictions -X POST \
-H 'Content-Type: application/json' \
-d '{"input": {"image": "https://blog.skypilot.co/introducing-sky-serve/images/sky-serve-thumbnail.png"}}' \
| jq -r '.output | split(",")[1]' | base64 --decode > output.png
4 changes: 2 additions & 2 deletions examples/serve/misc/cancel/README.md
@@ -1,6 +1,6 @@
# SkyServe cancel example

This example demonstrates the redirect support canceling a request.
This example demonstrates that the SkyServe load balancer supports canceling a request.

## Running the example

@@ -33,7 +33,7 @@ Client disconnected, stopping computation.
You can also run

```bash
curl -L http://<endpoint>/
curl http://<endpoint>/
```

and manually Ctrl + C to cancel the request and see logs.
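The behavior this example exercises can be sketched as a request handler that periodically checks whether the client is still connected and stops its work early otherwise. The snippet below uses FastAPI/Starlette's `Request.is_disconnected()`; the endpoint and the simulated computation are made up for illustration.

```python
# Sketch: stop server-side work as soon as the client cancels the request.
import asyncio

import fastapi

app = fastapi.FastAPI()


@app.get('/')
async def compute(request: fastapi.Request):
    for step in range(30):
        if await request.is_disconnected():
            # Matches the log line shown above.
            print('Client disconnected, stopping computation.')
            return fastapi.Response(status_code=499)
        await asyncio.sleep(1)  # Stand-in for one unit of real work.
        print(f'Computing... step {step} finished.')
    return {'result': 'all steps finished'}
```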
2 changes: 1 addition & 1 deletion examples/serve/stable_diffusion_service.yaml
@@ -18,7 +18,7 @@ file_mounts:
/stable_diffusion: examples/stable_diffusion

setup: |
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo curl "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
cd stable-diffusion-webui-docker
sudo rm -r stable-diffusion-webui-docker
2 changes: 1 addition & 1 deletion examples/stable_diffusion/stable_diffusion_docker.yaml
@@ -7,7 +7,7 @@ file_mounts:
/stable_diffusion: .

setup: |
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo curl "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
cd stable-diffusion-webui-docker
sudo rm -r stable-diffusion-webui-docker
6 changes: 3 additions & 3 deletions llm/codellama/README.md
@@ -68,7 +68,7 @@ Launching a cluster 'code-llama'. Proceed? [Y/n]:
```bash
IP=$(sky status --ip code-llama)

curl -L http://$IP:8000/v1/completions \
curl http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "codellama/CodeLlama-70b-Instruct-hf",
@@ -131,7 +131,7 @@ availability of the service while minimizing the cost.
```bash
ENDPOINT=$(sky serve status --endpoint code-llama)

curl -L http://$ENDPOINT/v1/completions \
curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "codellama/CodeLlama-70b-Instruct-hf",
@@ -146,7 +146,7 @@ We can also access the Code Llama service with the openAI Chat API.
```bash
ENDPOINT=$(sky serve status --endpoint code-llama)

curl -L http://$ENDPOINT/v1/chat/completions \
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "codellama/CodeLlama-70b-Instruct-hf",
2 changes: 1 addition & 1 deletion llm/dbrx/README.md
@@ -256,7 +256,7 @@ ENDPOINT=$(sky serve status --endpoint dbrx)
To curl the endpoint:
```console
curl -L $ENDPOINT/v1/chat/completions \
curl $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "databricks/dbrx-instruct",
8 changes: 4 additions & 4 deletions llm/gemma/README.md
@@ -37,7 +37,7 @@ After the cluster is launched, we can access the model with the following comman
```bash
IP=$(sky status --ip gemma)

curl -L http://$IP:8000/v1/completions \
curl http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-7b-it",
@@ -50,7 +50,7 @@ Chat API is also supported:
```bash
IP=$(sky status --ip gemma)

curl -L http://$IP:8000/v1/chat/completions \
curl http://$IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-7b-it",
@@ -78,7 +78,7 @@ After the cluster is launched, we can access the model with the following comman
```bash
ENDPOINT=$(sky serve status --endpoint gemma)

curl -L http://$ENDPOINT/v1/completions \
curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-7b-it",
@@ -89,7 +89,7 @@ curl -L http://$ENDPOINT/v1/completions \

Chat API is also supported:
```bash
curl -L http://$ENDPOINT/v1/chat/completions \
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-7b-it",
8 changes: 4 additions & 4 deletions llm/mixtral/README.md
@@ -53,7 +53,7 @@ We can now access the model through the OpenAI API with the IP and port:
```bash
IP=$(sky status --ip mixtral)

curl -L http://$IP:8000/v1/completions \
curl http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -66,7 +66,7 @@ Chat API is also supported:
```bash
IP=$(sky status --ip mixtral)

curl -L http://$IP:8000/v1/chat/completions \
curl http://$IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -119,7 +119,7 @@ After the `sky serve up` command, there will be a single endpoint for the servic
```bash
ENDPOINT=$(sky serve status --endpoint mixtral)
curl -L http://$ENDPOINT/v1/completions \
curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -132,7 +132,7 @@ Chat API is also supported:
```bash
ENDPOINT=$(sky serve status --endpoint mixtral)
curl -L http://$ENDPOINT/v1/chat/completions \
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
8 changes: 4 additions & 4 deletions llm/qwen/README.md
@@ -34,7 +34,7 @@ sky launch -c qwen serve-110b.yaml
```bash
IP=$(sky status --ip qwen)

curl -L http://$IP:8000/v1/completions \
curl http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen1.5-110B-Chat",
@@ -45,7 +45,7 @@ curl -L http://$IP:8000/v1/completions \

3. Send a request for chat completion:
```bash
curl -L http://$IP:8000/v1/chat/completions \
curl http://$IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen1.5-110B-Chat",
@@ -92,11 +92,11 @@ As shown, the service is now backed by 2 replicas, one on Azure and one on GCP,
type is chosen to be **the cheapest available one** on the clouds. That said, it maximizes the
availability of the service while minimizing the cost.

3. To access the model, we use a `curl -L` command (`-L` to follow redirect) to send the request to the endpoint:
3. To access the model, we use a `curl` command to send the request to the endpoint:
```bash
ENDPOINT=$(sky serve status --endpoint qwen)

curl -L http://$ENDPOINT/v1/chat/completions \
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen1.5-72B-Chat",
4 changes: 2 additions & 2 deletions llm/sglang/README.md
@@ -68,7 +68,7 @@ ENDPOINT=$(sky serve status --endpoint sglang-llava)
</figure>

```bash
curl -L $ENDPOINT/v1/chat/completions \
curl $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "liuhaotian/llava-v1.6-vicuna-7b",
@@ -149,7 +149,7 @@ ENDPOINT=$(sky serve status --endpoint sglang-llama2)
4. Once its status is `READY`, you can use the endpoint to interact with the model:

```bash
curl -L $ENDPOINT/v1/chat/completions \
curl $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
4 changes: 2 additions & 2 deletions llm/tgi/README.md
@@ -17,7 +17,7 @@ A user can access the model with the following command:
```bash
ENDPOINT=$(sky status --endpoint 8080 tgi)

curl -L $(sky serve status tgi --endpoint)/generate \
curl $(sky serve status tgi --endpoint)/generate \
-H 'Content-Type: application/json' \
-d '{
"inputs": "What is Deep Learning?",
@@ -51,7 +51,7 @@ After the service is launched, we can access the model with the following comman
```bash
ENDPOINT=$(sky serve status --endpoint tgi)

curl -L $ENDPOINT/generate \
curl $ENDPOINT/generate \
-H 'Content-Type: application/json' \
-d '{
"inputs": "What is Deep Learning?",
4 changes: 2 additions & 2 deletions llm/vllm/README.md
@@ -154,7 +154,7 @@ ENDPOINT=$(sky serve status --endpoint vllm-llama2)
4. Once its status is `READY`, you can use the endpoint to interact with the model:

```bash
curl -L $ENDPOINT/v1/chat/completions \
curl $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
@@ -171,7 +171,7 @@ curl -L $ENDPOINT/v1/chat/completions \
}'
```

Notice that it is the same with previously curl command, except for thr `-L` argument. You should get a similar response as the following:
Notice that it is the same as the previous curl command. You should get a response similar to the following:

```console
{
6 changes: 3 additions & 3 deletions sky/serve/README.md
@@ -2,7 +2,7 @@

Serving library for SkyPilot.

The goal of Sky Serve is simple - expose one endpoint, that redirects to serving endpoints running on different resources, regions and clouds.
The goal of Sky Serve is simple - exposing one endpoint that distributes any incoming traffic to serving endpoints running on different resources, regions, and clouds.

Sky Serve transparently handles load balancing, failover and autoscaling of the serving endpoints.

@@ -11,8 +11,8 @@ Sky Serve transparently handles load balancing, failover and autoscaling of the
![Architecture](../../docs/source/images/sky-serve-architecture.png)

Sky Serve has four key components:
1. Redirector - receiving requests and redirecting them to healthy endpoints.
2. Load balancers - spread requests across healthy endpoints according to different policies.
1. Load Balancers - receiving requests and distributing them to healthy endpoints.
2. Load Balancing Policies - spread requests across healthy endpoints according to different policies.
3. Autoscalers - scale up and down the number of serving endpoints according to different policies.
4. Replica Managers - monitoring replica status and handle recovery of unhealthy endpoints.

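The "load balancing policy" component described above can be pictured as a small class that owns the list of ready replicas and decides which one serves the next request. A round-robin version is sketched below for illustration only; it is not SkyServe's actual policy implementation.

```python
# Illustrative round-robin load balancing policy (not SkyServe's real class).
import threading
from typing import List, Optional


class RoundRobinPolicy:
    """Cycle through ready replica URLs, one request at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._replicas: List[str] = []
        self._index = 0

    def set_ready_replicas(self, replicas: List[str]) -> None:
        # Called whenever the controller reports a new set of ready replicas.
        with self._lock:
            if replicas != self._replicas:
                self._replicas = list(replicas)
                self._index = 0

    def select_replica(self) -> Optional[str]:
        # Returns None when no replica is ready; the caller can retry later.
        with self._lock:
            if not self._replicas:
                return None
            replica = self._replicas[self._index]
            self._index = (self._index + 1) % len(self._replicas)
            return replica
```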
12 changes: 12 additions & 0 deletions sky/serve/constants.py
@@ -21,6 +21,18 @@
# interval.
LB_CONTROLLER_SYNC_INTERVAL_SECONDS = 20

# The maximum retry times for load balancer for each request. After changing to
# proxy implementation, we do retry for failed requests.
# TODO(tian): Expose this option to users in yaml file.
LB_MAX_RETRY = 3

# The timeout in seconds for load balancer to wait for a response from replica.
# Large LLMs like Llama2-70b are able to process the request within ~30 seconds.
# We set the timeout to 120s to be safe. For reference, FastChat uses 100s:
# https://github.com/lm-sys/FastChat/blob/f2e6ca964af7ad0585cadcf16ab98e57297e2133/fastchat/constants.py#L39 # pylint: disable=line-too-long
# TODO(tian): Expose this option to users in yaml file.
LB_STREAM_TIMEOUT = 120

# Interval in seconds to probe replica endpoint.
ENDPOINT_PROBE_INTERVAL_SECONDS = 10

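For illustration, here is a rough sketch of how `LB_MAX_RETRY` and `LB_STREAM_TIMEOUT` could be combined in the proxy's request path. The `pick_replica` callback and the helper name are assumptions, not the actual load balancer code.

```python
# Sketch: retry a request against different replicas, bounded by LB_MAX_RETRY.
import asyncio
from typing import Callable, Optional

import httpx

LB_MAX_RETRY = 3         # from sky/serve/constants.py
LB_STREAM_TIMEOUT = 120  # seconds, from sky/serve/constants.py


async def send_with_retry(client: httpx.AsyncClient, method: str, path: str,
                          body: bytes,
                          pick_replica: Callable[[], Optional[str]]
                          ) -> httpx.Response:
    """Try up to LB_MAX_RETRY replicas before giving up."""
    last_exc: Exception = RuntimeError('No ready replicas.')
    for _ in range(LB_MAX_RETRY):
        replica_url = pick_replica()
        if replica_url is None:
            # No replica is ready yet; wait briefly before the next attempt.
            await asyncio.sleep(1)
            continue
        try:
            request = client.build_request(method,
                                           replica_url + path,
                                           content=body,
                                           timeout=LB_STREAM_TIMEOUT)
            # stream=True so large / token-by-token responses are not buffered.
            return await client.send(request, stream=True)
        except httpx.RequestError as e:
            # Covers connection errors and timeouts, e.g. when a replica was
            # just preempted or is still starting up; try another replica.
            last_exc = e
    raise last_exc
```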
2 changes: 1 addition & 1 deletion sky/serve/core.py
@@ -285,7 +285,7 @@ def up(
f'{backend_utils.BOLD}watch -n10 sky serve status {service_name}'
f'{backend_utils.RESET_BOLD}'
'\nTo send a test request:\t\t'
f'{backend_utils.BOLD}curl -L {endpoint}'
f'{backend_utils.BOLD}curl {endpoint}'
f'{backend_utils.RESET_BOLD}'
'\n'
f'\n{fore.GREEN}SkyServe is spinning up your service now.'