
Commit

Docs: update SkyServe docs. (#2894)
* Docs: update SkyServe docs.

* Rewording
concretevitamin authored Dec 24, 2023
1 parent d6f57cc commit 7c514ba
Showing 2 changed files with 29 additions and 26 deletions.
40 changes: 18 additions & 22 deletions docs/source/serving/service-yaml-spec.rst
@@ -10,12 +10,12 @@ Available fields:

.. code-block:: yaml
# The `service` section turns a SkyPilot task YAML into a service YAML.
service:
# Readiness probe (required). Used by SkyServe to check whether your service
# replicas are ready to accept traffic. If the readiness probe returns a 200,
# SkyServe will start routing traffic to that replica.
readiness_probe:
# Path to probe (required).
path: /v1/models
@@ -28,9 +28,9 @@ Available fields:
# based on your service's startup time.
initial_delay_seconds: 1200
# Simplified version of the readiness probe that only specifies the probe
# path. To use the GET method and the default initial delay, use the
# following syntax:
readiness_probe: /v1/models
# One of the two following fields (replica_policy or replicas) is required.
@@ -41,10 +41,10 @@ Available fields:
# Minimum number of replicas (required).
min_replicas: 1
# Maximum number of replicas (optional). If not specified, SkyServe will
# use a fixed number of replicas (the same as min_replicas) and ignore
# any QPS threshold specified below.
max_replicas: 3
# Thresholds below describe when to scale up or down.
# QPS threshold for scaling up (optional). If the QPS of your service
# exceeds this threshold, SkyServe will scale up your service by one
# replica. If not specified, SkyServe will **NOT** scale up your service.
@@ -54,23 +54,19 @@ Available fields:
# replica. If not specified, SkyServe will **NOT** scale down your service.
qps_lower_threshold: 2
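# An illustrative reading of the two thresholds above (a simplified sketch,
# not the exact autoscaler formula): with these values, service QPS above 10
# adds one replica (up to max_replicas), and QPS below 2 removes one replica
# (down to min_replicas).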
# Simplified version of the replica policy that uses a fixed number of
# replicas:
replicas: 2
# Controller resources (optional). This describes the resources to use for
# the controller. Defaults to a 4+ vCPU instance with a 100 GB disk.
controller_resources:
cloud: aws
region: us-east-1
instance_type: p3.2xlarge
disk_size: 256
# Besides the `service` section, the rest is a regular SkyPilot task YAML.
resources:
# Port to run your service on each replica (required). This port will be
# automatically exposed to the public internet by SkyServe.
ports: 8080
# Other resources config...
# Other fields of your SkyPilot task YAML...
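
For reference, a minimal service YAML assembled from the fields shown above might look like the following sketch. The run command, accelerator choice, and module name are illustrative assumptions rather than values prescribed by the spec:

.. code-block:: yaml

   service:
     # Simplified readiness probe: GET on this path, default initial delay.
     readiness_probe: /v1/models
     replica_policy:
       min_replicas: 1
       max_replicas: 3
       qps_upper_threshold: 10
       qps_lower_threshold: 2

   resources:
     ports: 8080            # Port served by each replica; exposed by SkyServe.
     accelerators: A100:1   # Assumed accelerator; any valid SkyPilot value works.

   run: |
     # Assumed placeholder; replace with whatever serves your model on port 8080.
     python -m my_model_server --port 8080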
15 changes: 11 additions & 4 deletions docs/source/serving/sky-serve.rst
@@ -22,9 +22,9 @@ Why SkyServe?
How it works:

- Each service gets an endpoint that automatically redirects requests to its replicas.
- Replicas of the same service can run in different regions and clouds — reducing cloud costs and increasing availability.
- SkyServe handles the load balancing, recovery, and autoscaling of the replicas (a minimal workflow sketch follows).
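
Put together, the basic workflow is a single launch command followed by a status check. A minimal sketch, assuming the service YAML above is saved as ``service.yaml``:

.. code-block:: console

   $ sky serve up service.yaml   # Launch the service and provision its replicas.
   $ sky serve status            # Inspect the endpoint and replica status.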

.. GPU availability has become a critical bottleneck for many AI services. With Sky
.. Serve, we offer a lightweight control plane that simplifies deployment across
@@ -74,9 +74,14 @@ Use :code:`sky serve status` to check the status of the service:

<div style="height: 20px;"></div>

.. tip::

Notice that the two replicas are launched in different regions/clouds for the lowest cost and highest GPU availability.
This is performed automatically, like a regular ``sky launch``.

If the :code:`STATUS` column shows :code:`READY`, the service is ready to accept traffic!

Simply ``curl -L`` the service endpoint --- for the above example, use
``44.211.131.51:30001`` which automatically load-balances across the two replicas:

.. code-block:: console
@@ -85,6 +90,7 @@ Simply ``curl`` the service endpoint --- for the above example, use
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
# Example output:
{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to model and solve"}
@@ -302,6 +308,7 @@ Send a request using the following cURL command:
-X POST \
-d '{"model":"vicuna-13b-v1.3","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who are you?"}],"temperature":0}' \
-H 'Content-Type: application/json'
# Example output:
{"id":"chatcmpl-gZ8SfgUwcm9Xjbuv4xfefq","object":"chat.completion","created":1702082533,"model":"vicuna-13b-v1.3","choices":[{"index":0,"message":{"role":"assistant","content":"I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."},"finish_reason":"stop"}],"usage":{"prompt_tokens":19,"total_tokens":43,"completion_tokens":24}}
