From d59f82b8cd85c8e3a6419ea07860dfd01cbe7475 Mon Sep 17 00:00:00 2001 From: cblmemo Date: Tue, 16 Jan 2024 00:23:53 -0800 Subject: [PATCH 01/10] init --- docs/source/serving/autoscaling.rst | 6 ++++++ docs/source/serving/service-yaml-spec.rst | 2 +- docs/source/serving/sky-serve.rst | 2 +- 3 files changed, 8 insertions(+), 2 deletions(-) create mode 100644 docs/source/serving/autoscaling.rst diff --git a/docs/source/serving/autoscaling.rst b/docs/source/serving/autoscaling.rst new file mode 100644 index 00000000000..8dcac4b568b --- /dev/null +++ b/docs/source/serving/autoscaling.rst @@ -0,0 +1,6 @@ +.. _serve-autoscaling: + +Autoscaling in SkyServe +======================= + +SkyServe provides Out-of-the-box autoscaling for your services. diff --git a/docs/source/serving/service-yaml-spec.rst b/docs/source/serving/service-yaml-spec.rst index dd89e694ece..2a1ee29df11 100644 --- a/docs/source/serving/service-yaml-spec.rst +++ b/docs/source/serving/service-yaml-spec.rst @@ -1,7 +1,7 @@ .. _service-yaml-spec: Service YAML -========================== +============ SkyServe provides an intuitive YAML interface to specify a service. It is highly similar to the :ref:`SkyPilot task YAML `: with an additional service section in your original task YAML, you could change it to a service YAML. diff --git a/docs/source/serving/sky-serve.rst b/docs/source/serving/sky-serve.rst index fd327d3e489..4dfef414645 100644 --- a/docs/source/serving/sky-serve.rst +++ b/docs/source/serving/sky-serve.rst @@ -12,7 +12,7 @@ Why SkyServe? * **Bring any serving framework** (vLLM, TGI, FastAPI, ...) 
and scale it across regions/clouds * **Reduce costs and increase availability** of service replicas by leveraging multiple/cheaper locations and hardware (spot instances) -* **Out-of-the-box load-balancing and autoscaling** of service replicas +* **Out-of-the-box load-balancing and :ref:`autoscaling `** of service replicas * Manage multi-cloud, multi-region deployments with a single control plane * **Privacy**: Everything is launched inside your cloud accounts and VPCs From c2ca109978d807249e9da676074241a924f907b8 Mon Sep 17 00:00:00 2001 From: cblmemo Date: Tue, 16 Jan 2024 02:16:04 -0800 Subject: [PATCH 02/10] add doc --- docs/source/index.rst | 1 + docs/source/serving/autoscaling.rst | 64 +++++++++++++++++++++++++++-- docs/source/serving/sky-serve.rst | 2 +- 3 files changed, 63 insertions(+), 4 deletions(-) diff --git a/docs/source/index.rst b/docs/source/index.rst index 0ca5d5f3410..b132046dbf6 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -116,6 +116,7 @@ Documentation serving/sky-serve serving/service-yaml-spec + serving/autoscaling .. toctree:: :maxdepth: 1 diff --git a/docs/source/serving/autoscaling.rst b/docs/source/serving/autoscaling.rst index 8dcac4b568b..572a794bfba 100644 --- a/docs/source/serving/autoscaling.rst +++ b/docs/source/serving/autoscaling.rst @@ -1,6 +1,64 @@ .. _serve-autoscaling: -Autoscaling in SkyServe -======================= +Autoscaling +=========== -SkyServe provides Out-of-the-box autoscaling for your services. +SkyServe provides out-of-the-box autoscaling for your services. In a regular SkyServe Service, number of replica to launch is specified in the service section: + +.. code-block:: yaml + :emphasize-lines: 3 + + service: + readiness_probe: / + replicas: 2 + + # ... + +In this case, SkyServe will launch 2 replicas of your service. However, this deployment is fixed and cannot response to dynamic traffics. SkyServe provides autoscaling feature to help you scale your service up and down based on the traffic. 
+ +Minimal Example +--------------- + +Following is a minimal example to enable autoscaling for your service: + +.. code-block:: yaml + :emphasize-lines: 3-6 + + service: + readiness_probe: / + replica_policy: + min_replicas: 2 + max_replicas: 10 + target_qps_per_replica: 3 + + # ... + +In this example, SkyServe will launch 2 replicas of your service and scale up to 10 replicas if the traffic is high. The autoscaling is based on the QPS (Queries Per Second) of your service. SkyServe will scale your service so that, ultimately, each replica manages approximately :code:`target_qps_per_replica` queries per second. If the QPS is higher than 3 per replica, SkyServe will launch more replicas and scale up to 10 replicas. If the QPS is lower than 3 per replica, SkyServe will scale down the replicas to 2. Specifically, the current target number of replicas is calculated as: + +.. code-block:: python + + current_target_replicas = ceil(current_qps / target_qps_per_replica) + final_target_replicas = min(max_replicas, max(min_replicas, current_target_replicas)) + +.. tip:: + + :code:`replica` is a shortcut for :code:`replica_policy.min_replicas`. These two fields cannot be specified at the same time. + +Scaling Delay +------------- + +SkyServe will not scale up or down immediately. Instead, SkyServe will wait for a period of time before scaling up or down. This is to avoid scaling up and down too aggressive. SkyServe will only upscale or downscale your service if the QPS of your service is higher or lower than the target QPS for a period of time. The default scaling delay is 300s for upscale and 1200s for downscale. You can change the scaling delay by specifying the :code:`upscale_delay_seconds` and :code:`downscale_delay_seconds` field in the autoscaling section: + +.. 
code-block:: yaml + :emphasize-lines: 7-8 + + service: + readiness_probe: / + replica_policy: + min_replicas: 2 + max_replicas: 10 + target_qps_per_replica: 3 + upscale_delay_seconds: 600 + downscale_delay_seconds: 1800 + + # ... diff --git a/docs/source/serving/sky-serve.rst b/docs/source/serving/sky-serve.rst index 4dfef414645..67531a17a1c 100644 --- a/docs/source/serving/sky-serve.rst +++ b/docs/source/serving/sky-serve.rst @@ -12,7 +12,7 @@ Why SkyServe? * **Bring any serving framework** (vLLM, TGI, FastAPI, ...) and scale it across regions/clouds * **Reduce costs and increase availability** of service replicas by leveraging multiple/cheaper locations and hardware (spot instances) -* **Out-of-the-box load-balancing and :ref:`autoscaling `** of service replicas +* **Out-of-the-box** load-balancing and :ref:`autoscaling ` of service replicas * Manage multi-cloud, multi-region deployments with a single control plane * **Privacy**: Everything is launched inside your cloud accounts and VPCs From a0ccd491719cc55845c9191bc00777d2a9d56413 Mon Sep 17 00:00:00 2001 From: Tian Xia Date: Thu, 18 Jan 2024 00:51:11 +0800 Subject: [PATCH 03/10] Apply suggestions from code review Co-authored-by: Ziming Mao --- docs/source/serving/autoscaling.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/serving/autoscaling.rst b/docs/source/serving/autoscaling.rst index 572a794bfba..9c980b8125b 100644 --- a/docs/source/serving/autoscaling.rst +++ b/docs/source/serving/autoscaling.rst @@ -14,7 +14,7 @@ SkyServe provides out-of-the-box autoscaling for your services. In a regular Sky # ... -In this case, SkyServe will launch 2 replicas of your service. However, this deployment is fixed and cannot response to dynamic traffics. SkyServe provides autoscaling feature to help you scale your service up and down based on the traffic. +In this case, SkyServe will launch 2 replicas of your service. 
However, this deployment is fixed and cannot respond to dynamic traffics. SkyServe provides autoscaling feature to help you scale your service up and down based on the traffic. Minimal Example --------------- @@ -47,7 +47,7 @@ In this example, SkyServe will launch 2 replicas of your service and scale up to Scaling Delay ------------- -SkyServe will not scale up or down immediately. Instead, SkyServe will wait for a period of time before scaling up or down. This is to avoid scaling up and down too aggressive. SkyServe will only upscale or downscale your service if the QPS of your service is higher or lower than the target QPS for a period of time. The default scaling delay is 300s for upscale and 1200s for downscale. You can change the scaling delay by specifying the :code:`upscale_delay_seconds` and :code:`downscale_delay_seconds` field in the autoscaling section: +SkyServe will not scale up or down immediately. Instead, SkyServe will wait for a period of time before scaling up or down. This is to avoid scaling up and down too aggressively. SkyServe will only upscale or downscale your service if the QPS of your service is higher or lower than the target QPS for a period of time. The default scaling delay is 300s for upscale and 1200s for downscale. You can change the scaling delay by specifying the :code:`upscale_delay_seconds` and :code:`downscale_delay_seconds` field in the autoscaling section: .. 
code-block:: yaml :emphasize-lines: 7-8 From 64a9f5fabb07e0c38a66d8955708d879723e6e68 Mon Sep 17 00:00:00 2001 From: cblmemo Date: Wed, 17 Jan 2024 09:12:48 -0800 Subject: [PATCH 04/10] apply suggestions from code review --- docs/source/serving/autoscaling.rst | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/docs/source/serving/autoscaling.rst b/docs/source/serving/autoscaling.rst index 9c980b8125b..6765010efe5 100644 --- a/docs/source/serving/autoscaling.rst +++ b/docs/source/serving/autoscaling.rst @@ -29,11 +29,11 @@ Following is a minimal example to enable autoscaling for your service: replica_policy: min_replicas: 2 max_replicas: 10 - target_qps_per_replica: 3 + target_qps_per_replica: 2.5 # ... -In this example, SkyServe will launch 2 replicas of your service and scale up to 10 replicas if the traffic is high. The autoscaling is based on the QPS (Queries Per Second) of your service. SkyServe will scale your service so that, ultimately, each replica manages approximately :code:`target_qps_per_replica` queries per second. If the QPS is higher than 3 per replica, SkyServe will launch more replicas and scale up to 10 replicas. If the QPS is lower than 3 per replica, SkyServe will scale down the replicas to 2. Specifically, the current target number of replicas is calculated as: +In this example, SkyServe will launch 2 replicas of your service and scale up to 10 replicas if the traffic is high. The autoscaling is based on the QPS (Queries Per Second) of your service. SkyServe will scale your service so that, ultimately, each replica manages approximately :code:`target_qps_per_replica` queries per second; while in the same time, the final decision of replica numbers will be clipped in the range :code:`[min_replicas, max_replicas]`. This value could be a floating point as specified in the YAML above. 
If the QPS is higher than 2.5 per replica, SkyServe will launch more replicas (but no more than 10 replicas); if the QPS is lower than 2.5 per replica, SkyServe will scale down the replicas (but no less than 2 replicas). Specifically, the current target number of replicas is calculated as: .. code-block:: python @@ -44,6 +44,10 @@ In this example, SkyServe will launch 2 replicas of your service and scale up to :code:`replica` is a shortcut for :code:`replica_policy.min_replicas`. These two fields cannot be specified at the same time. +.. tip:: + + :code:`target_qps_per_replica` could be any positive floating point number. If process one request takes two seconds in one replica, using :code:`target_qps_per_replica=0.5`. + Scaling Delay ------------- @@ -62,3 +66,22 @@ SkyServe will not scale up or down immediately. Instead, SkyServe will wait for downscale_delay_seconds: 1800 # ... + +Scale Down to 0 +=============== + +If your service has a consecutive time period with no traffic, consider using :code:`min_replicas=0`: + +.. code-block:: yaml + :emphasize-lines: 4 + + service: + readiness_probe: / + replica_policy: + min_replicas: 0 + max_replicas: 3 + target_qps_per_replica: 6.3 + + # ... + +The service will scale down all replicas when there is no traffic to the system and will save costs on idle replicas. In this case, the scale up will be faster when the system has no replicas: it will **scale up immediately if any traffic detected**. 
From fc04be7eea74be18b7d26a853a3ca41059b46c9a Mon Sep 17 00:00:00 2001 From: Tian Xia Date: Thu, 18 Jan 2024 17:30:07 +0800 Subject: [PATCH 05/10] Apply suggestions from code review Co-authored-by: Ziming Mao --- docs/source/serving/autoscaling.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/serving/autoscaling.rst b/docs/source/serving/autoscaling.rst index 6765010efe5..2539f5caeb5 100644 --- a/docs/source/serving/autoscaling.rst +++ b/docs/source/serving/autoscaling.rst @@ -46,7 +46,7 @@ In this example, SkyServe will launch 2 replicas of your service and scale up to .. tip:: - :code:`target_qps_per_replica` could be any positive floating point number. If process one request takes two seconds in one replica, using :code:`target_qps_per_replica=0.5`. + :code:`target_qps_per_replica` could be any positive floating point number. If processing one request takes two seconds in one replica, we can use :code:`target_qps_per_replica=0.5`. Scaling Delay ------------- @@ -70,7 +70,7 @@ SkyServe will not scale up or down immediately. Instead, SkyServe will wait for Scale Down to 0 =============== -If your service has a consecutive time period with no traffic, consider using :code:`min_replicas=0`: +If your service might experience long period of time with no traffic, consider using :code:`min_replicas=0`: .. code-block:: yaml :emphasize-lines: 4 From bca2a1a3e759c4f65f35f2cb3a8cbb42b90b5518 Mon Sep 17 00:00:00 2001 From: Zongheng Yang Date: Fri, 19 Jan 2024 10:09:44 -0800 Subject: [PATCH 06/10] Updates --- docs/source/serving/autoscaling.rst | 58 ++++++++++++++++++++++------- 1 file changed, 45 insertions(+), 13 deletions(-) diff --git a/docs/source/serving/autoscaling.rst b/docs/source/serving/autoscaling.rst index 2539f5caeb5..ab2ccb304da 100644 --- a/docs/source/serving/autoscaling.rst +++ b/docs/source/serving/autoscaling.rst @@ -3,7 +3,12 @@ Autoscaling =========== -SkyServe provides out-of-the-box autoscaling for your services. 
In a regular SkyServe Service, number of replica to launch is specified in the service section: +SkyServe provides out-of-the-box autoscaling for your services. + +Fixed Replicas +-------------- + +In a service YAML, the number of replicas to launch is specified in the ``service`` section's ``replicas`` field: .. code-block:: yaml :emphasize-lines: 3 @@ -14,12 +19,13 @@ SkyServe provides out-of-the-box autoscaling for your services. In a regular Sky # ... -In this case, SkyServe will launch 2 replicas of your service. However, this deployment is fixed and cannot respond to dynamic traffics. SkyServe provides autoscaling feature to help you scale your service up and down based on the traffic. +In this case, SkyServe will launch 2 replicas of your service. However, this deployment is fixed and cannot adjust to dynamic traffic. +SkyServe provides autoscaling to help you scale your service up and down based on traffic, as shown below. -Minimal Example ---------------- +Enabling Autoscaling +-------------------- -Following is a minimal example to enable autoscaling for your service: +Here is a minimal example to enable autoscaling for your service: .. code-block:: yaml :emphasize-lines: 3-6 @@ -33,7 +39,24 @@ Following is a minimal example to enable autoscaling for your service: # ... -In this example, SkyServe will launch 2 replicas of your service and scale up to 10 replicas if the traffic is high. The autoscaling is based on the QPS (Queries Per Second) of your service. SkyServe will scale your service so that, ultimately, each replica manages approximately :code:`target_qps_per_replica` queries per second; while in the same time, the final decision of replica numbers will be clipped in the range :code:`[min_replicas, max_replicas]`. This value could be a floating point as specified in the YAML above. 
If the QPS is higher than 2.5 per replica, SkyServe will launch more replicas (but no more than 10 replicas); if the QPS is lower than 2.5 per replica, SkyServe will scale down the replicas (but no less than 2 replicas). Specifically, the current target number of replicas is calculated as: +In this example, SkyServe will: + +- Initially, launch 2 replicas of your service (``min_replicas``) +- Scale up gradually if the traffic is high, up to 10 replicas (``max_replicas``) +- Scale down gradually if the traffic is low, with a minimum of 2 replicas (``min_replicas``) + +The replica count will always be in the range +:code:`[min_replicas, max_replicas]`. + +Autoscaling is performed based on the QPS (Queries Per Second) of your service. +SkyServe will scale your service so that, ultimately, each replica receives +approximately :code:`target_qps_per_replica` queries per second. +This value can be a float; for example: + +- If the QPS is higher than 2.5 per replica, SkyServe will launch more replicas (but no more than 10 replicas) +- If the QPS is lower than 2.5 per replica, SkyServe will scale down the replicas (but no less than 2 replicas) + +Specifically, the current target number of replicas is calculated as: .. code-block:: python @@ -42,16 +65,23 @@ In this example, SkyServe will launch 2 replicas of your service and scale up to .. tip:: - :code:`replica` is a shortcut for :code:`replica_policy.min_replicas`. These two fields cannot be specified at the same time. + :code:`replicas` is a shortcut for :code:`replica_policy.min_replicas`. These two fields cannot be specified at the same time. .. tip:: - :code:`target_qps_per_replica` could be any positive floating point number. If processing one request takes two seconds in one replica, we can use :code:`target_qps_per_replica=0.5`. + :code:`target_qps_per_replica` can be any positive floating point number. If processing one request takes two seconds in one replica, we can use :code:`target_qps_per_replica=0.5`. 
Scaling Delay ------------- -SkyServe will not scale up or down immediately. Instead, SkyServe will wait for a period of time before scaling up or down. This is to avoid scaling up and down too aggressively. SkyServe will only upscale or downscale your service if the QPS of your service is higher or lower than the target QPS for a period of time. The default scaling delay is 300s for upscale and 1200s for downscale. You can change the scaling delay by specifying the :code:`upscale_delay_seconds` and :code:`downscale_delay_seconds` field in the autoscaling section: +SkyServe will not scale up or down immediately. Instead, SkyServe will only +upscale or downscale your service if the QPS of your service is higher or lower +than the target QPS for a period of time. This is to avoid scaling up and down +too aggressively. + +The default scaling delay is 300s for upscale and 1200s for downscale. You can +change the scaling delay by specifying the :code:`upscale_delay_seconds` and +:code:`downscale_delay_seconds` fields: .. code-block:: yaml :emphasize-lines: 7-8 @@ -62,15 +92,17 @@ SkyServe will not scale up or down immediately. Instead, SkyServe will wait for min_replicas: 2 max_replicas: 10 target_qps_per_replica: 3 - upscale_delay_seconds: 600 - downscale_delay_seconds: 1800 + upscale_delay_seconds: 300 + downscale_delay_seconds: 1200 # ... -Scale Down to 0 +Scale-to-Zero =============== -If your service might experience long period of time with no traffic, consider using :code:`min_replicas=0`: +SkyServe supports scale-to-zero. + +If your service might experience long periods of time with no traffic, consider using :code:`min_replicas: 0`: .. 
code-block:: yaml :emphasize-lines: 4 From 7373639af735795b60e65a35296e3dd43253d5b7 Mon Sep 17 00:00:00 2001 From: Zongheng Yang Date: Fri, 19 Jan 2024 10:13:37 -0800 Subject: [PATCH 07/10] Updates --- docs/source/serving/sky-serve.rst | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/docs/source/serving/sky-serve.rst b/docs/source/serving/sky-serve.rst index 67531a17a1c..50a7cd0c8ee 100644 --- a/docs/source/serving/sky-serve.rst +++ b/docs/source/serving/sky-serve.rst @@ -12,9 +12,9 @@ Why SkyServe? * **Bring any serving framework** (vLLM, TGI, FastAPI, ...) and scale it across regions/clouds * **Reduce costs and increase availability** of service replicas by leveraging multiple/cheaper locations and hardware (spot instances) -* **Out-of-the-box** load-balancing and :ref:`autoscaling ` of service replicas +* Out-of-the-box **load-balancing** and **autoscaling** of service replicas +* **Privacy and Control**: Everything is launched inside your cloud accounts and VPCs * Manage multi-cloud, multi-region deployments with a single control plane -* **Privacy**: Everything is launched inside your cloud accounts and VPCs .. * Allocate scarce resources (e.g., A100) **across regions and clouds** .. * Autoscale your endpoint deployment with load balancing @@ -444,6 +444,12 @@ Terminate services: $ sky serve down http-server # terminate the http-server service $ sky serve down --all # terminate all services +Autoscaling +----------- + +See :ref:`Autoscaling ` for more information. 
+ + SkyServe Architecture --------------------- From 456db46056087fac35600423bfcb9ac9bd88c414 Mon Sep 17 00:00:00 2001 From: cblmemo Date: Fri, 19 Jan 2024 18:27:10 -0800 Subject: [PATCH 08/10] add desc on scaling delay --- docs/source/serving/autoscaling.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/serving/autoscaling.rst b/docs/source/serving/autoscaling.rst index ab2ccb304da..4c0c86b08af 100644 --- a/docs/source/serving/autoscaling.rst +++ b/docs/source/serving/autoscaling.rst @@ -97,6 +97,8 @@ change the scaling delay by specifying the :code:`upscale_delay_seconds` and # ... +If you want more aggressive scaling, set those values to a lower number and vise versa. + Scale-to-Zero =============== From 5d1e127c543b76f23a585f61f0f256c21a2c9f22 Mon Sep 17 00:00:00 2001 From: Tian Xia Date: Sat, 20 Jan 2024 11:48:49 +0800 Subject: [PATCH 09/10] Update docs/source/serving/autoscaling.rst Co-authored-by: Zongheng Yang --- docs/source/serving/autoscaling.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/serving/autoscaling.rst b/docs/source/serving/autoscaling.rst index 4c0c86b08af..f9992a1fed8 100644 --- a/docs/source/serving/autoscaling.rst +++ b/docs/source/serving/autoscaling.rst @@ -97,7 +97,7 @@ change the scaling delay by specifying the :code:`upscale_delay_seconds` and # ... -If you want more aggressive scaling, set those values to a lower number and vise versa. +If you want more aggressive scaling, set those values to a lower number and vice versa. 
Scale-to-Zero =============== From e273d78f10f3896720f6c5da52ffb6c2ea70d3d7 Mon Sep 17 00:00:00 2001 From: cblmemo Date: Fri, 19 Jan 2024 19:49:53 -0800 Subject: [PATCH 10/10] rephrase --- docs/source/serving/autoscaling.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/serving/autoscaling.rst b/docs/source/serving/autoscaling.rst index f9992a1fed8..d9f2d859924 100644 --- a/docs/source/serving/autoscaling.rst +++ b/docs/source/serving/autoscaling.rst @@ -118,4 +118,4 @@ If your service might experience long periods of time with no traffic, consider # ... -The service will scale down all replicas when there is no traffic to the system and will save costs on idle replicas. In this case, the scale up will be faster when the system has no replicas: it will **scale up immediately if any traffic detected**. +The service will scale down all replicas when there is no traffic to the system and will save costs on idle replicas. When upscaling from zero, the upscale delay will be ignored in order to bring up the service faster.
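Note on the replica-count formula that this series adds to ``autoscaling.rst``: the following is a minimal Python sketch of that formula, not SkyServe's actual implementation. The function name ``target_replicas`` and its signature are hypothetical, introduced only for illustration. It restates the two documented steps (take the ceiling of the current QPS divided by ``target_qps_per_replica``, then clip to ``[min_replicas, max_replicas]``) and deliberately ignores ``upscale_delay_seconds`` and ``downscale_delay_seconds``.

```python
import math


def target_replicas(current_qps: float,
                    target_qps_per_replica: float,
                    min_replicas: int,
                    max_replicas: int) -> int:
    """Hypothetical helper mirroring the formula in the docs.

    Assumption: target_qps_per_replica is a positive float, as the
    docs state; any other value is rejected here for safety.
    """
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    # Replicas needed so that each one receives roughly the target QPS.
    current_target = math.ceil(current_qps / target_qps_per_replica)
    # Clip the result to the configured [min_replicas, max_replicas] range.
    return min(max_replicas, max(min_replicas, current_target))


if __name__ == "__main__":
    # Settings from the "Enabling Autoscaling" example: min 2, max 10, target 2.5.
    print(target_replicas(12, 2.5, 2, 10))   # 12 / 2.5 = 4.8, rounded up to 5
    print(target_replicas(100, 2.5, 2, 10))  # capped at max_replicas: 10
    print(target_replicas(0, 2.5, 2, 10))    # floored at min_replicas: 2
    # Settings from the scale-to-zero example: min 0, max 3, target 6.3.
    print(target_replicas(0, 6.3, 0, 3))     # no traffic: 0
```

With ``min_replicas: 0`` the same clipping step is what lets the target drop to zero when there is no traffic; how quickly the service actually reaches that target depends on the scaling delays described in the docs.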