diff --git a/docs/source/serving/fast-replica-startup.rst b/docs/source/serving/fast-replica-startup.rst new file mode 100644 index 00000000000..4093de2c3c7 --- /dev/null +++ b/docs/source/serving/fast-replica-startup.rst @@ -0,0 +1,28 @@ +Speeding Up Replica Setup +========================= + +When serving AI models, the setup process like dependencies installation and model weights downloading may take a lot of time. To speed up this process, you can use the :code:`ultra` disk tier: + +.. code-block:: yaml + :emphasize-lines: 7 + + service: + replicas: 2 + readiness_probe: /v1/models + resources: + ports: 8080 + accelerators: A10G:8 + disk_tier: ultra + +We find that when loading large models, the performance is sometime limited by the disk speed. By using the `ultra` disk tier, you can significantly reduce the time it takes to set up your replicas, allowing for faster response times and improved overall performance. Here is a comparison of disk tiers and their respective speeds. All tests are running on AWS and the result is the end-to-end execution time for launching a Llama 2 70b endpoint with the latest version of vLLM, on an :code:`A10G:8` instance (:code:`g5.48xlarge`). + +.. list-table:: + :widths: 10 10 + :header-rows: 1 + + * - Disk Tier + - Speed + * - :code:`ultra` + - 410s + * - :code:`high` + - 524s \ No newline at end of file diff --git a/docs/source/serving/user-guides.rst b/docs/source/serving/user-guides.rst index 8b9cba92b45..416649b92be 100644 --- a/docs/source/serving/user-guides.rst +++ b/docs/source/serving/user-guides.rst @@ -7,3 +7,4 @@ Serving User Guides update auth spot-policy + fast-replica-startup