
Commit

address comments
Signed-off-by: Yuan Zhou <[email protected]>
zhouyuan committed Oct 11, 2024
1 parent 9036d5e commit fc1f394
Showing 3 changed files with 6 additions and 5 deletions.
5 changes: 3 additions & 2 deletions docs/source/getting_started/cpu-installation.rst
@@ -8,7 +8,8 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
- Tensor Parallel (``-tp = N``)
- Quantization (``INT8 W8A8, AWQ``)

-FP16 data type and more advanced features on `chunked-prefill`, `prefix-caching` and `FP8 KV cache` are under development and will be available soon.
+.. note::
+    FP16 data type and more advanced features on `chunked-prefill`, `prefix-caching` and `FP8 KV cache` are under development and will be available soon.

Table of contents:

@@ -176,4 +177,4 @@ CPU Backend Considerations
$ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
-* Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like `Nginx <nginx-loadbalancer.html>`_ or HAProxy are recommended. Anyscale Ray project provides the feature on LLM `serving <https://docs.ray.io/en/latest/serve/index.html>`_. Here is the example to setup a scalable LLM serving with `Ray Serve <https://github.com/intel/llm-on-ray/blob/main/docs/setup.md>`_.
+* Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like `Nginx <../serving/deploying_with_nginx.html>`_ or HAProxy are recommended. Anyscale Ray project provides the feature on LLM `serving <https://docs.ray.io/en/latest/serve/index.html>`_. Here is the example to setup a scalable LLM serving with `Ray Serve <https://github.com/intel/llm-on-ray/blob/main/docs/setup.md>`_.
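
As a rough illustration of the data-parallel recommendation above, the sketch below assumes two 32-core NUMA nodes and uses illustrative port numbers (8100, 8200, 8000); the upstream name and the nginx_vllm.conf path are hypothetical, not values taken from this commit.

# One vLLM endpoint per NUMA node (no tensor parallelism inside an endpoint).
# Ports, upstream name, and config path below are assumptions for illustration.
VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31" \
    vllm serve meta-llama/Llama-2-7b-chat-hf --port 8100 &
VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="32-63" \
    vllm serve meta-llama/Llama-2-7b-chat-hf --port 8200 &

# Round-robin both endpoints behind a single Nginx listener on port 8000.
cat > nginx_vllm.conf <<'EOF'
upstream vllm_cpu_backends {
    server 127.0.0.1:8100;
    server 127.0.0.1:8200;
}
server {
    listen 8000;
    location / {
        proxy_pass http://vllm_cpu_backends;
    }
}
EOF

An Nginx instance (or container) started with nginx_vllm.conf would then expose a single endpoint on port 8000 that dispatches requests across both NUMA-local servers.
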
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -71,7 +71,6 @@ Documentation
getting_started/xpu-installation
getting_started/quickstart
getting_started/debugging
-getting_started/nginx-loadbalancer
getting_started/examples/examples_index

.. toctree::
@@ -81,6 +80,7 @@ Documentation
serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
+serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
serving/env_vars
4 changes: 2 additions & 2 deletions docs/source/{getting_started/nginx-loadbalancer.rst → serving/deploying_with_nginx.rst}
@@ -1,7 +1,7 @@
.. _nginxloadbalancer:

-Nginx Loadbalancer
-========================
+Deploying with Nginx Loadbalancer
+=================================

This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
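
A hypothetical sketch of the setup this document describes: two vLLM serving containers behind an Nginx container on a shared Docker network. The image tag, container names, ports, network name, and nginx.conf contents are assumptions for illustration, not commands taken from the doc or from this commit.

# Two vLLM servers plus one Nginx front end (names, ports, image are assumptions).
# A GPU host is assumed for the vllm/vllm-openai image; a locally built CPU image
# could be substituted.
docker network create vllm_nginx
docker run -d --gpus all --ipc=host --network vllm_nginx --name vllm0 \
    vllm/vllm-openai --model meta-llama/Llama-2-7b-chat-hf
docker run -d --gpus all --ipc=host --network vllm_nginx --name vllm1 \
    vllm/vllm-openai --model meta-llama/Llama-2-7b-chat-hf
# nginx.conf is expected to define an upstream pointing at vllm0:8000 and vllm1:8000.
docker run -d --network vllm_nginx -p 8000:80 \
    -v "$PWD/nginx.conf:/etc/nginx/conf.d/default.conf:ro" nginx

On the shared network, Nginx reaches the backends by container name, so only the load balancer's port needs to be published to the host.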

