address comments
Signed-off-by: Yuan Zhou <[email protected]>
zhouyuan committed Oct 18, 2024
1 parent afdbd51 commit b4b6400
Showing 3 changed files with 6 additions and 5 deletions.
5 changes: 3 additions & 2 deletions docs/source/getting_started/cpu-installation.rst
@@ -8,7 +8,8 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
 - Tensor Parallel (``-tp = N``)
 - Quantization (``INT8 W8A8, AWQ``)

-FP16 data type and more advanced features on `chunked-prefill`, `prefix-caching` and `FP8 KV cache` are under development and will be available soon.
+.. note::
+   FP16 data type and more advanced features on `chunked-prefill`, `prefix-caching` and `FP8 KV cache` are under development and will be available soon.

 Table of contents:

@@ -162,4 +163,4 @@ CPU Backend Considerations
 $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
-* Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like `Nginx <nginx-loadbalancer.html>`_ or HAProxy are recommended. Anyscale Ray project provides the feature on LLM `serving <https://docs.ray.io/en/latest/serve/index.html>`_. Here is the example to setup a scalable LLM serving with `Ray Serve <https://github.com/intel/llm-on-ray/blob/main/docs/setup.md>`_.
+* Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like `Nginx <../serving/deploying_with_nginx.html>`_ or HAProxy are recommended. Anyscale Ray project provides the feature on LLM `serving <https://docs.ray.io/en/latest/serve/index.html>`_. Here is the example to setup a scalable LLM serving with `Ray Serve <https://github.com/intel/llm-on-ray/blob/main/docs/setup.md>`_.
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -71,7 +71,6 @@ Documentation
    getting_started/xpu-installation
    getting_started/quickstart
    getting_started/debugging
-   getting_started/nginx-loadbalancer
    getting_started/examples/examples_index

 .. toctree::
@@ -81,6 +80,7 @@ Documentation
    serving/openai_compatible_server
    serving/deploying_with_docker
    serving/deploying_with_k8s
+   serving/deploying_with_nginx
    serving/distributed_serving
    serving/metrics
    serving/env_vars
@@ -1,7 +1,7 @@
 .. _nginxloadbalancer:

-Nginx Loadbalancer
-========================
+Deploying with Nginx Loadbalancer
+=================================

 This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
