
[Doc][CI/Build] Update docs and tests to use vllm serve #6431

Merged: 12 commits, Jul 17, 2024
7 changes: 2 additions & 5 deletions docs/source/getting_started/quickstart.rst
@@ -73,16 +73,13 @@ Start the server:

.. code-block:: console

- $ python -m vllm.entrypoints.openai.api_server \
- $ --model facebook/opt-125m
+ $ vllm serve facebook/opt-125m

By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:

.. code-block:: console

- $ python -m vllm.entrypoints.openai.api_server \
- $ --model facebook/opt-125m \
- $ --chat-template ./examples/template_chatml.jinja
+ $ vllm serve facebook/opt-125m --chat-template ./examples/template_chatml.jinja

This server can be queried in the same format as the OpenAI API. For example, list the models:

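For instance, a minimal Python sketch that lists the served models through the official OpenAI client (this assumes the server is running locally on the default port 8000):

```python
# Minimal sketch: list models from a local vLLM server (assumes default port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for model in client.models.list():
    print(model.id)  # e.g. "facebook/opt-125m"
```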
4 changes: 2 additions & 2 deletions docs/source/models/adding_model.rst
@@ -114,7 +114,7 @@ Just add the following lines in your code:
from your_code import YourModelForCausalLM
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)

- If you are running api server with `python -m vllm.entrypoints.openai.api_server args`, you can wrap the entrypoint with the following code:
+ If you are running the API server with :code:`vllm serve <args>`, you can wrap the entrypoint with the following code:

.. code-block:: python

@@ -124,4 +124,4 @@ If you are running api server with `python -m vllm.entrypoints.openai.api_server
import runpy
runpy.run_module('vllm.entrypoints.openai.api_server', run_name='__main__')

- Save the above code in a file and run it with `python your_file.py args`.
+ Save the above code in a file and run it with :code:`python your_file.py <args>`.
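Putting the two snippets together, a complete wrapper file might look like the following sketch (`your_code` and `YourModelForCausalLM` are placeholders for your own module and model class):

```python
# your_file.py -- sketch of a wrapper entrypoint; YourModelForCausalLM is a placeholder.
from vllm import ModelRegistry
from your_code import YourModelForCausalLM

# Register the out-of-tree model before the server parses its arguments.
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)

if __name__ == "__main__":
    import runpy
    # Hand control to the stock OpenAI-compatible API server.
    runpy.run_module("vllm.entrypoints.openai.api_server", run_name="__main__")
```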
4 changes: 2 additions & 2 deletions docs/source/models/engine_args.rst
@@ -8,7 +8,7 @@ Below, you can find an explanation of every engine argument for vLLM:
.. argparse::
:module: vllm.engine.arg_utils
:func: _engine_args_parser
- :prog: -m vllm.entrypoints.openai.api_server
+ :prog: vllm serve
:nodefaultconst:

Async Engine Arguments
@@ -19,5 +19,5 @@ Below are the additional arguments related to the asynchronous engine:
.. argparse::
:module: vllm.engine.arg_utils
:func: _async_engine_args_parser
- :prog: -m vllm.entrypoints.openai.api_server
+ :prog: vllm serve
:nodefaultconst:
3 changes: 1 addition & 2 deletions docs/source/models/lora.rst
@@ -61,8 +61,7 @@ LoRA adapted models can also be served with the Open-AI compatible vLLM server.

.. code-block:: bash

- python -m vllm.entrypoints.openai.api_server \
-     --model meta-llama/Llama-2-7b-hf \
+ vllm serve meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/

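Once the server is up, the adapter is addressable by the name given in `--lora-modules`. A hedged client sketch (prompt and generation parameters are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "sql-lora" is the adapter name registered via --lora-modules above.
completion = client.completions.create(
    model="sql-lora",
    prompt="SELECT count(*) FROM",
    max_tokens=32,
)
print(completion.choices[0].text)
```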
4 changes: 1 addition & 3 deletions docs/source/models/vlm.rst
@@ -94,9 +94,7 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with

.. code-block:: bash

- python -m vllm.entrypoints.openai.api_server \
-     --model llava-hf/llava-1.5-7b-hf \
-     --chat-template template_llava.jinja
+ vllm serve llava-hf/llava-1.5-7b-hf --chat-template template_llava.jinja

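For reference, a minimal client sketch against such a server (the image URL is a placeholder; the message shape follows the OpenAI vision API):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            # Placeholder URL -- substitute any reachable image.
            {"type": "image_url", "image_url": {"url": "https://example.com/duck.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```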
.. important::
We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
2 changes: 1 addition & 1 deletion docs/source/serving/deploying_with_dstack.rst
@@ -40,7 +40,7 @@ Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7
gpu: 24GB
commands:
- pip install vllm
-   - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
+   - vllm serve $MODEL --port 8000
model:
format: openai
type: chat
6 changes: 2 additions & 4 deletions docs/source/serving/distributed_serving.rst
@@ -35,16 +35,14 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh

.. code-block:: console

- $ python -m vllm.entrypoints.openai.api_server \
- $ --model facebook/opt-13b \
+ $ vllm serve facebook/opt-13b \
$ --tensor-parallel-size 4

You can also specify :code:`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with both pipeline parallelism and tensor parallelism:

.. code-block:: console

- $ python -m vllm.entrypoints.openai.api_server \
- $ --model gpt2 \
+ $ vllm serve gpt2 \
$ --tensor-parallel-size 4 \
$ --pipeline-parallel-size 2 \
$ --distributed-executor-backend ray
8 changes: 3 additions & 5 deletions docs/source/serving/openai_compatible_server.md
@@ -4,7 +4,7 @@ vLLM provides an HTTP server that implements OpenAI's [Completions](https://plat

You can start the server using Python, or using [Docker](deploying_with_docker.rst):
```bash
- python -m vllm.entrypoints.openai.api_server --model NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
+ vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
```

To call the server, you can use the official OpenAI Python client library, or any other HTTP client.
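For example, a minimal sketch using the official client against the command above (note the client's key must match `--api-key`):

```python
from openai import OpenAI

# The key must match the --api-key passed to `vllm serve`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```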
@@ -97,9 +97,7 @@ template, or the template in string form. Without a chat template, the server wi
and all chat requests will error.

```bash
- python -m vllm.entrypoints.openai.api_server \
-   --model ... \
-   --chat-template ./path-to-chat-template.jinja
+ vllm serve <model> --chat-template ./path-to-chat-template.jinja
```

The vLLM community provides a set of chat templates for popular models. You can find them in the examples
@@ -110,7 +108,7 @@ directory [here](https://github.com/vllm-project/vllm/tree/main/examples/)
```{argparse}
:module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs
- :prog: -m vllm.entrypoints.openai.api_server
+ :prog: vllm serve
```

## Tool calling in the chat completion API
5 changes: 2 additions & 3 deletions examples/api_client.py
@@ -1,8 +1,7 @@
"""Example Python client for vllm.entrypoints.api_server
"""Example Python client for `vllm.entrypoints.api_server`
NOTE: The API server is used only for demonstration and simple performance
benchmarks. It is not intended for production use.
- For production use, we recommend vllm.entrypoints.openai.api_server
- and the OpenAI client API
+ For production use, we recommend `vllm serve` and the OpenAI client API.
"""

import argparse
12 changes: 3 additions & 9 deletions examples/logging_configuration.md
@@ -95,9 +95,7 @@ to the path of the custom logging configuration JSON file:

```bash
VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json \
- python3 -m vllm.entrypoints.openai.api_server \
-   --max-model-len 2048 \
-   --model mistralai/Mistral-7B-v0.1
+ vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
```
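The referenced JSON follows the standard `logging.config.dictConfig` schema; below is a hedged sketch that generates one (formatter and handler values are illustrative, and the exact keys vLLM honors are per the surrounding guide):

```python
# Sketch: emit a dictConfig-style JSON for VLLM_LOGGING_CONFIG_PATH.
import json

config = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "plain",
            "stream": "ext://sys.stdout",
        },
    },
    "loggers": {
        "vllm": {"handlers": ["console"], "level": "INFO", "propagate": False},
    },
}

with open("logging_config.json", "w") as f:
    json.dump(config, f, indent=2)
```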


@@ -152,9 +150,7 @@ to the path of the custom logging configuration JSON file:

```bash
VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json \
- python3 -m vllm.entrypoints.openai.api_server \
-   --max-model-len 2048 \
-   --model mistralai/Mistral-7B-v0.1
+ vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
```


@@ -167,9 +163,7 @@ loggers.

```bash
VLLM_CONFIGURE_LOGGING=0 \
- python3 -m vllm.entrypoints.openai.api_server \
-   --max-model-len 2048 \
-   --model mistralai/Mistral-7B-v0.1
+ vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
```


4 changes: 1 addition & 3 deletions examples/openai_vision_api_client.py
@@ -1,9 +1,7 @@
"""An example showing how to use vLLM to serve VLMs.

Launch the vLLM server with the following command:
- python -m vllm.entrypoints.openai.api_server \
-     --model llava-hf/llava-1.5-7b-hf \
-     --chat-template template_llava.jinja
+ vllm serve llava-hf/llava-1.5-7b-hf --chat-template template_llava.jinja
"""
import base64

6 changes: 3 additions & 3 deletions examples/production_monitoring/Otel.md
@@ -36,7 +36,7 @@
```
export OTEL_SERVICE_NAME="vllm-server"
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
- python -m vllm.entrypoints.openai.api_server --model="facebook/opt-125m" --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
+ vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
```

1. In a new shell, send requests with trace context from a dummy client
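   A hedged sketch of such a dummy client: it attaches a W3C `traceparent` header so the server can join the client's trace (the header value, model, and prompt are illustrative; a real client would let an OpenTelemetry propagator inject the header).

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative W3C trace-context header value.
headers = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}

completion = client.completions.create(
    model="facebook/opt-125m",
    prompt="Hello",
    max_tokens=8,
    extra_headers=headers,
)
print(completion.choices[0].text)
```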
@@ -62,7 +62,7 @@ By default, `grpc` is used. To set `http/protobuf` as the protocol, configure th
```
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://$JAEGER_IP:4318/v1/traces
- python -m vllm.entrypoints.openai.api_server --model="facebook/opt-125m" --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
+ vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
```

## Instrumentation of FastAPI
@@ -74,7 +74,7 @@ OpenTelemetry allows automatic instrumentation of FastAPI.

1. Run vLLM with `opentelemetry-instrument`
```
- opentelemetry-instrument python -m vllm.entrypoints.openai.api_server --model="facebook/opt-125m"
+ opentelemetry-instrument vllm serve facebook/opt-125m
```

1. Send a request to vLLM and find its trace in Jaeger. It should contain spans from FastAPI.
3 changes: 1 addition & 2 deletions examples/production_monitoring/README.md
@@ -10,8 +10,7 @@ Install:

Prometheus metric logging is enabled by default in the OpenAI-compatible server. Launch via the entrypoint:
```bash
- python3 -m vllm.entrypoints.openai.api_server \
-   --model mistralai/Mistral-7B-v0.1 \
+ vllm serve mistralai/Mistral-7B-v0.1 \
--max-model-len 2048 \
--disable-log-requests
```
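The metrics are exposed on the server's `/metrics` endpoint; a quick sketch to verify (assumes the default port 8000):

```python
# Sketch: dump the Prometheus metrics exposed by a local vLLM server.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    print(resp.read().decode("utf-8"))
```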
22 changes: 11 additions & 11 deletions tests/async_engine/test_openapi_server_ray.py
@@ -9,17 +9,17 @@

@pytest.fixture(scope="module")
def server():
-    with RemoteOpenAIServer([
-        "--model",
-        MODEL_NAME,
-        # use half precision for speed and memory savings in CI environment
-        "--dtype",
-        "float16",
-        "--max-model-len",
-        "2048",
-        "--enforce-eager",
-        "--engine-use-ray"
-    ]) as remote_server:
+    args = [
+        # use half precision for speed and memory savings in CI environment
+        "--dtype",
+        "float16",
+        "--max-model-len",
+        "2048",
+        "--enforce-eager",
+        "--engine-use-ray"
+    ]
+
+    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


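The pattern repeated across these test files: the model name moves out of the flag list and becomes the helper's first positional argument, mirroring `vllm serve MODEL <flags>`. A hedged sketch of the new usage (the import path and `get_client` helper are assumed from the test utilities; the model and flags are illustrative):

```python
from tests.utils import RemoteOpenAIServer  # assumed test-utility import path

MODEL_NAME = "facebook/opt-125m"  # illustrative model
args = ["--dtype", "float16", "--max-model-len", "2048", "--enforce-eager"]

# The model is positional now; only CLI flags stay in the list.
with RemoteOpenAIServer(MODEL_NAME, args) as server:
    client = server.get_client()  # assumed helper returning an OpenAI client
    print([m.id for m in client.models.list()])
```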
5 changes: 2 additions & 3 deletions tests/distributed/test_pipeline_parallel.py
@@ -20,8 +20,6 @@
@pytest.fixture(scope="module")
def server():
    args = [
-        "--model",
-        MODEL_NAME,
        # use half precision for speed and memory savings in CI environment
        "--dtype",
        "bfloat16",
@@ -40,7 +38,8 @@ def server():
    args += [
        "--enforce-eager",
    ]
-    with RemoteOpenAIServer(args) as remote_server:
+
+    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


42 changes: 21 additions & 21 deletions tests/entrypoints/openai/test_chat.py
@@ -27,27 +27,27 @@ def zephyr_lora_files():

@pytest.fixture(scope="module")
def server(zephyr_lora_files):
-    with RemoteOpenAIServer([
-        "--model",
-        MODEL_NAME,
-        # use half precision for speed and memory savings in CI environment
-        "--dtype",
-        "bfloat16",
-        "--max-model-len",
-        "8192",
-        "--enforce-eager",
-        # lora config below
-        "--enable-lora",
-        "--lora-modules",
-        f"zephyr-lora={zephyr_lora_files}",
-        f"zephyr-lora2={zephyr_lora_files}",
-        "--max-lora-rank",
-        "64",
-        "--max-cpu-loras",
-        "2",
-        "--max-num-seqs",
-        "128",
-    ]) as remote_server:
+    args = [
+        # use half precision for speed and memory savings in CI environment
+        "--dtype",
+        "bfloat16",
+        "--max-model-len",
+        "8192",
+        "--enforce-eager",
+        # lora config below
+        "--enable-lora",
+        "--lora-modules",
+        f"zephyr-lora={zephyr_lora_files}",
+        f"zephyr-lora2={zephyr_lora_files}",
+        "--max-lora-rank",
+        "64",
+        "--max-cpu-loras",
+        "2",
+        "--max-num-seqs",
+        "128",
+    ]
+
+    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


60 changes: 30 additions & 30 deletions tests/entrypoints/openai/test_completion.py
@@ -37,36 +37,36 @@ def zephyr_pa_files():

@pytest.fixture(scope="module")
def server(zephyr_lora_files, zephyr_pa_files):
-    with RemoteOpenAIServer([
-        "--model",
-        MODEL_NAME,
-        # use half precision for speed and memory savings in CI environment
-        "--dtype",
-        "bfloat16",
-        "--max-model-len",
-        "8192",
-        "--max-num-seqs",
-        "128",
-        "--enforce-eager",
-        # lora config
-        "--enable-lora",
-        "--lora-modules",
-        f"zephyr-lora={zephyr_lora_files}",
-        f"zephyr-lora2={zephyr_lora_files}",
-        "--max-lora-rank",
-        "64",
-        "--max-cpu-loras",
-        "2",
-        # pa config
-        "--enable-prompt-adapter",
-        "--prompt-adapters",
-        f"zephyr-pa={zephyr_pa_files}",
-        f"zephyr-pa2={zephyr_pa_files}",
-        "--max-prompt-adapters",
-        "2",
-        "--max-prompt-adapter-token",
-        "128",
-    ]) as remote_server:
+    args = [
+        # use half precision for speed and memory savings in CI environment
+        "--dtype",
+        "bfloat16",
+        "--max-model-len",
+        "8192",
+        "--max-num-seqs",
+        "128",
+        "--enforce-eager",
+        # lora config
+        "--enable-lora",
+        "--lora-modules",
+        f"zephyr-lora={zephyr_lora_files}",
+        f"zephyr-lora2={zephyr_lora_files}",
+        "--max-lora-rank",
+        "64",
+        "--max-cpu-loras",
+        "2",
+        # pa config
+        "--enable-prompt-adapter",
+        "--prompt-adapters",
+        f"zephyr-pa={zephyr_pa_files}",
+        f"zephyr-pa2={zephyr_pa_files}",
+        "--max-prompt-adapters",
+        "2",
+        "--max-prompt-adapter-token",
+        "128",
+    ]
+
+    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


Expand Down
Loading
Loading