Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor TPU requirements file and pin build dependencies #10008

Closed
wants to merge 56 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
59776c5
Pin dockerfile
richardsliu Oct 31, 2024
1fc4621
fix manual installation
richardsliu Nov 1, 2024
0c971e4
refactor requirements tpu
richardsliu Nov 2, 2024
d00f062
update torch-xla
richardsliu Nov 4, 2024
23dfa46
fix documentation
richardsliu Nov 4, 2024
9efb5fe
[Bugfix][Frontend] Reject guided decoding in multistep mode (#9892)
joerunde Nov 1, 2024
0458bc6
[torch.compile] directly register custom op (#9896)
youkaichao Nov 1, 2024
fd71c52
[Bugfix] Fix layer skip logic with bitsandbytes (#9887)
mgoin Nov 1, 2024
2f80b13
[torch.compile] rework test plans (#9866)
youkaichao Nov 1, 2024
7c07ae3
[Model] Support bitsandbytes for MiniCPMV (#9891)
mgoin Nov 1, 2024
92b845d
[torch.compile] Adding torch compile annotations to some models (#9876)
CRZbulabula Nov 1, 2024
76936dc
[Doc] Update multi-input support (#9906)
DarkLight1337 Nov 1, 2024
6432426
[Frontend] Chat-based Embeddings API (#9759)
DarkLight1337 Nov 1, 2024
cc856c0
[CI/Build] Add Model Tests for PixtralHF (#9813)
mgoin Nov 1, 2024
3678bf1
[Frontend] Use a proper chat template for VLM2Vec (#9912)
DarkLight1337 Nov 1, 2024
f2f6195
[Bugfix] Fix edge cases for MistralTokenizer (#9625)
tjohnson31415 Nov 1, 2024
7a720dc
[Core] Refactor: Clean up unused argument in Scheduler._preempt (#9696)
andrejonasson Nov 1, 2024
a04aeb0
[torch.compile] use interpreter with stable api from pytorch (#9889)
youkaichao Nov 1, 2024
e66b29b
[Bugfix/Core] Flashinfer k_scale and v_scale (#9861)
pavanimajety Nov 1, 2024
522cc4a
[1/N] pass the complete config from engine to executor (#9933)
youkaichao Nov 1, 2024
771f4fd
[Bugfix] PicklingError on RayTaskError (#9934)
GeneDer Nov 1, 2024
7655274
[ci/build] Bump the patch-update group with 10 updates (#9897)
dependabot[bot] Nov 1, 2024
a13a962
[Core][VLM] Add precise multi-modal placeholder tracking (#8346)
petersalas Nov 1, 2024
f2183e3
[ci/build] Have dependabot ignore pinned dependencies (#9935)
khluu Nov 1, 2024
816f845
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder m…
sroy745 Nov 2, 2024
b9d1578
[torch.compile] fix cpu broken code (#9947)
youkaichao Nov 2, 2024
ac3a65b
[Docs] Update Granite 3.0 models in supported models table (#9930)
njhill Nov 2, 2024
dcf7a60
[Doc] Updated tpu-installation.rst with more details (#9926)
mikegre-google Nov 2, 2024
8a8fe7d
[2/N] executor pass the complete config to worker/modelrunner (#9938)
youkaichao Nov 2, 2024
45ca77a
[V1] Fix `EngineArgs` refactor on V1 (#9954)
robertgshaw2-neuralmagic Nov 2, 2024
b5d19f4
[bugfix] fix chatglm dummy_data_for_glmv (#9955)
youkaichao Nov 2, 2024
1dcbf25
[3/N] model runner pass the whole config to model (#9958)
youkaichao Nov 2, 2024
a988f63
[CI/Build] Quoting around > (#9956)
nokados Nov 2, 2024
cd91b02
[torch.compile] Adding torch compile to vision-language models (#9946)
CRZbulabula Nov 2, 2024
6519daa
[bugfix] fix tsts (#9959)
youkaichao Nov 2, 2024
63c4c09
[V1] Support per-request seed (#9945)
njhill Nov 3, 2024
bab1231
[Model] Add support for H2OVL-Mississippi models (#9747)
cooleel Nov 4, 2024
122c821
[V1] Fix Configs (#9971)
robertgshaw2-neuralmagic Nov 4, 2024
2623641
[Bugfix] Fix MiniCPMV and Mllama BNB bug (#9917)
jeejeelee Nov 4, 2024
3c961bc
[Bugfix]Using the correct type hints (#9885)
gshtras Nov 4, 2024
d36328c
[Misc] Compute query_start_loc/seq_start_loc on CPU (#9447)
zhengy001 Nov 4, 2024
94fb4db
[Bugfix] Fix E2EL mean and median stats (#9984)
daitran2k1 Nov 4, 2024
68deb96
[Bugfix][OpenVINO] Fix circular reference #9939 (#9974)
MengqingCao Nov 4, 2024
9ddd35a
[Frontend] Multi-Modality Support for Loading Local Image Files (#9915)
chaunceyjiang Nov 4, 2024
6eb2ed0
[4/N] make quant config first-class citizen (#9978)
youkaichao Nov 4, 2024
a2a024e
[Misc]Reduce BNB static variable (#9987)
jeejeelee Nov 4, 2024
db45c39
[Model] factoring out MambaMixer out of Jamba (#8993)
mzusman Nov 4, 2024
af148c8
[CI] Basic Integration Test For TPU (#9968)
robertgshaw2-neuralmagic Nov 4, 2024
6d353ed
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests ru…
hissu-hyvarinen Nov 4, 2024
db1e670
[Doc] Update VLM doc about loading from local files (#9999)
ywang96 Nov 4, 2024
d667c05
[Bugfix] Fix `MQLLMEngine` hanging (#9973)
robertgshaw2-neuralmagic Nov 4, 2024
af37b9b
[Misc] Refactor benchmark_throughput.py (#9779)
lk-chen Nov 4, 2024
20f6878
[Frontend] Add max_tokens prometheus metric (#9881)
tomeras91 Nov 4, 2024
767e8ad
[Bugfix] Upgrade to pytorch 2.5.1 (#10001)
bnellnm Nov 4, 2024
fc7d716
[4.5/N] bugfix for quant config in speculative decode (#10007)
youkaichao Nov 4, 2024
b98cf25
Update tpu-installation.rst
richardsliu Nov 4, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 6 additions & 5 deletions .buildkite/run-amd-test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -107,11 +107,12 @@ fi
PARALLEL_JOB_COUNT=8
# check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then
# assign job count as the number of shards used
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
#replace shard arguments
commands=${commands//"--shard-id= "/"--shard-id=${GPU} "}
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
echo "Shard ${GPU} commands:$commands"
# assign shard-id for each shard
commands_gpu=${commands//"--shard-id= "/"--shard-id=${GPU} "}
echo "Shard ${GPU} commands:$commands_gpu"
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
Expand All @@ -123,7 +124,7 @@ if [[ $commands == *"--shard-id="* ]]; then
-e HF_HOME=${HF_MOUNT} \
--name ${container_name}_${GPU} \
${image_name} \
/bin/bash -c "${commands}" \
/bin/bash -c "${commands_gpu}" \
|& while read -r line; do echo ">>Shard $GPU: $line"; done &
PIDS+=($!)
done
Expand Down
2 changes: 1 addition & 1 deletion .buildkite/run-tpu-test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -12,4 +12,4 @@ remove_docker_container
# For HF_TOKEN.
source /etc/environment
# Run a simple end-to-end example.
docker run --privileged --net host --shm-size=16G -it -e HF_TOKEN=$HF_TOKEN --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"
docker run --privileged --net host --shm-size=16G -it -e HF_TOKEN=$HF_TOKEN --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && python3 -m pip install lm_eval[api]==0.4.4 && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"
9 changes: 9 additions & 0 deletions .github/dependabot.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,15 @@ updates:
reviewers: ["khluu", "simon-mo"]
allow:
- dependency-type: "all"
ignore:
- dependency-name: "torch"
- dependency-name: "torchvision"
- dependency-name: "xformers"
- dependency-name: "lm-format-enforcer"
- dependency-name: "gguf"
- dependency-name: "compressed-tensors"
- dependency-name: "ray[adag]"
- dependency-name: "lm-eval"
groups:
patch-update:
applies-to: version-updates
Expand Down
4 changes: 2 additions & 2 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -49,8 +49,8 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx11
# requirements.txt files and should be kept consistent. The ROCm torch
# versions are derived from Dockerfile.rocm
#
set(TORCH_SUPPORTED_VERSION_CUDA "2.5.0")
set(TORCH_SUPPORTED_VERSION_ROCM "2.5.0")
set(TORCH_SUPPORTED_VERSION_CUDA "2.5.1")
set(TORCH_SUPPORTED_VERSION_ROCM "2.5.1")

#
# Try to find python package with an executable that exactly matches
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -206,7 +206,7 @@ FROM vllm-base AS vllm-openai

# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer 'modelscope!=1.15.0' bitsandbytes>=0.44.0 timm==0.9.10
pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.44.0' timm==0.9.10

ENV VLLM_USAGE_SOURCE production-docker-image

Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.neuron
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi

RUN python3 -m pip install -U \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
-r requirements-neuron.txt

ENV VLLM_TARGET_DEVICE neuron
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.ppc64le
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ RUN --mount=type=bind,source=.git,target=.git \
# These packages will be in rocketce eventually
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -v --prefer-binary --extra-index-url https://repo.fury.io/mgiessing \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
torch==2.3.1 \
-r requirements-cpu.txt \
xformers uvloop==0.20.0
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.rocm
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip uninstall -y torch torchvision \
&& python3 -m pip install --pre \
torch==2.6.0.dev20240918 \
setuptools-scm>=8 \
'setuptools-scm>=8' \
torchvision==0.20.0.dev20240918 \
--extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2;; \
*) ;; esac
Expand Down
7 changes: 0 additions & 7 deletions Dockerfile.tpu
Original file line number Diff line number Diff line change
Expand Up @@ -9,12 +9,6 @@ RUN apt-get update && apt-get install -y \
git \
ffmpeg libsm6 libxext6 libgl1

# Install the TPU and Pallas dependencies.
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html

# Build vLLM.
COPY . .
ARG GIT_REPO_CHECK=0
Expand All @@ -25,7 +19,6 @@ ENV VLLM_TARGET_DEVICE="tpu"
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=.git,target=.git \
python3 -m pip install \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
-r requirements-tpu.txt
RUN python3 setup.py develop

Expand Down
4 changes: 2 additions & 2 deletions benchmarks/benchmark_serving.py
Original file line number Diff line number Diff line change
Expand Up @@ -406,9 +406,9 @@ def calculate_metrics(
median_itl_ms=np.median(itls or 0) * 1000,
percentiles_itl_ms=[(p, np.percentile(itls or 0, p) * 1000)
for p in selected_percentiles],
mean_e2el_ms=np.median(e2els or 0) * 1000,
mean_e2el_ms=np.mean(e2els or 0) * 1000,
std_e2el_ms=np.std(e2els or 0) * 1000,
median_e2el_ms=np.mean(e2els or 0) * 1000,
median_e2el_ms=np.median(e2els or 0) * 1000,
percentiles_e2el_ms=[(p, np.percentile(e2els or 0, p) * 1000)
for p in selected_percentiles],
)
Expand Down
81 changes: 55 additions & 26 deletions benchmarks/benchmark_throughput.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
import json
import random
import time
from typing import List, Optional, Tuple
from typing import List, Optional

import torch
import uvloop
Expand All @@ -15,16 +15,35 @@
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
from vllm.entrypoints.openai.api_server import (
build_async_engine_client_from_engine_args)
from vllm.inputs import TextPrompt
from vllm.multimodal import MultiModalDataDict
from vllm.sampling_params import BeamSearchParams
from vllm.utils import FlexibleArgumentParser, merge_async_iterators


@dataclasses.dataclass
class SampleRequest:
"""A class representing a single inference request for benchmarking.

Attributes:
prompt: The input text prompt for the model.
multi_modal_data: Optional dictionary containing multi-modal data (e.g.
images).
prompt_len: The length of the prompt in tokens.
expected_output_len: The expected length of the output in tokens.
"""
prompt: str
prompt_len: int
expected_output_len: int
multi_modal_data: Optional[MultiModalDataDict] = None


def sample_requests(
dataset_path: str,
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
fixed_output_len: Optional[int],
) -> List[Tuple[str, int, int]]:
) -> List[SampleRequest]:
if fixed_output_len is not None and fixed_output_len < 4:
raise ValueError("output_len too small")

Expand All @@ -41,7 +60,7 @@ def sample_requests(
random.shuffle(dataset)

# Filter out sequences that are too long or too short
filtered_dataset: List[Tuple[str, int, int]] = []
filtered_dataset: List[SampleRequest] = []
for i in range(len(dataset)):
if len(filtered_dataset) == num_requests:
break
Expand All @@ -60,31 +79,34 @@ def sample_requests(
if prompt_len > 1024 or prompt_len + output_len > 2048:
# Prune too long sequences.
continue
filtered_dataset.append((prompt, prompt_len, output_len))
filtered_dataset.append(
SampleRequest(prompt=prompt,
prompt_len=prompt_len,
expected_output_len=output_len))

return filtered_dataset


def run_vllm(
requests: List[Tuple[str, int, int]],
requests: List[SampleRequest],
n: int,
engine_args: EngineArgs,
) -> float:
from vllm import LLM, SamplingParams
llm = LLM(**dataclasses.asdict(engine_args))

# Add the requests to the engine.
prompts: List[str] = []
prompts: List[TextPrompt] = []
sampling_params: List[SamplingParams] = []
for prompt, _, output_len in requests:
prompts.append(prompt)
for request in requests:
prompts.append(TextPrompt(prompt=request.prompt))
sampling_params.append(
SamplingParams(
n=n,
temperature=1.0,
top_p=1.0,
ignore_eos=True,
max_tokens=output_len,
max_tokens=request.expected_output_len,
))

use_beam_search = False
Expand All @@ -94,11 +116,11 @@ def run_vllm(
llm.generate(prompts, sampling_params, use_tqdm=True)
end = time.perf_counter()
else:
prompts = [prompt for prompt, _, _ in requests]
prompts = [request.prompt for request in requests]
# output_len should be the same for all requests.
output_len = requests[0][2]
for prompt, input_len, _output_len in requests:
assert _output_len == output_len
for request in requests:
assert request.expected_output_len == output_len
start = time.perf_counter()
llm.beam_search(
prompts,
Expand All @@ -112,7 +134,7 @@ def run_vllm(


async def run_vllm_async(
requests: List[Tuple[str, int, int]],
requests: List[SampleRequest],
n: int,
engine_args: AsyncEngineArgs,
disable_frontend_multiprocessing: bool = False,
Expand All @@ -123,17 +145,17 @@ async def run_vllm_async(
engine_args, disable_frontend_multiprocessing) as llm:

# Add the requests to the engine.
prompts: List[str] = []
prompts: List[TextPrompt] = []
sampling_params: List[SamplingParams] = []
for prompt, _, output_len in requests:
prompts.append(prompt)
for request in requests:
prompts.append(TextPrompt(prompt=request.prompt))
sampling_params.append(
SamplingParams(
n=n,
temperature=1.0,
top_p=1.0,
ignore_eos=True,
max_tokens=output_len,
max_tokens=request.expected_output_len,
))

generators = []
Expand All @@ -149,7 +171,7 @@ async def run_vllm_async(


def run_hf(
requests: List[Tuple[str, int, int]],
requests: List[SampleRequest],
model: str,
tokenizer: PreTrainedTokenizerBase,
n: int,
Expand Down Expand Up @@ -207,14 +229,14 @@ def run_hf(


def run_mii(
requests: List[Tuple[str, int, int]],
requests: List[SampleRequest],
model: str,
tensor_parallel_size: int,
output_len: int,
) -> float:
from mii import client, serve
llm = serve(model, tensor_parallel=tensor_parallel_size)
prompts = [prompt for prompt, _, _ in requests]
prompts = [request.prompt for request in requests]

start = time.perf_counter()
llm.generate(prompts, max_new_tokens=output_len)
Expand Down Expand Up @@ -243,8 +265,12 @@ def main(args: argparse.Namespace):
else:
raise ValueError(
f"Failed to synthesize a prompt with {args.input_len} tokens.")
requests = [(prompt, args.input_len, args.output_len)
for _ in range(args.num_prompts)]
requests = [
SampleRequest(prompt=prompt,
prompt_len=args.input_len,
expected_output_len=args.output_len)
for _ in range(args.num_prompts)
]
else:
requests = sample_requests(args.dataset, args.num_prompts, tokenizer,
args.output_len)
Expand All @@ -270,9 +296,10 @@ def main(args: argparse.Namespace):
args.output_len)
else:
raise ValueError(f"Unknown backend: {args.backend}")
total_num_tokens = sum(prompt_len + output_len
for _, prompt_len, output_len in requests)
total_output_tokens = sum(output_len for _, _, output_len in requests)
total_num_tokens = sum(request.prompt_len + request.expected_output_len
for request in requests)
total_output_tokens = sum(request.expected_output_len
for request in requests)
print(f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, "
f"{total_num_tokens / elapsed_time:.2f} total tokens/s, "
f"{total_output_tokens / elapsed_time:.2f} output tokens/s")
Expand All @@ -299,7 +326,9 @@ def main(args: argparse.Namespace):
parser.add_argument("--dataset",
type=str,
default=None,
help="Path to the dataset.")
help="Path to the dataset. The dataset is expected to "
"be a json in form of List[Dict[..., conversations: "
"List[Dict[..., value: <prompt_or_response>]]]]")
parser.add_argument("--input-len",
type=int,
default=None,
Expand Down
2 changes: 2 additions & 0 deletions docs/requirements-docs.txt
Original file line number Diff line number Diff line change
Expand Up @@ -13,5 +13,7 @@ torch
py-cpuinfo
transformers
mistral_common >= 1.3.4
aiohttp
starlette
openai # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
partial-json-parser # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
2 changes: 1 addition & 1 deletion docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,6 @@ def setup(app):

# Mock out external dependencies here, otherwise the autodoc pages may be blank.
autodoc_mock_imports = [
"aiohttp",
"compressed_tensors",
"cpuinfo",
"cv2",
Expand Down Expand Up @@ -143,6 +142,7 @@ def add_line(self, line: str, source: str, *lineno: int) -> None:
"python": ("https://docs.python.org/3", None),
"typing_extensions":
("https://typing-extensions.readthedocs.io/en/latest", None),
"aiohttp": ("https://docs.aiohttp.org/en/stable", None),
"pillow": ("https://pillow.readthedocs.io/en/stable", None),
"numpy": ("https://numpy.org/doc/stable", None),
"torch": ("https://pytorch.org/docs/stable", None),
Expand Down
5 changes: 5 additions & 0 deletions docs/source/dev/pooling_params.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
Pooling Parameters
==================

.. autoclass:: vllm.PoolingParams
:members:
8 changes: 4 additions & 4 deletions docs/source/getting_started/quickstart.rst
Original file line number Diff line number Diff line change
Expand Up @@ -138,10 +138,10 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep

A more detailed client example can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`__.

OpenAI Chat API with vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenAI Chat Completions API with vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

vLLM is designed to also support the OpenAI Chat API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.

You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to interact with the model:

Expand All @@ -157,7 +157,7 @@ You can use the `create chat completion <https://platform.openai.com/docs/api-re
$ ]
$ }'

Alternatively, you can use the `openai` python package:
Alternatively, you can use the ``openai`` python package:

.. code-block:: python

Expand Down
Loading