
Commit 12b41b1 (2 parents: 6fea4c0 + aa25985)

Merge remote-tracking branch 'upstream/main'

Signed-off-by: Shanshan Shen <[email protected]>

267 files changed: +10615 -8836 lines


.buildkite/generate_index.py (+24)

@@ -0,0 +1,24 @@
+import argparse
+import os
+
+template = """<!DOCTYPE html>
+<html>
+<body>
+<h1>Links for vLLM</h1>
+<a href="../{wheel_html_escaped}">{wheel}</a><br/>
+</body>
+</html>
+"""
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--wheel", help="The wheel path.", required=True)
+args = parser.parse_args()
+
+filename = os.path.basename(args.wheel)
+
+with open("index.html", "w") as f:
+    print(f"Generated index.html for {args.wheel}")
+    # cloudfront requires escaping the '+' character
+    f.write(
+        template.format(wheel=filename,
+                        wheel_html_escaped=filename.replace("+", "%2B")))
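For reference, a quick sketch of the escaping this script performs; the wheel filename below is hypothetical:

```python
# Hypothetical wheel filename; CloudFront mishandles a raw '+' in URLs,
# so the index link escapes it as '%2B'.
filename = "vllm-1.0.0.dev+cu118-cp38-abi3-manylinux1_x86_64.whl"
print(filename.replace("+", "%2B"))
# -> vllm-1.0.0.dev%2Bcu118-cp38-abi3-manylinux1_x86_64.whl
```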

.buildkite/nightly-benchmarks/benchmark-pipeline.yaml (+3 -3)

@@ -65,9 +65,9 @@ steps:
       - VLLM_USAGE_SOURCE
       - HF_TOKEN

-  - block: "Run H100 Benchmark"
-    key: block-h100
-    depends_on: ~
+  #- block: "Run H100 Benchmark"
+  #  key: block-h100
+  #  depends_on: ~

   - label: "H100"
     # skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"

.buildkite/upload-wheels.sh (+29 -1)

@@ -23,6 +23,8 @@ wheel="$new_wheel"
 version=$(unzip -p "$wheel" '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
 echo "Version: $version"

+normal_wheel="$wheel" # Save the original wheel filename
+
 # If the version contains "dev", rename it to v1.0.0.dev for consistency
 if [[ $version == *dev* ]]; then
     suffix="${version##*.}"
@@ -32,12 +34,38 @@ if [[ $version == *dev* ]]; then
         new_version="1.0.0.dev"
     fi
     new_wheel="${wheel/$version/$new_version}"
-    mv -- "$wheel" "$new_wheel"
+    # use cp to keep both files in the artifacts directory
+    cp -- "$wheel" "$new_wheel"
     wheel="$new_wheel"
     version="$new_version"
 fi

 # Upload the wheel to S3
+python3 .buildkite/generate_index.py --wheel "$normal_wheel"
+
+# generate index for this commit
 aws s3 cp "$wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
+aws s3 cp "$normal_wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
+
+if [[ $normal_wheel == *"cu118"* ]]; then
+    # if $normal_wheel matches cu118, do not upload the index.html
+    echo "Skipping index files for cu118 wheels"
+else
+    # only upload index.html for cu12 wheels (default wheels)
+    aws s3 cp index.html "s3://vllm-wheels/$BUILDKITE_COMMIT/vllm/index.html"
+    aws s3 cp "s3://vllm-wheels/nightly/index.html" "s3://vllm-wheels/$BUILDKITE_COMMIT/index.html"
+fi
+
+# generate index for nightly
 aws s3 cp "$wheel" "s3://vllm-wheels/nightly/"
+aws s3 cp "$normal_wheel" "s3://vllm-wheels/nightly/"
+
+if [[ $normal_wheel == *"cu118"* ]]; then
+    # if $normal_wheel matches cu118, do not upload the index.html
+    echo "Skipping index files for cu118 wheels"
+else
+    # only upload index.html for cu12 wheels (default wheels)
+    aws s3 cp index.html "s3://vllm-wheels/nightly/vllm/index.html"
+fi
+
 aws s3 cp "$wheel" "s3://vllm-wheels/$version/"

.gitignore (+2)

@@ -81,6 +81,8 @@ instance/
 docs/_build/
 docs/source/getting_started/examples/*.rst
 !**/*.template.rst
+docs/source/getting_started/examples/*.md
+!**/*.template.md

 # PyBuilder
 .pybuilder/

Dockerfile (+4 -4)

@@ -2,7 +2,7 @@
 # to run the OpenAI compatible server.

 # Please update any changes made here to
-# docs/source/dev/dockerfile/dockerfile.rst and
+# docs/source/dev/dockerfile/dockerfile.md and
 # docs/source/assets/dev/dockerfile-stages-dependency.png

 ARG CUDA_VERSION=12.4.1
@@ -163,7 +163,7 @@ RUN PYTHON_VERSION_STR=$(echo ${PYTHON_VERSION} | sed 's/\.//g') && \
 RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
     && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
     && apt-get update -y \
-    && apt-get install -y ccache software-properties-common git curl sudo vim python3-pip \
+    && apt-get install -y ccache software-properties-common git curl wget sudo vim python3-pip \
     && apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
     && add-apt-repository ppa:deadsnakes/ppa \
     && apt-get update -y \
@@ -240,9 +240,9 @@ FROM vllm-base AS vllm-openai
 # install additional dependencies for openai api server
 RUN --mount=type=cache,target=/root/.cache/pip \
     if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
-        pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.42.0' 'timm==0.9.10'; \
+        pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.42.0' 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3]; \
     else \
-        pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.45.0' 'timm==0.9.10'; \
+        pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.45.0' 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3]; \
     fi

 ENV VLLM_USAGE_SOURCE production-docker-image
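The new boto3 and runai-model-streamer packages let the OpenAI image stream model weights from object storage. A hedged sketch of what that enables, assuming vLLM's `runai_streamer` load format (which these packages back); the bucket and model path are hypothetical:

```python
from vllm import LLM

# Assumption: the runai_streamer load format is available once the
# packages above are installed; the S3 path is hypothetical.
llm = LLM(
    model="s3://my-bucket/llama-3.1-8b-instruct/",
    load_format="runai_streamer",
)
```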

Dockerfile.cpu (+3 -3)

@@ -26,20 +26,20 @@ RUN pip install intel_extension_for_pytorch==2.5.0

 WORKDIR /workspace

+COPY requirements-build.txt requirements-build.txt
 ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
 ENV PIP_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
 RUN --mount=type=cache,target=/root/.cache/pip \
-    --mount=type=bind,src=requirements-build.txt,target=requirements-build.txt \
     pip install --upgrade pip && \
     pip install -r requirements-build.txt

 FROM cpu-test-1 AS build

 WORKDIR /workspace/vllm

+COPY requirements-common.txt requirements-common.txt
+COPY requirements-cpu.txt requirements-cpu.txt
 RUN --mount=type=cache,target=/root/.cache/pip \
-    --mount=type=bind,src=requirements-common.txt,target=requirements-common.txt \
-    --mount=type=bind,src=requirements-cpu.txt,target=requirements-cpu.txt \
     pip install -v -r requirements-cpu.txt

 COPY . .

benchmarks/disagg_benchmarks/disagg_overhead_benchmark.sh (+7 -6)

@@ -10,7 +10,8 @@ set -ex

 kill_gpu_processes() {
   # kill all processes on GPU.
-  pkill -f pt_main_thread
+  pgrep pt_main_thread | xargs -r kill -9
+  pgrep python3 | xargs -r kill -9
   sleep 10

   # remove vllm config file
@@ -54,7 +55,7 @@ benchmark() {

   CUDA_VISIBLE_DEVICES=0 python3 \
     -m vllm.entrypoints.openai.api_server \
-    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --model $model \
     --port 8100 \
     --max-model-len 10000 \
     --gpu-memory-utilization 0.6 \
@@ -64,7 +65,7 @@ benchmark() {

   CUDA_VISIBLE_DEVICES=1 python3 \
     -m vllm.entrypoints.openai.api_server \
-    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --model $model \
     --port 8200 \
     --max-model-len 10000 \
     --gpu-memory-utilization 0.6 \
@@ -87,7 +88,7 @@ benchmark() {
     --port 8100 \
     --save-result \
     --result-dir $results_folder \
-    --result-filename disagg_prefill_2xtp4.json \
+    --result-filename disagg_prefill_tp1.json \
     --request-rate "inf"


@@ -105,7 +106,7 @@ benchmark() {
     --port 8200 \
     --save-result \
     --result-dir $results_folder \
-    --result-filename disagg_prefill_2xtp4.json \
+    --result-filename disagg_prefill_tp1_overhead.json \
     --request-rate "$qps"
   kill_gpu_processes

@@ -118,7 +119,7 @@ main() {
   (which jq) || (apt-get -y install jq)
   (which socat) || (apt-get -y install socat)

-  pip install quart httpx
+  pip install quart httpx datasets

   cd "$(dirname "$0")"

benchmarks/disagg_benchmarks/disagg_performance_benchmark.sh (+6 -7)

@@ -1,13 +1,12 @@
 #!/bin/bash

-# Requirement: 8x H100 GPUs.
+# Requirement: 2x GPUs.


-# Model: neuralmagic/Meta-Llama-3-70B-Instruct-FP8-KV
-# Query: 2048 input tokens, 11 output tokens, QPS 4, 500 requests
-# Resource: 8x H100
+# Model: meta-llama/Meta-Llama-3.1-8B-Instruct
+# Query: 1024 input tokens, 6 output tokens, QPS 2/4/6/8, 100 requests
+# Resource: 2x GPU
 # Approaches:
-# 1. Chunked prefill: 1 vllm instance with tp=8
 # 2. Chunked prefill: 2 vllm instance with tp=4, equivalent to 1 tp=4 instance with QPS 4
 # 3. Disaggregated prefill: 1 prefilling instance and 1 decoding instance
 #    Prefilling instance: max_output_token=1
@@ -114,7 +113,6 @@ benchmark() {
     --request-rate "$qps"

   sleep 2
-
 }

@@ -123,8 +121,9 @@ main() {
   (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
   (which jq) || (apt-get -y install jq)
   (which socat) || (apt-get -y install socat)
+  (which lsof) || (apt-get -y install lsof)

-  pip install quart httpx matplotlib aiohttp
+  pip install quart httpx matplotlib aiohttp datasets

   cd "$(dirname "$0")"

docs/requirements-docs.txt (+1 -1)

@@ -1,7 +1,7 @@
 sphinx==6.2.1
 sphinx-book-theme==1.0.1
 sphinx-copybutton==0.5.2
-myst-parser==2.0.0
+myst-parser==3.0.1
 sphinx-argparse==0.4.0
 msgspec
 cloudpickle
(new file, +102)

@@ -0,0 +1,102 @@
+(apc)=
+
+# Introduction
+
+## What is Automatic Prefix Caching
+
+Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
+
+```{note}
+Technical details on how vLLM implements APC are on the next page.
+```
+
+## Enabling APC in vLLM
+
+Set `enable_prefix_caching=True` in the vLLM engine to enable APC. Here is an example:
+
+```python
+import time
+from vllm import LLM, SamplingParams
+
+
+# A prompt containing a large markdown table. The table is randomly generated by GPT-4.
+LONG_PROMPT = "You are a helpful assistant that recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n" + """
+| ID | Name | Age | Occupation | Country | Email | Phone Number | Address |
+|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
+| 1 | John Doe | 29 | Engineer | USA | [email protected] | 555-1234 | 123 Elm St, Springfield, IL |
+| 2 | Jane Smith | 34 | Doctor | Canada | [email protected] | 555-5678 | 456 Oak St, Toronto, ON |
+| 3 | Alice Johnson | 27 | Teacher | UK | [email protected] | 555-8765 | 789 Pine St, London, UK |
+| 4 | Bob Brown | 45 | Artist | Australia | [email protected] | 555-4321 | 321 Maple St, Sydney, NSW |
+| 5 | Carol White | 31 | Scientist | New Zealand | [email protected] | 555-6789 | 654 Birch St, Wellington, NZ |
+| 6 | Dave Green | 28 | Lawyer | Ireland | [email protected] | 555-3456 | 987 Cedar St, Dublin, IE |
+| 7 | Emma Black | 40 | Musician | USA | [email protected] | 555-1111 | 246 Ash St, New York, NY |
+| 8 | Frank Blue | 37 | Chef | Canada | [email protected] | 555-2222 | 135 Spruce St, Vancouver, BC |
+| 9 | Grace Yellow | 50 | Engineer | UK | [email protected] | 555-3333 | 864 Fir St, Manchester, UK |
+| 10 | Henry Violet | 32 | Artist | Australia | [email protected] | 555-4444 | 753 Willow St, Melbourne, VIC|
+| 11 | Irene Orange | 26 | Scientist | New Zealand | [email protected] | 555-5555 | 912 Poplar St, Auckland, NZ |
+| 12 | Jack Indigo | 38 | Teacher | Ireland | [email protected] | 555-6666 | 159 Elm St, Cork, IE |
+| 13 | Karen Red | 41 | Lawyer | USA | [email protected] | 555-7777 | 357 Cedar St, Boston, MA |
+| 14 | Leo Brown | 30 | Chef | Canada | [email protected] | 555-8888 | 246 Oak St, Calgary, AB |
+| 15 | Mia Green | 33 | Musician | UK | [email protected] | 555-9999 | 975 Pine St, Edinburgh, UK |
+| 16 | Noah Yellow | 29 | Doctor | Australia | [email protected] | 555-0000 | 864 Birch St, Brisbane, QLD |
+| 17 | Olivia Blue | 35 | Engineer | New Zealand | [email protected] | 555-1212 | 753 Maple St, Hamilton, NZ |
+| 18 | Peter Black | 42 | Artist | Ireland | [email protected] | 555-3434 | 912 Fir St, Limerick, IE |
+| 19 | Quinn White | 28 | Scientist | USA | [email protected] | 555-5656 | 159 Willow St, Seattle, WA |
+| 20 | Rachel Red | 31 | Teacher | Canada | [email protected] | 555-7878 | 357 Poplar St, Ottawa, ON |
+| 21 | Steve Green | 44 | Lawyer | UK | [email protected] | 555-9090 | 753 Elm St, Birmingham, UK |
+| 22 | Tina Blue | 36 | Musician | Australia | [email protected] | 555-1213 | 864 Cedar St, Perth, WA |
+| 23 | Umar Black | 39 | Chef | New Zealand | [email protected] | 555-3435 | 975 Spruce St, Christchurch, NZ|
+| 24 | Victor Yellow | 43 | Engineer | Ireland | [email protected] | 555-5657 | 246 Willow St, Galway, IE |
+| 25 | Wendy Orange | 27 | Artist | USA | [email protected] | 555-7879 | 135 Elm St, Denver, CO |
+| 26 | Xavier Green | 34 | Scientist | Canada | [email protected] | 555-9091 | 357 Oak St, Montreal, QC |
+| 27 | Yara Red | 41 | Teacher | UK | [email protected] | 555-1214 | 975 Pine St, Leeds, UK |
+| 28 | Zack Blue | 30 | Lawyer | Australia | [email protected] | 555-3436 | 135 Birch St, Adelaide, SA |
+| 29 | Amy White | 33 | Musician | New Zealand | [email protected] | 555-5658 | 159 Maple St, Wellington, NZ |
+| 30 | Ben Black | 38 | Chef | Ireland | [email protected] | 555-7870 | 246 Fir St, Waterford, IE |
+"""
+
+
+def get_generation_time(llm, sampling_params, prompts):
+    # time the generation
+    start_time = time.time()
+    output = llm.generate(prompts, sampling_params=sampling_params)
+    end_time = time.time()
+    # print the output and generation time
+    print(f"Output: {output[0].outputs[0].text}")
+    print(f"Generation time: {end_time - start_time} seconds.")
+
+
+# set enable_prefix_caching=True to enable APC
+llm = LLM(
+    model='lmsys/longchat-13b-16k',
+    enable_prefix_caching=True
+)
+
+sampling_params = SamplingParams(temperature=0, max_tokens=100)
+
+# Querying the age of John Doe
+get_generation_time(
+    llm,
+    sampling_params,
+    LONG_PROMPT + "Question: what is the age of John Doe? Your answer: The age of John Doe is ",
+)
+
+# Querying the age of Zack Blue
+# This query will be faster since vllm avoids computing the KV cache of LONG_PROMPT again.
+get_generation_time(
+    llm,
+    sampling_params,
+    LONG_PROMPT + "Question: what is the age of Zack Blue? Your answer: The age of Zack Blue is ",
+)
+```
+
+## Example workloads
+
+We describe two example workloads where APC can provide a huge performance benefit:
+
+- Long document query, where the user repeatedly queries the same long document (e.g. a software manual or annual report) with different queries. In this case, instead of processing the long document again and again, APC allows vLLM to process this long document *only once*, and all future requests can avoid recomputing it by reusing its KV cache. This allows vLLM to serve future requests with much higher throughput and much lower latency.
+- Multi-round conversation, where the user may chat with the application multiple times in the same chat session. In this case, instead of processing the whole chat history again and again, APC allows vLLM to reuse the processing results of the chat history across all future rounds of conversation, allowing vLLM to serve future requests with much higher throughput and much lower latency.
+
+## Limits
+
+APC in general does not reduce the performance of vLLM. That said, APC only reduces the time of processing the queries (the prefilling phase) and does not reduce the time of generating new tokens (the decoding phase). So APC does not bring a performance gain when vLLM spends most of its time generating answers to the queries (e.g. when the answers are long), or when new queries do not share a prefix with any existing query (so the computation cannot be reused).
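To make the multi-round conversation workload above concrete, here is a minimal sketch (not part of the added docs) of a chat loop that benefits from APC: each round resends the full history, and the shared prefix is served from the KV cache.

```python
from vllm import LLM, SamplingParams

# Same model and flag as the docs example above.
llm = LLM(model='lmsys/longchat-13b-16k', enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0, max_tokens=64)

history = "You are a helpful assistant.\n"
for question in ["What is vLLM?", "How does it schedule requests?"]:
    history += f"User: {question}\nAssistant:"
    # Only the newest turn needs fresh prefill work; the cached history
    # prefix is reused. Decoding time is unaffected (see Limits above).
    answer = llm.generate([history], sampling_params=sampling_params)[0].outputs[0].text
    history += answer + "\n"
    print(answer)
```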
