 
 ## Introduction
 
-This directory contains the performance benchmarking CI for vllm.
-The goal is to help developers know the impact of their PRs on the performance of vllm.
+This directory contains two sets of benchmarks for vllm:
+- Performance benchmark: benchmarks vllm's performance under various workloads, so that **developers** can see whether their PR improves or degrades vllm's performance.
+- Nightly benchmark: compares vllm's performance against alternatives (tgi, trt-llm and lmdeploy), so that **the public** knows when to choose vllm.
 
-This benchmark will be *triggered* upon:
+
+See the [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and the [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for the latest nightly benchmark results.
+
+
+## Performance benchmark quick overview
+
+**Benchmarking Coverage**: latency, throughput and fixed-QPS serving on A100 (support for FP8 benchmarks on H100 is coming!), with different models.
+
+**Benchmarking Duration**: about 1hr.
+
+**For benchmarking developers**: please try your best to constrain the duration of benchmarking to about 1 hr so that it won't take forever to run.
+
+
+## Nightly benchmark quick overview
+
+**Benchmarking Coverage**: fixed-QPS serving on A100 (support for FP8 benchmarks on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
+
+**Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.
+
+**Benchmarking Duration**: about 3.5hrs.
+
+
+## Trigger the benchmark
+
+The performance benchmark will be triggered when:
 - A PR is merged into vllm.
 - A commit is pushed to a PR that carries the `perf-benchmarks` label.
 
-**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for more GPUs is comming later), with different models.
+The nightly benchmark will be triggered when:
+- A commit is pushed to a PR that carries the `nightly-benchmarks` label.
 
-**Benchmarking Duration**: about 1hr.
 
-**For benchmarking developers**: please try your best to constraint the duration of benchmarking to less than 1.5 hr so that it won't take forever to run.
 
 
-## Configuring the workload
+## Performance benchmark details
 
-The benchmarking workload contains three parts:
-- Latency tests in `latency-tests.json`.
-- Throughput tests in `throughput-tests.json`.
-- Serving tests in `serving-tests.json`.
+See [descriptions.md](tests/descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json` and `tests/serving-tests.json` to configure the test cases.
 
-See [descriptions.md](tests/descriptions.md) for detailed descriptions.
 
-### Latency test
+#### Latency test
 
 Here is an example of one test inside `latency-tests.json`:
 
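+A minimal sketch of what one entry might look like, assuming the fields map onto `benchmark_latency.py` command-line flags (the test name, field names and values below are illustrative, not the actual file contents):
+
+```json
+[
+  {
+    "test_name": "latency_llama8B_tp1",
+    "parameters": {
+      "model": "meta-llama/Meta-Llama-3-8B",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "num_iters_warmup": 5,
+      "num_iters": 15
+    }
+  }
+]
+```
+
+Consult the real `tests/latency-tests.json` for the exact schema and the exact flags that `benchmark_latency.py` accepts.
+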
@@ -54,12 +75,12 @@ Note that the performance numbers are highly sensitive to the value of the parameters
 WARNING: The benchmarking script will save json results by itself, so please do not configure the `--output-json` parameter in the json file.
 
 
-### Throughput test
+#### Throughput test
 The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are fed forward to `benchmark_throughput.py`.
 
 The results of this test are also stable, but a slight change to the parameter values might vary the performance numbers by a lot.
 
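+As a rough, hypothetical sketch (again assuming the fields map onto `benchmark_throughput.py` flags such as the dataset path and the number of prompts; none of the values below come from the actual file), an entry might look like:
+
+```json
+{
+  "test_name": "throughput_llama8B_tp1",
+  "parameters": {
+    "model": "meta-llama/Meta-Llama-3-8B",
+    "tensor_parallel_size": 1,
+    "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
+    "num_prompts": 200,
+    "backend": "vllm"
+  }
+}
+```
+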
-### Serving test
+#### Serving test
 We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
 
 ```
@@ -96,9 +117,36 @@ The number of this test is less stable compared to the delay and latency benchmarks
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
 
-## Visualizing the results
+#### Visualizing the results
 The `convert-results-json-to-markdown.py` script helps you put the benchmarking results into a markdown table, by formatting [descriptions.md](tests/descriptions.md) with the real benchmarking results.
 You can find the results presented as a table on the `buildkite/performance-benchmark` job page.
 If you do not see the table, please wait until the benchmark finishes running.
 The json version of the table (together with the json version of the benchmark results) will also be attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking job.
+
+
+
+## Nightly test details
+
+See [nightly-descriptions.md](nightly-descriptions.md) for a detailed description of the test workload, the models and the docker containers used for benchmarking the other llm engines.
+
+
+#### Workflow
+
+- The [nightly-pipeline.yaml](nightly-pipeline.yaml) file specifies the docker containers for the different LLM serving engines.
+- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which probes the serving engine of the current container.
+- `run-nightly-suite.sh` then redirects the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
+- Finally, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and upload the results to buildkite.
+
+#### Nightly tests
+
+In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for the benchmarking commands, together with the benchmarking test cases. The format is very similar to the performance benchmark.
+
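+As a purely hypothetical sketch of such a test case (the field names below, e.g. `qps_list`, `common_parameters` and the per-engine parameter blocks, are assumptions rather than the actual schema), an entry might look like:
+
+```json
+{
+  "test_name": "llama8B_tp1_sharegpt",
+  "qps_list": [4, 8, 16, "inf"],
+  "common_parameters": {
+    "model": "meta-llama/Meta-Llama-3-8B",
+    "tp": 1,
+    "dataset_name": "sharegpt",
+    "num_prompts": 500
+  },
+  "vllm_server_parameters": {},
+  "vllm_client_parameters": {}
+}
+```
+
+Check the real [nightly-tests.json](tests/nightly-tests.json) for the exact fields and the parameters each serving engine expects.
+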
+#### Docker containers
+
+The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
+
+WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
+
+WARNING: bumping `trt-llm` to the latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).