Commit: Merge branch 'vllm-project:main' into main

Showing 178 changed files with 5,785 additions and 1,967 deletions.
.buildkite/lm-eval-harness/configs/DeepSeek-V2-Lite-Chat.yaml (new file, 11 additions)
```yaml
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m deepseek-ai/DeepSeek-V2-Lite-Chat -b "auto" -l 1000 -f 5 -t 2
model_name: "deepseek-ai/DeepSeek-V2-Lite-Chat"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.671
  - name: "exact_match,flexible-extract"
    value: 0.664
limit: 1000
num_fewshot: 5
```
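Configs like this pin expected GSM8K scores for a model so CI can flag regressions. As an illustration only (not the harness's actual comparison logic), a hypothetical tolerance check over a parsed config might look like this; the `within_tolerance` helper, the 5% tolerance, and the measured numbers are all made up for the sketch:

```python
# Hypothetical baseline check. The config dict mirrors the YAML above after
# parsing (e.g. with yaml.safe_load); it is inlined here so the sketch has
# no third-party dependencies.
config = {
    "model_name": "deepseek-ai/DeepSeek-V2-Lite-Chat",
    "tasks": [
        {
            "name": "gsm8k",
            "metrics": [
                {"name": "exact_match,strict-match", "value": 0.671},
                {"name": "exact_match,flexible-extract", "value": 0.664},
            ],
        }
    ],
    "limit": 1000,
    "num_fewshot": 5,
}

def within_tolerance(expected: float, measured: float, rtol: float = 0.05) -> bool:
    """True if a measured score is within a relative tolerance of the baseline."""
    return abs(measured - expected) <= rtol * expected

# Scores from a fresh lm-eval run (illustrative numbers, not real results).
measured = {
    "exact_match,strict-match": 0.668,
    "exact_match,flexible-extract": 0.670,
}

all_ok = all(
    within_tolerance(metric["value"], measured[metric["name"]])
    for task in config["tasks"]
    for metric in task["metrics"]
)
```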
.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml (4 additions, 4 deletions)
```diff
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 250 -f 5 -t 1
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test"
 tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.752
+    value: 0.755
   - name: "exact_match,flexible-extract"
-    value: 0.752
+    value: 0.755
-limit: 250
+limit: 1000
 num_fewshot: 5
```
.buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml (new file, 11 additions)
```yaml
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.593
  - name: "exact_match,flexible-extract"
    value: 0.588
limit: 1000
num_fewshot: 5
```
.buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-W8A16-compressed-tensors.yaml (new file, 11 additions)
```yaml
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise -b "auto" -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.595
  - name: "exact_match,flexible-extract"
    value: 0.582
limit: 1000
num_fewshot: 5
```
(file name not shown in this view; 1 addition)

```diff
 Meta-Llama-3-70B-Instruct.yaml
 Mixtral-8x7B-Instruct-v0.1.yaml
 Qwen2-57B-A14-Instruct.yaml
+DeepSeek-V2-Lite-Chat.yaml
```
(file name not shown in this view; benchmark suite README, 1 addition)

```markdown
# vLLM benchmark suite

## Introduction

This directory contains the performance benchmarking CI for vllm.
```
This file was deleted.
(file name not shown in this view; nightly benchmark README, new file, 45 additions)
```markdown
# Nightly benchmark

The main goal of this benchmarking is twofold:

- Performance clarity: provide clarity on which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
- Reproducibility: one can run the exact same set of benchmarking commands inside the exact same docker image by following the reproducing instructions in [reproduce.md]().

## Docker images

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:

- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1

<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->

## Hardware

One AWS node with 8x NVIDIA A100 GPUs.

## Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

- Input length: randomly sample 500 prompts from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of these 500 prompts.
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Average QPS (queries per second): 4 for the small model (llama-3 8B) and 2 for the other two models. For each QPS, the arrival time of each query is determined by a random Poisson process (with a fixed random seed).
- Evaluation metrics: throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better).

<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->
```
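The fixed-seed Poisson arrival process described in the workload can be sketched as follows. Inter-arrival gaps of a Poisson process with rate `qps` are exponentially distributed with mean `1/qps`; the function name and the seed value here are illustrative, not taken from the benchmark scripts:

```python
import random

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Sample request arrival times from a rate-`qps` Poisson process.

    Drawing each inter-arrival gap from an exponential distribution with
    mean 1/qps and accumulating the gaps yields Poisson arrivals; fixing
    the seed makes the schedule reproducible across runs.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(num_requests):
        t += rng.expovariate(qps)
        arrivals.append(t)
    return arrivals

# 500 requests at an average of 4 queries per second, reproducible schedule.
times = poisson_arrival_times(500, qps=4.0, seed=42)
```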
```markdown
## Plots

In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. A value of 0 means that the corresponding benchmark crashed.

<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >

## Results

{nightly_results_benchmarking_table}
```
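For reference, the standard error of the mean used for the error bars is the sample standard deviation divided by the square root of the sample size; a minimal stdlib sketch (the function name and the sample numbers are ours, not from the benchmark code):

```python
import statistics

def sem(samples: list[float]) -> float:
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(samples) / len(samples) ** 0.5

# e.g. five throughput measurements from repeated runs (made-up numbers)
spread = sem([10.1, 9.8, 10.4, 10.0, 9.7])
```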
(file name not shown in this view; nightly benchmark pipeline, new file, 120 additions)
```yaml
common_pod_spec: &common_pod_spec
  priorityClassName: perf-benchmark
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  volumes:
  - name: devshm
    emptyDir:
      medium: Memory
  - name: hf-cache
    hostPath:
      path: /root/.cache/huggingface
      type: Directory

common_container_settings: &common_container_settings
  command:
  - bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
  resources:
    limits:
      nvidia.com/gpu: 8
  volumeMounts:
  - name: devshm
    mountPath: /dev/shm
  - name: hf-cache
    mountPath: /root/.cache/huggingface
  env:
  - name: VLLM_USAGE_SOURCE
    value: ci-test
  - name: HF_HOME
    value: /root/.cache/huggingface
  - name: VLLM_SOURCE_CODE_LOC
    value: /workspace/build/buildkite/vllm/performance-benchmark
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token-secret
        key: token

steps:
- block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
- label: "A100 trt benchmark"
  priority: 100
  agents:
    queue: A100
  plugins:
  - kubernetes:
      podSpec:
        <<: *common_pod_spec
        containers:
        - image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
          <<: *common_container_settings

- label: "A100 lmdeploy benchmark"
  priority: 100
  agents:
    queue: A100
  plugins:
  - kubernetes:
      podSpec:
        <<: *common_pod_spec
        containers:
        - image: openmmlab/lmdeploy:v0.5.0
          <<: *common_container_settings

- label: "A100 vllm benchmark"
  priority: 100
  agents:
    queue: A100
  plugins:
  - kubernetes:
      podSpec:
        <<: *common_pod_spec
        containers:
        - image: vllm/vllm-openai:latest
          <<: *common_container_settings

- label: "A100 tgi benchmark"
  priority: 100
  agents:
    queue: A100
  plugins:
  - kubernetes:
      podSpec:
        <<: *common_pod_spec
        containers:
        - image: ghcr.io/huggingface/text-generation-inference:2.1
          <<: *common_container_settings

- wait

- label: "Plot"
  priority: 100
  agents:
    queue: A100
  plugins:
  - kubernetes:
      podSpec:
        <<: *common_pod_spec
        containers:
        - image: vllm/vllm-openai:v0.5.0.post1
          command:
          - bash .buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
          - name: devshm
            mountPath: /dev/shm
          env:
          - name: VLLM_USAGE_SOURCE
            value: ci-test
          - name: VLLM_SOURCE_CODE_LOC
            value: /workspace/build/buildkite/vllm/performance-benchmark
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token-secret
                key: token

- wait
```
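The pipeline leans on YAML anchors and merge keys (`&name`, `<<: *name`) to share the pod spec and container settings across all five steps. A merge key splices the anchored mapping's entries into the current mapping, with explicitly written keys taking precedence; a rough Python analogue using dict unpacking (the dict contents echo a subset of the YAML above):

```python
# Rough analogue of YAML merge keys with dict unpacking. Shared settings are
# defined once, like the `&common_container_settings` anchor above.
common_container_settings = {
    "command": ["bash .buildkite/nightly-benchmarks/run-nightly-suite.sh"],
    "resources": {"limits": {"nvidia.com/gpu": 8}},
}

# Each benchmark container merges the shared settings under its own image.
# Keys written after the unpack would override merged ones, matching YAML
# merge-key semantics where explicit keys beat `<<:`-merged keys.
trt_container = {
    **common_container_settings,  # like `<<: *common_container_settings`
    "image": "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3",
}
```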