[V1] PR 1/N for v1 sample and prompt logprobs support #9880
Thanks @afeldman-nm! Did you have a chance to measure the performance penalty of enabling logprobs?
vllm/v1/engine/detokenizer.py
@@ -91,25 +102,34 @@ def from_new_request(
prompt_token_ids=request.prompt_token_ids,
tokenizer=tokenizer,
stop_buffer_length=stop_buffer_length,
)
logprobs=[] if do_logprobs else None,
Instead of making `logprobs: Optional[List[SampleLogprobs]]` and using `None` to handle the case where there are no logprobs, you could instead make `logprobs: List[SampleLogprobs]` and treat an empty list as no logprobs.
This will simplify the code in the detokenizer, since you can then avoid the `None` checking.
In principle I see the value of this. However, I attempted to emulate the behavior expected by the v0 engine unit tests (since there is no reason for the interface spec to change between v0 and v1).
At the link below, I highlighted two lines from a v0 logprobs unit test. The test implicitly configures the number of logprobs as `None` (by leaving logprobs unspecified in the `SamplingParams`), and the test expects `results_logprobs_none[i].outputs[0].logprobs` in the request output to be `None`:
vllm/tests/samplers/test_logprobs.py
Lines 170 to 172 in 9a99273
for i in range(len(results_logprobs_none)):
    assert results_logprobs_none[i].outputs[0].logprobs is None
    assert results_logprobs_none[i].outputs[0].cumulative_logprob is None
This is why I used slightly more complex logic to make the request output logprobs be `None` when the user does not request logprobs.
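The trade-off discussed in this thread can be illustrated with a minimal sketch. The names `init_logprobs` and `append_step`, and the `SampleLogprobs` alias, are hypothetical stand-ins for illustration only; they are not the detokenizer's actual code. The sketch assumes a `do_logprobs` flag like the one in the diff above.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the per-token logprob mapping (token id -> logprob).
SampleLogprobs = Dict[int, float]


def init_logprobs(do_logprobs: bool) -> Optional[List[SampleLogprobs]]:
    """Return an empty accumulator when logprobs were requested, else None."""
    return [] if do_logprobs else None


def append_step(acc: Optional[List[SampleLogprobs]],
                step: SampleLogprobs) -> Optional[List[SampleLogprobs]]:
    """Record one decoding step; a None accumulator stays None."""
    if acc is not None:
        acc.append(step)
    return acc


# Request without logprobs: the output field stays None, as the v0 test expects.
assert append_step(init_logprobs(False), {42: -0.1}) is None
# Request with logprobs: steps accumulate in order.
assert append_step(init_logprobs(True), {42: -0.1}) == [{42: -0.1}]
```

The `None` check costs one branch per step, but it keeps "logprobs not requested" distinguishable from "requested, nothing generated yet", which an empty list alone cannot express.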
Just to confirm: is this PR ready to review? If so please remove "WIP" from the title to be less confusing.
Initial benchmark results
The vLLM server with V1 engine is launched with the following command:
The vLLM server with V0 engine is launched with the following command:
Comparing V1 engine to V0 engine with `logprobs=None`
Metric | Main Branch V0 (logprobs=None) | Main Branch V1 (logprobs=None) |
---|---|---|
Successful requests | 659 | 660 |
Benchmark duration (s) | 22.24 | 23.71 |
Total input tokens | 85032 | 84700 |
Total generated tokens | 131399 | 131550 |
Request throughput (req/s) | 29.63 | 27.84 |
Output token throughput (tok/s) | 5908.23 | 5548.69 |
Total Token throughput (tok/s) | 9731.61 | 9121.28 |
Mean TTFT (ms) | 8791.65 | 7430.40 |
Median TTFT (ms) | 6851.74 | 7223.75 |
P99 TTFT (ms) | 16865.68 | 11917.53 |
Mean TPOT (ms) | 49.30 | 60.85 |
Median TPOT (ms) | 41.68 | 56.23 |
P99 TPOT (ms) | 211.19 | 141.15 |
Mean ITL (ms) | 32.44 | 43.55 |
Median ITL (ms) | 20.71 | 21.85 |
P99 ITL (ms) | 344.67 | 430.56 |
Observations (V1 engine relative to V0 engine)
- 6% lower total token throughput
- 30% lower P99 TTFT
- 33% lower P99 TPOT
- 25% higher P99 ITL
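The percentages above follow from the table as a signed relative change against the baseline column. A minimal sketch of that arithmetic, using a hypothetical helper `relative_change` and the four values copied from the table:

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Signed percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


# (V0 value, V1 value) pairs copied from the table above.
metrics = {
    "Total token throughput (tok/s)": (9731.61, 9121.28),
    "P99 TTFT (ms)": (16865.68, 11917.53),
    "P99 TPOT (ms)": (211.19, 141.15),
    "P99 ITL (ms)": (344.67, 430.56),
}

for name, (v0, v1) in metrics.items():
    # Negative means V1 is lower than V0; positive means V1 is higher.
    print(f"{name}: {relative_change(v0, v1):+.1f}%")
```

For throughput a negative change is a regression, while for the latency metrics a negative change is an improvement.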
Comparing `v1_logprobs` branch to `main` with `logprobs=None` (V1 engine)
The serving benchmark client was launched with the following command:
python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json
The benchmark results were then compared between the `v1_logprobs` and `main` branches. `VLLM_USE_V1=1` was used to enable the vLLM v1 engine in both cases.
Metric | main | v1_logprobs |
---|---|---|
Successful requests | 660 | 660 |
Benchmark duration (s) | 23.71 | 22.30 |
Total input tokens | 84700 | 84700 |
Total generated tokens | 131550 | 131550 |
Request throughput (req/s) | 27.84 | 29.60 |
Output token throughput (tok/s) | 5548.69 | 5900.08 |
Total Token throughput (tok/s) | 9121.28 | 9698.92 |
Mean TTFT (ms) | 7430.40 | 6026.84 |
Median TTFT (ms) | 7223.75 | 5885.70 |
P99 TTFT (ms) | 11917.53 | 10310.81 |
Mean TPOT (ms) | 60.85 | 58.68 |
Median TPOT (ms) | 56.23 | 55.20 |
P99 TPOT (ms) | 141.15 | 187.53 |
Mean ITL (ms) | 43.55 | 41.98 |
Median ITL (ms) | 21.85 | 22.46 |
P99 ITL (ms) | 430.56 | 392.53 |
Observations
With `logprobs=None`, the `v1_logprobs` branch performs similarly to or better than the `main` branch on most metrics. The exception is P99 time-per-output-token (TPOT), which is about 33% worse (187.53 ms vs. 141.15 ms); this may be run-to-run variation.
Comparing `logprobs=5` to `logprobs=None` with the `v1_logprobs` branch
Using the `v1_logprobs` branch, two scenarios were benchmarked:
logprobs=None
python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json
logprobs=5
python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json --logprobs 5
Metric | logprobs=None | logprobs=5 |
---|---|---|
Successful requests | 660 | 662 |
Benchmark duration (s) | 22.30 | 26.94 |
Total input tokens | 84700 | 84124 |
Total generated tokens | 131550 | 132169 |
Request throughput (req/s) | 29.60 | 24.57 |
Output token throughput (tok/s) | 5900.08 | 4905.59 |
Total Token throughput (tok/s) | 9698.92 | 8027.94 |
Mean TTFT (ms) | 6026.84 | 6687.43 |
Median TTFT (ms) | 5885.70 | 5451.56 |
P99 TTFT (ms) | 10310.81 | 17033.30 |
Mean TPOT (ms) | 58.68 | 230.62 |
Median TPOT (ms) | 55.20 | 92.15 |
P99 TPOT (ms) | 187.53 | 1325.77 |
Mean ITL (ms) | 41.98 | 70.19 |
Median ITL (ms) | 22.46 | 23.57 |
P99 ITL (ms) | 392.53 | 1159.93 |
Observations
With `logprobs=5` on the `v1_logprobs` branch, compared to `logprobs=None` on the same branch:
- ~17% decrease in all throughput metrics
- 65% higher P99 TTFT
- 7x higher P99 TPOT
- 3x higher P99 ITL
Comparing the `v1_logprobs` branch with V1 engine to the `main` branch with V0 engine, `logprobs=5`
Both scenarios used the same benchmark launch command:
python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json --logprobs 5
Since the V1 engine on the `main` branch does not support logprobs, the V0 engine's logprobs support was used as a baseline. The V1 and V0 engines were selected with `VLLM_USE_V1=1` and `VLLM_USE_V1=0`, respectively.
Metric | Main Branch V0 | v1_logprobs Branch V1 |
---|---|---|
Successful requests | 659 | 662 |
Benchmark duration (s) | 36.93 | 26.94 |
Total input tokens | 85032 | 84124 |
Total generated tokens | 131412 | 132169 |
Request throughput (req/s) | 17.85 | 24.57 |
Output token throughput (tok/s) | 3558.73 | 4905.59 |
Total Token throughput (tok/s) | 5861.46 | 8027.94 |
Mean TTFT (ms) | 9922.62 | 6687.43 |
Median TTFT (ms) | 6618.40 | 5451.56 |
P99 TTFT (ms) | 23072.55 | 17033.30 |
Mean TPOT (ms) | 103.29 | 230.62 |
Median TPOT (ms) | 105.09 | 92.15 |
P99 TPOT (ms) | 267.94 | 1325.77 |
Mean ITL (ms) | 75.31 | 70.19 |
Median ITL (ms) | 42.60 | 23.57 |
P99 ITL (ms) | 644.76 | 1159.93 |
Observations (`v1_logprobs` branch with V1 engine relative to `main` branch with V0 engine)
- 37% higher throughput
- 26% lower P99 TTFT
- 5x higher P99 TPOT
- 80% higher P99 ITL
Overall analysis of benchmark results
- The addition of logprobs support in the v1 engine does not appear to have significantly degraded performance in the `logprobs=None` scenario.
- Enabling `logprobs=5` in the v1 engine degrades throughput by about 17% and significantly increases P99 TTFT, TPOT, and ITL.
- Across most metrics, the V0 and V1 engines with `logprobs=None` seem to be within 30% of each other. However, with `logprobs=5`, V1 P99 ITL is 80% higher than V0 and V1 P99 TPOT is 5x higher than V0.
Next steps
The default behavior of the V1 engine is that if logprobs are enabled for any request in the batch, then logprobs are computed for all requests. If the maximum number of logprobs requested in the batch is 5, then 5 logprobs are computed for every request in the batch. I hypothesize that computing logprobs only for the requests that require them, and only as many as each request asks for, could reduce the performance difference between V1 and V0 when logprobs > 0.
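To make the two strategies concrete, here is a minimal pure-Python sketch. It is an illustration only, not the engine's code: `batch_wide` and `per_request` are hypothetical names, logits are plain lists rather than GPU tensors, and the sketch says nothing about whether the per-request variant is actually faster on real hardware.

```python
import math
from typing import Dict, List, Optional


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def top_k_logprobs(logits: List[float], k: int) -> Dict[int, float]:
    """Map the k most likely token ids to their logprobs."""
    lp = log_softmax(logits)
    top = sorted(range(len(lp)), key=lambda i: lp[i], reverse=True)[:k]
    return {i: lp[i] for i in top}


def batch_wide(batch: List[List[float]],
               requested: List[Optional[int]]) -> List[Optional[Dict[int, float]]]:
    """Behavior described above: max-k logprobs for every row if any row wants them."""
    ks = [k for k in requested if k is not None]
    if not ks:
        return [None] * len(batch)
    k_max = max(ks)
    return [top_k_logprobs(row, k_max) for row in batch]


def per_request(batch: List[List[float]],
                requested: List[Optional[int]]) -> List[Optional[Dict[int, float]]]:
    """Hypothesized alternative: skip rows with no request, use each row's own k."""
    return [None if k is None else top_k_logprobs(row, k)
            for row, k in zip(batch, requested)]
```

With `requested = [None, 2, 3]`, `batch_wide` computes three logprobs for all three rows, while `per_request` skips the first row and computes two and three entries for the others.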
I don't think you have CUDAGraphs enabled in your V1 benchmarks, so I would not compare against V0 without that turned on as it makes a big difference.
Signed-off-by: Andrew Feldman <[email protected]>
This PR adds support for sample logprobs and prompt logprobs to vLLM v1.
New behavior:
- `Scheduler.update_from_output()` pythonizes the sample logprobs and prompt logprobs tensors into lists of dicts. This method ensures that each sample logprob dict has the appropriate number of keys based on how many sample logprobs the user requested (the same goes for prompt logprobs).
- PR no. 1 (this PR) adds the infrastructure for logprobs support with limited unit tests.
- PR no. 2 (#11910)
- PR no. 3 will port `test_completion.py` logprobs tests to v1, make any adjustments required to make these tests pass, and fix a compatibility issue with Mistral tokenizer.
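The "pythonization" step described for `Scheduler.update_from_output()` can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `pythonize_logprobs` is a hypothetical name, the nested lists stand in for tensors already converted with something like `.tolist()`, and the real method may include extra keys (for example, the sampled token) that this sketch omits.

```python
from typing import Dict, List, Optional


def pythonize_logprobs(
    topk_token_ids: List[List[int]],
    topk_logprobs: List[List[float]],
    num_requested: List[Optional[int]],
) -> List[Optional[Dict[int, float]]]:
    """Turn per-request top-k rows into dicts trimmed to each request's k.

    Rows are assumed to be sorted from most to least likely token.
    A request that did not ask for logprobs (k is None) yields None.
    """
    out: List[Optional[Dict[int, float]]] = []
    for ids_row, lp_row, k in zip(topk_token_ids, topk_logprobs, num_requested):
        if k is None:
            out.append(None)
            continue
        # Keep only as many entries as this request asked for.
        out.append(dict(zip(ids_row[:k], lp_row[:k])))
    return out


ids = [[5, 9, 2], [7, 1, 3]]
lps = [[-0.1, -1.2, -2.5], [-0.3, -0.9, -3.0]]
result = pythonize_logprobs(ids, lps, [2, None])
assert result == [{5: -0.1, 9: -1.2}, None]
```

Trimming per request matters because the batch computes the same number of logprobs for every row, while each user may have asked for fewer.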