
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding #9291

Merged (13 commits) on Nov 7, 2024

Conversation

@NickLucche (Contributor) commented Oct 11, 2024

Hey, this PR implements #5016.

The main idea is to reuse the current speculative decoding workflow and integrate it with mixed prefill-decode batches.
In particular, we can run the batched prefills and decodes together through the scorer (with the usual prefill|decode layout supported by the backend), while the proposer only syncs its KV cache on the prefills.

[Figure: proposed workflow for combining chunked prefill with speculative decoding]

The current attention kernel implementation still doesn't make full use of the prefill|decode layout, but once the MQA integration is finalized we can get an easy speedup by running the whole batch in a single forward pass.
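
For intuition, here is a minimal, self-contained sketch of that split (illustration only: the Seq class and the random accept/reject below stand in for vLLM's real sequence groups, draft/target runners and rejection sampler, which work differently in detail):

```python
from dataclasses import dataclass, field
import random

@dataclass
class Seq:
    seq_id: int
    is_prefill: bool                      # True while the prompt is still being chunked in
    tokens: list = field(default_factory=list)

def spec_decode_step(batch, k=4):
    prefills = [s for s in batch if s.is_prefill]
    decodes = [s for s in batch if not s.is_prefill]

    # Proposer (draft model): runs only the prefill chunks, just to keep its
    # KV cache in sync, and proposes k speculative tokens per decode sequence.
    draft_kv_synced = [s.seq_id for s in prefills]
    proposals = {s.seq_id: [random.randrange(32000) for _ in range(k)] for s in decodes}

    # Scorer (target model): processes prefills and decodes together in one
    # batch, using the usual prefill|decode layout of the attention backend.
    # Here we simply pretend the target accepts a random prefix of each proposal.
    accepted = {sid: toks[:random.randint(0, k)] for sid, toks in proposals.items()}

    for s in decodes:
        s.tokens.extend(accepted[s.seq_id])
    return draft_kv_synced, accepted

# Example: one chunked-prefill sequence plus two decode sequences in the same batch.
batch = [Seq(0, True), Seq(1, False, [11]), Seq(2, False, [42])]
print(spec_decode_step(batch))
```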

The current implementation on main is already (to some extent) prefill-aware, so I was able to reuse a good chunk of the logic, and the changes are (purposely) not drastic.
On the other hand, one could prioritize optimization further, and I am open to any suggestion on how best to implement the approach, even at the cost of rewriting more parts and making the PR more invasive (i.e. breaking some of the interfaces to avoid duplication).

TODO:

  • benchmark on A/H100
  • expand test coverage with prefill chunking enabled
  • test with the new mqa_scorer; the current implementation was rebased from v0.6.2
  • fix speculative methods requiring return_hidden_states. EDIT: on second thought, I believe this would be better addressed in a separate PR
  • disable_logprobs_during_spec_decoding compatibility

Update:

We add support for chunked prefill plus spec decoding with the workflow depicted above, unless the proposer requires the final hidden states from the target model (MLPSpeculator/Medusa): that case is deferred to a second follow-up PR.

mqa_scorer is set to supersede BatchExpansion* thanks to the great work by @LiuXiaoxuanPKU, so we add support for that scorer directly in this PR!
Incidentally, this means enabling backends with flash_attn_varlen_func to take any mixed prefill-decode batch in a single kernel call (so no more decoupled prefill/decode calls), which should also boost performance under the "vanilla" chunked prefill scheduling policy (no spec).
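
To illustrate what "single kernel call" means here, below is a minimal sketch of the cu_seqlens layout for a mixed batch (one prefill chunk plus two decodes) passed to flash_attn_varlen_func. It assumes the flash-attn package and a CUDA device, and uses contiguous K/V tensors rather than vLLM's paged KV cache, purely for illustration; the sequence lengths are made up.

```python
import torch
from itertools import accumulate
from flash_attn import flash_attn_varlen_func

nheads, headdim = 8, 64
q_lens = [16, 1, 1]       # one prefill chunk (16 new query tokens) + two decodes (1 token each)
kv_lens = [16, 128, 256]  # decode sequences attend over their full cached context

def cu_seqlens(lens):
    # Cumulative offsets delimiting each sequence inside the packed tensor.
    return torch.tensor([0, *accumulate(lens)], dtype=torch.int32, device="cuda")

q = torch.randn(sum(q_lens), nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn(sum(kv_lens), nheads, headdim, dtype=torch.float16, device="cuda")
v = torch.randn_like(k)

# One varlen call covers the whole mixed batch; with causal=True the mask is
# aligned to the end of each KV sequence, so a decode query (length 1) attends
# to all of its cached keys while prefill queries remain causally masked.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens(q_lens), cu_seqlens_k=cu_seqlens(kv_lens),
    max_seqlen_q=max(q_lens), max_seqlen_k=max(kv_lens),
    causal=True,
)
print(out.shape)  # (total_query_tokens, nheads, headdim)
```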

Many thanks to @sroy745 for benchmarking the BatchExpansionTop1Scorer approach here (MQA to follow)!

Update 2:

After reviewing @sroy745's benchmarks: contrary to expectations, fusing the two separate kernel calls into a unified prefill+decode call (a single flash_attn_varlen_func call) did not yield improvements. I reverted the unified-kernel change, but I will keep the commit history here so we can come back to it and investigate further in a separate optimization PR.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@NickLucche marked this pull request as draft on October 11, 2024 17:04
@sroy745 (Collaborator) left a comment

Thanks for the PR. Left some comments. PTAL.

Review threads (resolved): vllm/spec_decode/spec_decode_worker.py, vllm/spec_decode/batch_expansion.py, vllm/config.py, vllm/worker/model_runner.py
@NickLucche force-pushed the chunk-spec-decoding-rebase branch from 49b03ab to 8b88b8a on October 14, 2024 10:41
@arashsadrieh commented Oct 15, 2024

@NickLucche Thanks for the great work; we understand this is WIP, just a small note while you are working on this piece.

We tried this PR with tensor parallelism and found that it throws the following exception when tensor parallelism is enabled:

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /8b/  --speculative_model /1b/  --served-model-name SpeculativeLLM --tensor-parallel-size 4  --max-model-len 34336  --max-num-seqs 128  --enable-prefix-caching  --disable-log-requests --use-v2-block-manager --seed 42 --num_speculative_tokens 5  --spec-decoding-acceptance-method typical_acceptance_sampler  --enable_chunked_prefill

Here is the exception:

Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: 'num_seq_groups', Traceback (most recent call last):
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/executor/multiproc_worker_utils.py", line 224, in _run_worker_process
     output = executor(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/spec_decode/spec_decode_worker.py", line 459, in start_worker_execution_loop
     while self._run_non_driver_rank():
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/spec_decode/spec_decode_worker.py", line 649, in _run_non_driver_rank
     self.proposer_worker.execute_model()
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/worker/worker_base.py", line 308, in execute_model
     inputs = self.prepare_input(execute_model_req)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/worker/worker_base.py", line 298, in prepare_input
     return self._get_worker_input_from_broadcast()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/worker/worker_base.py", line 240, in _get_worker_input_from_broadcast
     worker_input = WorkerInput.from_broadcasted_tensor_dict(broadcast_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/worker/worker_base.py", line 151, in from_broadcasted_tensor_dict
     num_seq_groups=tensor_dict.pop("num_seq_groups"),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 KeyError: 'num_seq_groups'

The following command works normally

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/ec2-user/tengfei_workspace/output/8b-aio-20240923-3/merged/ --speculative_model /home/ec2-user/tengfei_workspace/output/1b-aio-20240923-3/merged/ --served-model-name SpeculativeLLM --tensor-parallel-size 1 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --disable-log-requests --use-v2-block-manager --seed 42 --num_speculative_tokens 5  --spec-decoding-acceptance-method typical_acceptance_sampler --enable_chunked_prefill --tensor-parallel-size 1

Thanks again; we appreciate your work and the vLLM community.

@NickLucche (Contributor, Author) commented Oct 15, 2024

Thanks for testing that, I will look right into it!
It might actually be related to prefix caching, which I haven't taken into account yet (I know there's been some recent work on that too).

@NickLucche (Contributor, Author)

Update on the mqa_scorer integration: enable_chunked_prefill with the changes in this PR appears to work fine with the flash_attn kernel prior to the optimized one introduced in #9298 (i.e. flash_attn_with_kvcache instead of flash_attn_varlen_func). I will sync with @LiuXiaoxuanPKU on this.

@NickLucche marked this pull request as ready for review on October 17, 2024 15:45
Review thread (resolved): vllm/config.py
@sroy745 (Collaborator) left a comment

Thanks for the PR. Left a few comments. PTAL.

Review threads (resolved): vllm/attention/backends/flash_attn.py, vllm/config.py, vllm/spec_decode/spec_decode_worker.py, vllm/spec_decode/mqa_scorer.py, tests/spec_decode/test_spec_decode_worker.py
@NickLucche force-pushed the chunk-spec-decoding-rebase branch from 0819d12 to 3e5b882 on October 22, 2024 09:15
@sroy745 (Collaborator) left a comment

Thanks for the PR! One comment about leaving the unified-kernel changes out of this PR; please check with @LiuXiaoxuanPKU and @comaniac on this. Otherwise LGTM.

Review thread (resolved): vllm/attention/backends/flash_attn.py
@NickLucche (Contributor, Author)

Thanks for reviewing this!

@sroy745 (Collaborator) left a comment

Thanks for the PR!! LGTM.

Review threads (resolved): vllm/attention/backends/flash_attn.py
@comaniac added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Oct 30, 2024
@comaniac (Collaborator) left a comment

LGTM. Good job! Only nits.

Review threads (resolved): vllm/config.py, tests/utils.py

mergify bot commented Oct 30, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @NickLucche please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: NickLucche <[email protected]>
@NickLucche force-pushed the chunk-spec-decoding-rebase branch from cd9bd2a to ca2691e on October 31, 2024 17:49
mergify bot removed the needs-rebase label on Oct 31, 2024
@NickLucche (Contributor, Author)

Apologies for the automatic review request to so many people; I had to sign commits and force-push.

@sroy745 (Collaborator) commented Nov 1, 2024

@NickLucche I think you need to remove test_spec_decode_xfail_chunked_prefill from spec_decode/e2e/test_compatibility.py since it's no longer applicable. Could you also please sync your branch to head? It seems like some of the failures, e.g. in buildkite/ci-aws/pr/decoder-only-multi-modal-models-test, might already be fixed at head.

@NickLucche force-pushed the chunk-spec-decoding-rebase branch from 756d33f to fb66563 on November 4, 2024 17:47
@njhill (Member) left a comment

Thanks @NickLucche for the awesome work, and to @sroy745 @LiuXiaoxuanPKU @comaniac for the reviews.

@njhill merged commit 9d43afc into vllm-project:main on Nov 7, 2024 (57 checks passed)
@andoorve (Collaborator) commented Nov 7, 2024

Hi @NickLucche, thanks for the PR!

I tried with TP on the latest main. It seems like I still get the same error as @arashsadrieh. Is this expected to work?

KeyError: 'num_seq_groups'

@NickLucche (Contributor, Author) commented Nov 7, 2024

Hey @andoorve, yeah, TP for the target model should be working; IIRC even @sroy745's benchmarks ran with tp=4. Unfortunately I don't have a way to test master right now as I am away :/

@sroy745 (Collaborator) commented Nov 7, 2024

Hi @andoorve / @arashsadrieh,
I am able to run this PR with the following command:

python3 -m vllm.entrypoints.openai.api_server --model "meta-llama/Meta-Llama-3-70B-Instruct" --tensor-parallel-size 4 --disable-log-requests --enable-chunked-prefill --max_num_batched_tokens 2048 --speculative_model turboderp/Qwama-0.5B-Instruct --num_speculative_tokens 1 --speculative_draft_tensor_parallel_size 1 --disable-custom-all-reduce --swap_space 16 --speculative_disable_mqa_scorer

What is the command you are using?

One difference I think is that in our evals we ran with the speculative model running with tp=1 and the target model running with tp=4. Can you try and see if that works for you?

@andoorve (Collaborator) commented Nov 7, 2024

Hey @NickLucche @sroy745, this is what I'm using. I think this is the difference, as I'm running with TP > 1 on the draft model as well. Unfortunately the Llama 8B draft model that I want to use is relatively large for TP=1.

vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --max-num-seqs 32  --block-size 32  --speculative-model meta-llama/Llama-3.1-8B-Instruct  --num-speculative-tokens 8 --gpu-memory-utilization  0.98 --use-v2-block-manager --distributed-executor-backend ray --enable-chunked-prefill --max-num-batched-tokens 4096 --max-model-len 32768

@sroy745 (Collaborator) commented Nov 8, 2024

> Hey @NickLucche @sroy745, this is what I'm using. I think this is the difference, as I'm running with TP > 1 on the draft model as well. Unfortunately the Llama 8B draft model that I want to use is relatively large for TP=1.
>
> vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --max-num-seqs 32  --block-size 32  --speculative-model meta-llama/Llama-3.1-8B-Instruct  --num-speculative-tokens 8 --gpu-memory-utilization  0.98 --use-v2-block-manager --distributed-executor-backend ray --enable-chunked-prefill --max-num-batched-tokens 4096 --max-model-len 32768

I will add a check to verify that SD + chunked prefill is only enabled with a tp=1 draft model, and then continue with the investigation. It is not breaking any existing cases, so I will add the check and then debug.
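
For illustration, a hypothetical sketch of what such a guard could look like (the function name and its arguments are made up for this example and do not necessarily match vLLM's actual config code; the CLI flag referenced in the message is the one used in the command above):

```python
# Hypothetical guard, illustration only -- argument names are invented and do
# not necessarily mirror vLLM's SpeculativeConfig / SchedulerConfig fields.
def check_chunked_prefill_spec_decode(enable_chunked_prefill: bool,
                                      num_speculative_tokens: int,
                                      draft_tensor_parallel_size: int) -> None:
    """Reject the currently unsupported combination: chunked prefill together
    with speculative decoding while the draft model runs at TP > 1."""
    spec_decode_enabled = num_speculative_tokens > 0
    if enable_chunked_prefill and spec_decode_enabled \
            and draft_tensor_parallel_size != 1:
        raise ValueError(
            "Chunked prefill with speculative decoding currently requires the "
            "draft model to run with tensor parallel size 1 "
            "(--speculative-draft-tensor-parallel-size 1); got draft "
            f"TP={draft_tensor_parallel_size}.")

# Mirrors the failing setup above: TP=8 target with the 8B draft also sharded.
try:
    check_chunked_prefill_spec_decode(True, 8, 8)
except ValueError as e:
    print(e)
```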

Isotr0py pushed a commit to Isotr0py/vllm that referenced this pull request Nov 8, 2024
JC1DA pushed a commit to JC1DA/vllm that referenced this pull request Nov 11, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
Labels: ci/build, documentation, frontend, ready

7 participants