
[Speculative Decoding] EAGLE Implementation with Top-1 proposer #6830

Merged: 54 commits into vllm-project:main on Aug 22, 2024

Conversation

@abhigoyal1997 (Contributor) commented Jul 26, 2024

This PR adds support for the EAGLE draft model.

Fix SafeAILab/EAGLE#43
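
For readers landing here, a minimal offline-inference sketch of how an EAGLE draft model can be plugged into vLLM's speculative decoding. This is not an official example from the PR: the checkpoint path is a placeholder for an EAGLE checkpoint converted to the layout vLLM expects, and the engine flags are the ones used elsewhere in this thread (they may differ in later vLLM versions).

```python
from vllm import LLM, SamplingParams

# Placeholder path: an EAGLE draft checkpoint converted to the layout vLLM
# expects (see the conversion discussion later in this thread).
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="./eagle-llama3-8b-instruct-vllm",
    num_speculative_tokens=4,
    use_v2_block_manager=True,
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```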


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@abhigoyal1997 abhigoyal1997 marked this pull request as ready for review July 26, 2024 13:33
@abhigoyal1997 (Contributor Author)

/ready

@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 26, 2024
@abhigoyal1997 abhigoyal1997 changed the title [WIP] [Speculative Decoding] EAGLE Implementation with Top-1 proposer [Speculative Decoding] EAGLE Implementation with Top-1 proposer Jul 26, 2024
@caddfa31434

For some reason, EAGLE seems to have removed the input_layernorm: https://github.com/SafeAILab/EAGLE/blob/main/eagle/model/cnets.py#L419

@abhigoyal1997 (Contributor Author)

> For some reason, EAGLE seems to have removed the input_layernorm: https://github.com/SafeAILab/EAGLE/blob/main/eagle/model/cnets.py#L419

Yes, I saw that recently. But making that change would mean either changing the decoder layer to make the input layernorm optional or rewriting the decoder layer just for EAGLE. Both options would reduce the freedom to use any decoder with EAGLE.
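
As a toy illustration of the first option (keeping a stock decoder layer but turning its input layernorm into a no-op), here is a self-contained sketch. The layer class below is made up for illustration only; it is not vLLM's or EAGLE's actual decoder layer.

```python
import torch
import torch.nn as nn


class ToyDecoderLayer(nn.Module):
    """Stand-in for a generic decoder layer with a pre-attention layernorm."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.input_layernorm(x)
        h, _ = self.attn(h, h, h)
        x = x + h
        return x + self.mlp(x)


def drop_input_layernorm(layer: nn.Module) -> nn.Module:
    # Mirror the reference EAGLE head, which skips the input layernorm,
    # without rewriting the layer's forward pass.
    layer.input_layernorm = nn.Identity()
    return layer


layer = drop_input_layernorm(ToyDecoderLayer(hidden_size=64))
out = layer(torch.randn(2, 8, 64))  # (batch, seq_len, hidden_size)
```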

@cadedaniel (Collaborator)

cc @comaniac @sroy745

@sroy745 (Collaborator) left a comment

Thanks for the PR. Left some comments. PTAL

Review threads: tests/spec_decode/e2e/test_eagle_correctness.py, vllm/spec_decode/spec_decode_worker.py, vllm/worker/worker_base.py, vllm/spec_decode/multi_step_worker.py, vllm/model_executor/models/eagle.py
@abhigoyal1997 (Contributor Author)

@sroy745 Thanks for the review. I've made the changes and responded to your comments. PTAL

@abhigoyal1997 (Contributor Author) left a comment

@cadedaniel I've addressed the comments. PTAL, thanks!

Review threads: vllm/model_executor/models/eagle.py, vllm/sequence.py
@cadedaniel (Collaborator)

Will review tomorrow!

@cadedaniel (Collaborator) left a comment

Sorry, need 1 more day to finish the review. Partial comments below.

Review thread: vllm/model_executor/models/medusa.py

Inline comment on vllm/sequence.py (around `def expand_with_bonus_tokens`):

(not blocking for this PR): these data structures, which require torch operations, should not live in sequence.py; they should go under spec_decode.

@cadedaniel (Collaborator) left a comment

LGTM! Let's merge once tests are passing.

Review thread: vllm/worker/worker_base.py
@abhigoyal1997 (Contributor Author)

Thanks @cadedaniel for reviewing and approving. This is ready to merge!

@cadedaniel cadedaniel merged commit a3fce56 into vllm-project:main Aug 22, 2024
46 checks passed
@cadedaniel (Collaborator)

Do you have the steps for creating the checkpoint, @abhigoyal1997?

@jokmingwong commented Aug 30, 2024

python3 benchmarks/benchmark_latency.py --model Qwen/Qwen2-7B-Instruct --speculative-model yuhuili/EAGLE-Qwen2-7B-Instruct --num_speculative_tokens 4 --use-v2-block-manager --batch-size 1 --input-len 1024 --output-len 128 --max-model-len 2048

I used the command above to run the latest EAGLE speculative decoding PR with Qwen2-7B. The EAGLE model was converted with the conversion script mentioned in the comments. After running, I found that the output generated with speculative decoding is inconsistent with the output generated without it. Looking at the source code, I suspect there may be an issue with how input_ids are handled, since the vLLM implementation differs from the EAGLE library implementation:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/eagle.py#L76-L99
https://github.com/SafeAILab/EAGLE/blob/main/eagle/modeling_eagle.py#L1273
Namely, when running the forward pass for the EAGLE model, the first token from the original model should also be placed in input_ids. Can you explain the reason for the difference between these two implementations?

@abhigoyal1997 (Contributor Author) commented Aug 30, 2024

> I used the command above to run the latest EAGLE speculative decoding PR with Qwen2-7B. […] Can you explain the reason for the difference between these two implementations?

Even in vLLM, the first token of the target model is present in input_ids. This is because the first token is generated by the target model in the prefill step, after which it is added to input_ids, and EAGLE only starts generating tokens in the subsequent decode step. As for the masking in the forward pass, that masks the first input token, not any token output by the target model. This didn't make any difference in the outputs.

As for why you are seeing inconsistency: if you are using 16-bit precision, could it be related to this: #4978 (comment)?
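
To make the masking point concrete, here is a rough, simplified sketch of the input construction discussed above; names and shapes are simplified for illustration, and the eagle.py lines linked earlier are authoritative. Each token embedding is fused with the target model's hidden state from the previous step, and only the position with no previous hidden state is zeroed out.

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 64, 1000
embed_tokens = nn.Embedding(vocab_size, hidden_size)
fc = nn.Linear(2 * hidden_size, hidden_size)


def eagle_draft_inputs(input_ids: torch.Tensor,
                       target_hidden_states: torch.Tensor,
                       positions: torch.Tensor) -> torch.Tensor:
    # Pair each token embedding with the target model's hidden state and
    # fuse them with a linear projection.
    fused = fc(torch.cat([embed_tokens(input_ids), target_hidden_states], dim=-1))
    # The first position has no "previous" target hidden state, so its fused
    # embedding is zeroed; tokens produced by the target model (including the
    # first generated token) are never masked.
    fused = fused.masked_fill((positions == 0).unsqueeze(-1), 0.0)
    return fused


# Toy usage: one sequence of length 5.
ids = torch.randint(0, vocab_size, (5,))
hs = torch.randn(5, hidden_size)   # hidden states from the target model
pos = torch.arange(5)
draft_in = eagle_draft_inputs(ids, hs, pos)  # shape: (5, hidden_size)
```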

@abhigoyal1997 (Contributor Author) commented Aug 30, 2024

> Do you have the steps for creating the checkpoint, @abhigoyal1997?

Is the gist in this comment helpful?

> # This implementation is incompitable with https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B

@jokmingwong

> Even in vLLM, the first token of the target model is present in input_ids. […] As for why you are seeing inconsistency: if you are using 16-bit precision, could it be related to this: #4978 (comment)?

Sorry, I may not have expressed the issue clearly. I noticed that when using speculative decoding with EAGLE in vLLM, with the same prompt and the same models, top_k=1, and temperature=0.5, the output is inconsistent with the official EAGLE implementation. I will provide my test cases later to help reproduce this issue.

@Siegfried-qgf

Great work. Do you have any plans to implement tree decoding? It seems that tree decoding will be very important for improving the results.

@Sekri0 commented Sep 20, 2024

Do you have any plans to support scenarios where tp > 1? @abhigoyal1997

@den-run-ai

Is it possible to reproduce the ~2x-3x speedup reported in EAGLE 1/2 papers with this PR in vLLM?

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
@xiongqisong

> I used the command above to run the latest EAGLE speculative decoding PR with Qwen2-7B. […]

I tried to use EAGLE with Llama but failed. I want to know how to use EAGLE in vLLM; it's hard to use with no demo.

@sroy745 (Collaborator) commented Dec 14, 2024

Hi @xiongqisong, which checkpoint are you using as the draft model? Is it one of the checkpoints available here? If so, it will not work, since the checkpoint vLLM needs is a bit different from what is available at https://huggingface.co/yuhuili. You need to convert the checkpoint from https://huggingface.co/yuhuili using the script here and use the converted checkpoint as the draft model in vLLM.

Please let us know whether this works for you. I think @LiuXiaoxuanPKU recently used the script to convert the checkpoint for yuhuili/EAGLE-LLaMA3-Instruct-70B into a vLLM-compatible checkpoint, and it worked for her.

I will add a section on how to use EAGLE to the speculative decoding documentation here shortly.

cc: @LiuXiaoxuanPKU
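
For reference, the end-to-end flow mirrors the benchmark command earlier in this thread; the target model name and the local path of the converted draft checkpoint below are placeholders, not values confirmed in this thread:

```
python3 benchmarks/benchmark_latency.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-model ./eagle-llama3-8b-instruct-vllm \
    --num_speculative_tokens 4 \
    --use-v2-block-manager \
    --batch-size 1 --input-len 1024 --output-len 128 --max-model-len 2048
```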

@xiongqisong commented Dec 16, 2024

> Hi @xiongqisong, which checkpoint are you using as the draft model? […] I will add a section on how to use EAGLE to the speculative decoding documentation here shortly.

Thanks for the reply. I already used the script to convert the EAGLE model weights to the vLLM format, but vLLM still can't run EAGLE normally. I shared the details in #11126; I hope you have time to help me, @sroy745.
If you can add a section on how to use EAGLE to the speculative decoding documentation at https://docs.vllm.ai/en/latest/usage/spec_decode.html, that would be great for getting more people involved in developing well-performing EAGLE speculative decoding in vLLM.
