[Kernel]Generalize Speculative decode from Cuda #10094

Closed (10 commits)

Conversation

@xuechendi (Contributor) commented Nov 6, 2024

This PR mainly targets removing the hard dependency on CUDA in speculative decoding.

Done:

  1. Removed the hard dependency; the worker / model runner is now selected based on current_platform
  2. Per mgoin's suggestion, enabled CPU support for speculative decoding

Based on discussion with @comaniac and @youkaichao, I provided a second solution that avoids the dynamic WorkerCls => #10587


Settings:

  • draft model
    llm = LLM(
        model="facebook/opt-1.3b",
        speculative_model="facebook/opt-125m",
        num_speculative_tokens=5,
        use_v2_block_manager=True,
    )
  • medusa
    llm = LLM(
        model="JackFram/llama-68m",
        speculative_model="abhigoyal/vllm-medusa-llama-68m-random",
        num_speculative_tokens=4,
        use_v2_block_manager=True,
    )
  • eagle
    llm = LLM(
        model="JackFram/llama-68m",
        speculative_model="abhigoyal/vllm-eagle-llama-68m-random",
        num_speculative_tokens=5,
        use_v2_block_manager=True
    )
  • mlp
    llm = LLM(
        model="JackFram/llama-160m",
        speculative_model="ibm-fms/llama-160m-accelerator",
        num_speculative_tokens=3,
        use_v2_block_manager=True
    )

  • ngram
    llm = LLM(
        model="facebook/opt-350m",
        speculative_model="[ngram]",
        num_speculative_tokens=5,
        ngram_prompt_lookup_max=3,
        use_v2_block_manager=True,
    )
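
For reference, any of the configurations above can be exercised with a short generation call. The prompt and sampling parameters below are illustrative only and are not part of the PR:

    from vllm import LLM, SamplingParams

    # Draft-model speculative decoding, using the first configuration listed above.
    llm = LLM(
        model="facebook/opt-1.3b",
        speculative_model="facebook/opt-125m",
        num_speculative_tokens=5,
        use_v2_block_manager=True,
    )

    # Greedy sampling makes it easy to compare against a non-speculative run.
    outputs = llm.generate(
        ["The future of AI is"],
        SamplingParams(temperature=0.0, max_tokens=64),
    )
    print(outputs[0].outputs[0].text)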

github-actions bot commented Nov 6, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@xuechendi (Contributor, Author) commented Nov 6, 2024

Hi @LiuXiaoxuanPKU, could you take a look at this PR?
I want to remove the hard CUDA dependency in speculative decoding.

@xuechendi xuechendi force-pushed the spec_decode_detach_hw branch 4 times, most recently from 16a98e1 to 23037b4 on November 6, 2024 at 23:19
@xuechendi (Contributor, Author) commented Nov 7, 2024

Hi, @simon-mo, will you check on this PR?

mergify bot commented Nov 7, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xuechendi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 7, 2024
@xuechendi xuechendi force-pushed the spec_decode_detach_hw branch from 615ea18 to cdd0471 on November 7, 2024 at 16:51
@mergify mergify bot removed the needs-rebase label Nov 7, 2024
@xuechendi xuechendi closed this Nov 7, 2024
@xuechendi xuechendi reopened this Nov 7, 2024
@xuechendi xuechendi force-pushed the spec_decode_detach_hw branch from 6247f29 to 1ea5684 on November 7, 2024 at 20:21
@xuechendi (Contributor, Author) commented Nov 7, 2024

@WoosukKwon , will you take a look at this PR?

@mgoin (Member) commented Nov 7, 2024

Although it may not be practical due to the lack of compute intensity, it would be helpful for testing the generalization to have a CPU implementation, so that the non-CUDA path can be exercised more easily.

@xuechendi (Contributor, Author)

@mgoin, CPU support for spec decode has been added. Please help review it.

@xuechendi xuechendi changed the title from "Generalize Speculative decode from Cuda" to "[Kernel]Generalize Speculative decode from Cuda" on Nov 8, 2024
@xuechendi (Contributor, Author)

@cadedaniel, could you take a look at this PR? I would like to remove spec decode's hard dependencies on CUDA so that it can be applied to other hardware.

@xuechendi xuechendi force-pushed the spec_decode_detach_hw branch from 4337679 to 77ac59a on November 8, 2024 at 16:35

mergify bot commented Nov 11, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xuechendi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 11, 2024
@cadedaniel (Collaborator)

@cadedaniel, could you take a look at this PR? I would like to remove spec decode's hard dependencies on CUDA so that it can be applied to other hardware.

Can you share the performance improvement on AMD hardware? Cc @LiuXiaoxuanPKU @comaniac

@xuechendi (Contributor, Author) commented Nov 12, 2024

@cadedaniel, could you take a look at this PR? I would like to remove spec decode's hard dependencies on CUDA so that it can be applied to other hardware.

Can you share the performance improvement on AMD hardware? Cc @LiuXiaoxuanPKU @comaniac

@cadedaniel, thanks for reviewing this PR. My aim here is first to make it possible to run spec decode on hardware other than GPUs.
Performance-wise, I believe different hardware may need special treatment to reach optimal performance (maybe we can do that in another PR?). Adding CPU support here is only to show that all hard dependencies on GPU have been cleaned up, so this PR may not be the best implementation for CPU.

FYI, we have another proposal for a heterogeneous setup that runs the draft model on CPU and the target model on GPU. We can discuss that later; it may be a better use case for running spec decode on CPU.

@xuechendi (Contributor, Author)

Hi @njhill, I just learned you are one of the owners of spec decode. Could you help review this PR?

@xuechendi xuechendi force-pushed the spec_decode_detach_hw branch from 77ac59a to 9a3bd16 on November 15, 2024 at 22:11
@mergify mergify bot removed the needs-rebase label Nov 15, 2024

mergify bot commented Nov 20, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xuechendi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 20, 2024
Comment on lines 14 to 27
if current_platform.is_neuron():
    from vllm.worker.neuron_worker import NeuronWorker as WorkerCls
elif current_platform.is_hpu():
    from vllm.worker.hpu_worker import HPUWorker as WorkerCls
elif current_platform.is_openvino():
    from vllm.worker.openvino_worker import OpenVINOWorker as WorkerCls
elif current_platform.is_cpu():
    from vllm.worker.cpu_worker import CPUWorker as WorkerCls
elif current_platform.is_tpu():
    from vllm.worker.tpu_worker import TPUWorker as WorkerCls
elif current_platform.is_xpu():
    from vllm.worker.xpu_worker import XPUWorker as WorkerCls
else:
    from vllm.worker.worker import Worker as WorkerCls
Collaborator

This is not a clean and concise way to support non-CUDA workers, so apparently you'll need some design work.

@xuechendi (Contributor, Author) commented Nov 20, 2024

@comaniac, I could put a worker_selector.py in either the worker folder or the spec_decode folder. The reason I didn't is that when I discussed this with @LiuXiaoxuanPKU, she preferred to keep this PR as simple as possible.

I would like your opinion here. The idea is that I can extract the code above into a new file, and in spec_decode_worker, medusa_worker, etc., simply do "from vllm.worker.selector import WorkerCls".
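
A minimal sketch of that extraction, assuming a hypothetical vllm/worker/selector.py; only a few platforms are shown, and the module path is illustrative rather than something this PR adds:

    # vllm/worker/selector.py (hypothetical): resolve the platform's Worker once,
    # so spec-decode workers can simply `from vllm.worker.selector import WorkerCls`.
    from vllm.platforms import current_platform

    if current_platform.is_cpu():
        from vllm.worker.cpu_worker import CPUWorker as WorkerCls
    elif current_platform.is_hpu():
        from vllm.worker.hpu_worker import HPUWorker as WorkerCls
    else:
        from vllm.worker.worker import Worker as WorkerCls  # CUDA/ROCm default

Spec-decode classes such as MedusaWorker or MultiStepWorker would then derive from this WorkerCls, which is the dynamic-base-class pattern questioned in the following comments.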

Collaborator

The problem is that I don't think the current PR is simple, given that this logic is tedious and duplicated everywhere. I'm also not sure it is reliable to derive classes from a dynamic variable (i.e. current_platform) in a distributed environment.

Contributor Author

Thanks @comaniac, do you mean support for heterogeneous platforms in the spec decode path?
Yeah, I totally agree that the current code is tedious. Do you think extracting the worker selector into a single file would simplify it enough, or do you have another suggestion?

I am totally open to discussing the design.

Collaborator

I don't mean supporting heterogeneous platforms. I just feel that class MedusaWorker(NonLLMProposerWorkerBase, WorkerCls) deriving from a dynamic WorkerCls is not trivial, and I'm not sure it is safe and reliable.

Contributor Author

@comaniac, I see. Alternatively, I could add all the necessary APIs to worker_base.py and make medusa_worker / multi_step_worker and the others derive from "WorkerBase" instead of "Worker". But that change would be tremendous, which is why I am not sure I should do it.

I tested the current approach of using a dynamic WorkerCls: it works on CUDA and CPU, and also works for HPU in my own dev environment, so I consider it a valid solution.
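
For illustration, a rough sketch of that alternative; NonLLMProposerWorkerBase is omitted and the method body is a placeholder, so this is not a drop-in change:

    # Hypothetical: derive spec-decode workers from the abstract WorkerBase instead
    # of a platform-specific Worker chosen at import time.
    from vllm.worker.worker_base import WorkerBase


    class MedusaWorker(WorkerBase):
        def init_device(self) -> None:
            # Any device-specific setup previously inherited from Worker would have
            # to be reimplemented or delegated here, hence the "tremendous" change.
            ...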

Contributor Author

@comaniac, I updated this PR; WorkerCls is now defined in "vllm/spec_decode/selector.py" instead of being spread all around. Please check whether this looks better.

Contributor Author

@comaniac, I verified the distributed case as well, using the test below:

CUDA_VISIBLE_DEVICES=0,1 pytest -v tests/spec_decode/e2e/test_integration_dist_tp2.py::test_draft_model_tp_lt_target_model_tp2

@mergify mergify bot removed the needs-rebase label Nov 20, 2024
Collaborator

base_cls_selector.py may be a better name for this.

Can we wrap the logic in an API? For example:

def get_worker_cls_by_platform():
    ...

In general this is still not the best practice, but I don't have a better solution atm.
cc @youkaichao
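
A sketch of the suggested API shape, reusing the platform checks from the diff above; the file name follows the comment and the body is illustrative:

    # vllm/spec_decode/base_cls_selector.py (name per the suggestion above)
    from vllm.platforms import current_platform


    def get_worker_cls_by_platform():
        """Return the platform-specific Worker class for spec decode to derive from."""
        if current_platform.is_cpu():
            from vllm.worker.cpu_worker import CPUWorker
            return CPUWorker
        if current_platform.is_hpu():
            from vllm.worker.hpu_worker import HPUWorker
            return HPUWorker
        from vllm.worker.worker import Worker  # CUDA/ROCm default
        return Worker

Call sites would then resolve the class at runtime, e.g. WorkerCls = get_worker_cls_by_platform(), instead of importing WorkerCls as a module-level side effect.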

Three resolved review threads on vllm/spec_decode/spec_decode_worker.py (outdated).
@@ -320,7 +348,7 @@ def init_device(self) -> None:
                "[Speculative Decoding] Use MQA scorer for scoring proposals.")

        self.scorer = scorer_cls(scorer_worker=self.scorer_worker,
-                                device=self.device,
+                                device=self.device.type,
Collaborator

The argument is device, so you shouldn't pass the device type. You could take the device type inside scorer_cls instead, and then you wouldn't need to change this line.

Contributor Author

Hi @comaniac, the reason I changed that is that the device type is a str in the scorer_cls init, but a device object was being passed, so it failed the mypy check.

https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/interfaces.py#L78-L79
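
For context, a minimal illustration of the mismatch described above; the class is simplified from the linked interfaces.py, where device is annotated as str:

    import torch

    class SpeculativeScorer:  # simplified; the real class lives in vllm/spec_decode/interfaces.py
        def __init__(self, scorer_worker, device: str, vocab_size: int):
            self._scorer_worker = scorer_worker
            self._device = device

    dev = torch.device("cuda:0")
    # Passing the torch.device itself conflicts with the `device: str` annotation under mypy;
    # passing dev.type ("cuda") is what the call site was changed to.
    scorer = SpeculativeScorer(scorer_worker=None, device=dev.type, vocab_size=32000)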

Resolved review thread on vllm/spec_decode/ngram_worker.py (outdated).
        ModelInputForNeuron as ModelInputCls)
    from vllm.worker.neuron_model_runner import (  # noqa: F401
        NeuronModelRunner as ModelRunnerCls)
    from vllm.worker.neuron_worker import (  # noqa: F401
Member

Oh, I actually plan to add an argument like --worker-cls auto and let every platform select its own worker class. We should do that.
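
A rough sketch of that idea; the flag name comes from the comment above, and the resolution logic below is purely illustrative, not the mechanism that eventually landed:

    import importlib

    from vllm.platforms import current_platform


    def resolve_worker_cls(worker_cls: str = "auto"):
        """Illustrative resolver for a --worker-cls flag: 'auto' defers to the
        current platform, anything else is treated as 'module.ClassName'."""
        if worker_cls == "auto":
            if current_platform.is_cpu():
                worker_cls = "vllm.worker.cpu_worker.CPUWorker"
            else:
                worker_cls = "vllm.worker.worker.Worker"
        module_name, class_name = worker_cls.rsplit(".", 1)
        return getattr(importlib.import_module(module_name), class_name)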

Contributor Author

@youkaichao, is there something I can refer to? Or does this file work as-is? Currently I put it under the spec_decode folder; it may also make sense to put it under the worker folder.

@xuechendi (Contributor, Author)

@comaniac, I resolved most of your comments and left two TODOs:

  1. Change 'device.type' back to 'device'. The reason I changed it to 'device.type' was a type fix captured during the mypy check: the SpeculativeScorer init function requires the device as a 'str', and changing it back to 'device' fails the mypy check.
  2. Define get_worker_cls_by_platform() in selector.py => I saw Kaichao said he has some plan for that; I'll check with him, so I left selector.py unchanged for the moment.

@comaniac (Collaborator)

  2. Define get_worker_cls_by_platform() in selector.py => I saw Kaichao said he has some plan for that; I'll check with him, so I left selector.py unchanged for the moment.

#10555 should fix this.

mergify bot commented Nov 22, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xuechendi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 22, 2024
@xuechendi (Contributor, Author)

Thanks @comaniac, I created a new PR that uses WorkerWrapperBase instead of a dynamic WorkerCls. Please see => #10587

@xuechendi xuechendi closed this Nov 25, 2024
@xuechendi xuechendi deleted the spec_decode_detach_hw branch December 19, 2024 21:49