[Kernel] Generalize Speculative decode from Cuda #10094
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Hi @LiuXiaoxuanPKU, could you take a look at this PR?
Hi @simon-mo, will you check on this PR?
This pull request has merge conflicts that must be resolved before it can be merged.
@WoosukKwon, will you take a look at this PR?
Although it may not be practical due to the lack of compute intensity, it would be helpful for testing the generalization to have a CPU implementation, so that non-CUDA paths can be tested more easily.
@mgoin, CPU support for spec decode has been added. Please take a look.
@cadedaniel, could you take a look at this PR? I would like to remove the hard dependency of spec decode on CUDA so we can apply it to other hardware.
This pull request has merge conflicts that must be resolved before it can be merged.
Can you share the performance improvement on AMD hardware? Cc @LiuXiaoxuanPKU @comaniac
@cadedaniel, thanks for reviewing this PR. I aimed to use this PR to first make it possible to run spec decode on hardware other than CUDA GPUs. FYI, we have another proposal for a heterogeneous setup that runs the draft model on CPU and the target model on GPU. We can discuss that later; it may be a better use case for running spec decode on CPU.
Hi @njhill, I just learned you are one of the owners of spec decode. Could you help review this PR?
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/spec_decode/medusa_worker.py (outdated)
if current_platform.is_neuron():
    from vllm.worker.neuron_worker import NeuronWorker as WorkerCls
elif current_platform.is_hpu():
    from vllm.worker.hpu_worker import HPUWorker as WorkerCls
elif current_platform.is_openvino():
    from vllm.worker.openvino_worker import OpenVINOWorker as WorkerCls
elif current_platform.is_cpu():
    from vllm.worker.cpu_worker import CPUWorker as WorkerCls
elif current_platform.is_tpu():
    from vllm.worker.tpu_worker import TPUWorker as WorkerCls
elif current_platform.is_xpu():
    from vllm.worker.xpu_worker import XPUWorker as WorkerCls
else:
    from vllm.worker.worker import Worker as WorkerCls
This is not a clean and concise way to support non-CUDA workers, so you'll apparently need some design work here.
@comaniac, I could put a worker_selector.py in either the worker folder or the spec_decode folder. I didn't do that because, when I discussed this with @LiuXiaoxuanPKU, she preferred to keep this PR as simple as possible.
I'd like your opinion here. The idea is that I can extract the code above into a new file, and in spec_decode_worker, medusa_worker, etc. simply do "from vllm.worker.selector import WorkerCls".
The problem is that I don't think the current PR is simple, given that this logic is tedious and duplicated everywhere. I'm also not sure it is reliable to derive classes from a dynamic variable (i.e. current_platform) in a distributed environment.
Thanks @comaniac, do you mean support for heterogeneous platforms in the spec decode path?
I totally agree that the current code is tedious. Do you think extracting the worker selector into a single file would simplify the code, or do you have another suggestion?
I am completely open to discussing the design.
I don't mean supporting heterogeneous platforms. I just feel that class MedusaWorker(NonLLMProposerWorkerBase, WorkerCls), which derives from a dynamic WorkerCls, is not trivial, and I'm not sure whether it is safe and reliable.
@comaniac, I see. Alternatively, I could add all the necessary APIs to worker_base.py and make medusa_worker, multi_step_worker, and the others derive from WorkerBase instead of Worker.
But that change would be tremendous, which is why I am not sure I should do it.
I tested the current approach with the dynamic WorkerCls: it works on CUDA and CPU, and also works on HPU in my own dev environment.
So I consider it a valid solution.
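For readers unfamiliar with the pattern being debated, the following is a minimal, self-contained sketch of a class that derives from a base selected at import time. The class names are illustrative placeholders, not vLLM's actual code.

```python
# Illustrative sketch of the "dynamic base class" pattern under discussion.
# The names here are placeholders; they are not vLLM's real classes.
import platform

# Pick the base class once at import time, analogous to selecting WorkerCls
# per platform.
if platform.system() == "Linux":
    class _SelectedBase:
        backend = "linux"
else:
    class _SelectedBase:
        backend = "generic"


class MedusaLikeWorker(_SelectedBase):
    """Derives from whichever base class was selected above."""

    def describe(self) -> str:
        return f"worker built on the {self.backend} base"


if __name__ == "__main__":
    print(MedusaLikeWorker().describe())
```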
@comaniac, I updated this PR. WorkerCls is now defined in "vllm/spec_decode/selector.py" instead of being spread all around. Please check whether this looks better.
@comaniac, I verified the distributed case as well, using the test below:
CUDA_VISIBLE_DEVICES=0,1 pytest -v tests/spec_decode/e2e/test_integration_dist_tp2.py::test_draft_model_tp_lt_target_model_tp2
base_cls_selector.py may be a better name for this.
Can we wrap the logic in an API? For example:
def get_worker_cls_by_platform():
    ...
In general this is still not best practice, but I don't have a better solution atm.
cc @youkaichao
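A minimal sketch of what the suggested wrapper could look like, reusing the platform checks from the diff above. The file name base_cls_selector.py follows the suggestion in this comment; this is only an illustration, not the code that was ultimately merged.

```python
# Hypothetical vllm/spec_decode/base_cls_selector.py -- a sketch of the
# suggested API, wrapping the per-platform checks shown in the diff above.
from vllm.platforms import current_platform


def get_worker_cls_by_platform():
    """Return the platform-specific Worker class (illustrative sketch)."""
    if current_platform.is_neuron():
        from vllm.worker.neuron_worker import NeuronWorker as WorkerCls
    elif current_platform.is_hpu():
        from vllm.worker.hpu_worker import HPUWorker as WorkerCls
    elif current_platform.is_openvino():
        from vllm.worker.openvino_worker import OpenVINOWorker as WorkerCls
    elif current_platform.is_cpu():
        from vllm.worker.cpu_worker import CPUWorker as WorkerCls
    elif current_platform.is_tpu():
        from vllm.worker.tpu_worker import TPUWorker as WorkerCls
    elif current_platform.is_xpu():
        from vllm.worker.xpu_worker import XPUWorker as WorkerCls
    else:
        from vllm.worker.worker import Worker as WorkerCls
    return WorkerCls
```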
@@ -320,7 +348,7 @@ def init_device(self) -> None:
     "[Speculative Decoding] Use MQA scorer for scoring proposals.")

     self.scorer = scorer_cls(scorer_worker=self.scorer_worker,
-                             device=self.device,
+                             device=self.device.type,
The argument is device, so you shouldn't pass the device type. You could take the device type inside scorer_cls, and then you wouldn't need to change this line.
Hi @comaniac, the reason I changed that is that the device type is a str in the Scorer class's init, but a device object was being passed, so it failed the mypy check:
https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/interfaces.py#L78-L79
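A small sketch of the type mismatch being described, assuming self.device is a standard torch.device (variable names here are illustrative):

```python
import torch

device = torch.device("cuda:0")  # a torch.device, as self.device typically is
print(type(device))              # <class 'torch.device'>
print(device.type)               # "cuda" -- a plain str, matching a
                                 # device_type: str annotation
```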
    ModelInputForNeuron as ModelInputCls)
from vllm.worker.neuron_model_runner import (  # noqa: F401
    NeuronModelRunner as ModelRunnerCls)
from vllm.worker.neuron_worker import (  # noqa: F401
Oh, I actually plan to add an argument like --worker-cls auto and let every platform select its own worker class. We should do that.
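One common way to implement such a string-based selection is to resolve a dotted class path with importlib. The sketch below only illustrates that general idea; it does not reflect the interface vLLM actually adopted, and the demo path is a stand-in.

```python
import importlib


def resolve_cls(qualified_name: str):
    """Resolve a class from a dotted path such as 'package.module.ClassName'."""
    module_name, _, cls_name = qualified_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, cls_name)


# Hypothetical usage: a platform could map "auto" to its own worker class path.
worker_cls = resolve_cls("collections.OrderedDict")  # stand-in path for the demo
print(worker_cls)  # <class 'collections.OrderedDict'>
```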
@youkaichao, is there something I can refer to? Or does this file work? Currently I put it under the spec_decode folder; it would also make sense to put it under the worker folder.
@comaniac, I resolved most of your comments and left two TODOs:
#10555 should fix this.
This pull request has merge conflicts that must be resolved before it can be merged.
This PR mainly targets removing the hard dependency on CUDA in speculative decoding.
Done:
Based on discussions with @comaniac and @youkaichao, I provided a second solution that avoids the dynamic WorkerCls => #10587
Settings:
* ngram