[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. #6485
@cadedaniel this PR is now ready for review. PTAL.
great stuff, thanks! all feedback is code cleanliness/comments
class TargetModelRunner(ModelRunner):
    """Specialized model runner for speculative decoding target model.
nit: add comment explaining why we do this
In speculative decoding, the logprobs that are finally returned may not be the
ones produced by the target model's sampling step. Time spent computing logprobs
in the target model is therefore wasted, since we calculate logprobs after
deciding which tokens are accepted. For this reason we disable logprobs in the
target model so that scoring is faster.
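To illustrate the reasoning above, here is a minimal sketch in plain Python. It is not vLLM code: `log_softmax`, `logprobs_for_accepted`, and the toy logits are hypothetical names and values chosen for illustration. It assumes scoring yields one logits row per proposed position and that acceptance keeps only a prefix of the proposals, so logprobs are computed once, after acceptance, and only for accepted positions.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a single logits row."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def logprobs_for_accepted(
    per_position_logits: List[List[float]],
    accepted_token_ids: List[int],
) -> List[float]:
    """Compute logprobs only for positions whose tokens were accepted.

    Hypothetical helper: rows beyond len(accepted_token_ids) belong to
    rejected proposals and are never touched, so no work is spent on them.
    """
    return [
        log_softmax(logits)[token_id]
        for logits, token_id in zip(per_position_logits, accepted_token_ids)
    ]


# Toy data: 4 scored positions, vocabulary of 3, only the first 2 accepted.
scored_logits = [
    [2.0, 0.5, 0.1],
    [0.3, 1.7, 0.2],
    [0.0, 0.0, 3.0],
    [1.0, 1.0, 1.0],
]
accepted = [0, 1]
accepted_logprobs = logprobs_for_accepted(scored_logits, accepted)
```

In this toy case two of the four scored rows are skipped entirely. Computing logprobs inside the scoring pass would have processed all four rows before it was known which tokens survive.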
Done.
Thanks for the review. Addressed all comments. PTAL.
Enabling auto merge
… for both draft and target models. (vllm-project#6485)
… for both draft and target models. (vllm-project#6485) Signed-off-by: Alvant <[email protected]>
In this PR we disable serialization of logprobs to CPU for both the draft and target models. To that end, we make the following changes