[Bug]: speculative decoding doesn't work with online mode #6967
Comments
Hi @cadedaniel, would you by chance know how to overcome this, or who I should tag about it? Thank you very much!
Hi @stas00. Can you share which attention backend your online serving setup is using? The way you're using this is expected to work.
Thank you for the follow-up, @cadedaniel. I launch the server with just the defaults. Do I need to specify a particular attention backend? The offline script didn't specify any: https://docs.vllm.ai/en/stable/models/spec_decode.html#speculating-with-a-draft-model
The client was just:
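The client script itself isn't preserved above; presumably it followed the linked docs example. A minimal sketch of that completion client, assuming the stock example parameters (the prompt, `n`, and `logprobs=3` values are taken from the docs page, not from the reporter's actual script):

```python
from openai import OpenAI

# Point the OpenAI client at the locally running vLLM OpenAI-compatible server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Use whatever model the server is serving.
model = client.models.list().data[0].id

# Note: logprobs is the parameter that later turned out to trigger the failure
# with speculative decoding (see the discussion below).
completion = client.completions.create(
    model=model,
    prompt="A robot may not injure a human being",
    echo=False,
    n=2,
    stream=False,
    logprobs=3,
)
print("Completion results:", completion)
```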
Yes, although I suspect the issue is in which attention backend is used. Can you share the full log from when you run the server? It should say which attention backend is automatically selected.
The log doesn't mention any attention backend. I've attached the log; I have just updated to 0.5.4 (same problem).
Interesting, that log has a different failure.
Yes, I noticed that too. As I said, probably something changed in 0.5.4. I can re-run with 0.5.3.post1 if it helps, but I don't think it mentions the attention backend either. Should I activate some debug flag to get it printed?
@stas00 I noticed that you are passing `logprobs`. @tjohnson31415 discovered a regression related to this and is working on a fix right now.
@njhill, your suggestion did the trick, thank you! It works if I remove `logprobs`. Should it not be set for speculative decoding? Could you perhaps add a check that it must not be passed? Though I suppose the client doesn't know the server's configuration, so it'd be tricky to know when not to allow it. Perhaps it could query the server's capabilities/setup, or use some sort of debug mode?
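For reference, a sketch of the adjusted request once `logprobs` is dropped (the prompt and other parameters are illustrative, not the exact ones used):

```python
# Same client setup as in the sketch above; the only change is that
# logprobs is no longer passed, which avoids the crash with speculative decoding.
completion = client.completions.create(
    model=model,
    prompt="A robot may not injure a human being",
    echo=False,
    n=2,
    stream=False,
)
print(completion.choices[0].text)
```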
It's a regression (@tjohnson31415 thinks from #6485). I think logprobs should work with spec decoding but just don't in the current version. Prompt logprobs might intentionally not be supported with spec decoding at the moment, and we should return an appropriate error message if that's the case (not sure whether that's currently done). It would be nice to support these too, however; maybe that can also be fixed.
Oh, good catch @njhill! Likely that is the top-level issue.
Hi all. Even with my fix, I can't get logprobs to be returned for generated tokens, so there may be an additional change needed for that.
From a latency perspective, calculating logprobs costs a lot. We may just create an issue for the regression and guide users to run the model without speculative decoding for logprob calculation until it is fixed.
FWIW, in my particular case I didn't care about `logprobs`. Should this doc be updated to show the online version of it as well?
Yeah, if you are able to update the doc to have an online version, @stas00, that would be amazing.
Added here: #7243
Your current environment
🐛 Describe the bug
I'm able to successfully run your offline speculative decoding example:
https://docs.vllm.ai/en/stable/models/spec_decode.html#speculating-with-a-draft-model
I'm trying to make the same work with the online approach, and it keeps crashing.
I mimic the server launch with:
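The exact command isn't preserved above. A minimal sketch of an online launch that mirrors the offline doc example might look like the following; the model names and speculative-decoding flags are assumptions carried over from that example, not the reporter's actual setup:

```bash
# Hypothetical online equivalent of the offline draft-model example;
# model names and speculative settings are placeholders.
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-6.7b \
    --speculative-model facebook/opt-125m \
    --num-speculative-tokens 5 \
    --use-v2-block-manager
```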
If I use the client example from https://docs.vllm.ai/en/stable/getting_started/examples/openai_completion_client.html, the server fails to respond:
Could you please share an example of an online speculative decoding that works?
Thank you!