Prerequisites
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
For function calling, it would be extremely useful if the prompt cache could hold multiple prompts, so that each call can select whichever cached prompt it needs.
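Roughly the usage pattern I have in mind (illustrative sketch only; `prefill`, `generate`, and the prompt names are hypothetical stand-ins, not existing APIs in this project):

```python
from typing import Dict

def prefill(prefix: str) -> Dict:
    # Hypothetical: run the model over `prefix` once and keep its KV state.
    return {"prefix": prefix, "kv": f"<kv state for {len(prefix)} chars>"}

def generate(message: str, cache: Dict) -> str:
    # Hypothetical: decode `message` on top of the cached prefix.
    return f"[reply to {message!r} using cache for {cache['prefix'][:24]}...]"

caches: Dict[str, Dict] = {}  # one cache per named prompt prefix

def get_cache(name: str, prefix: str) -> Dict:
    """Build the cache for a prompt prefix once, then reuse it on later calls."""
    if name not in caches:
        caches[name] = prefill(prefix)
    return caches[name]

# Function calling: each tool/system prompt keeps its own cache, and every
# request picks whichever cached prefix it needs.
weather = get_cache("weather_tools", "You can call get_weather(city)...")
search = get_cache("search_tools", "You can call web_search(query)...")

print(generate("What's the weather in Oslo?", cache=weather))
print(generate("Find recent papers on KV caching.", cache=search))
```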
Motivation
This would massively improve performance for function calling. I'm on an M4 Max and trying to use speculative decoding with a quantized model, which doesn't work (it's unclear whether it should; one of the pages here says speculative decoding doesn't work for quantized models). With a multi-prompt cache, I could still get a significant performance boost out of my current setup.
Possible Implementation
I don't think this would be hard to implement; vLLM already supports it.
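For reference, vLLM's automatic prefix caching can be enabled roughly like this (a sketch from memory; the model name is only an example, and the exact options may differ between vLLM versions):

```python
from vllm import LLM, SamplingParams

# Prefix caching: requests that share a prefix (e.g. the same tool/system
# prompt) reuse the cached KV blocks for that prefix automatically.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

weather_prompt = "You can call get_weather(city). "
prompts = [
    weather_prompt + "What's the weather in Oslo?",
    weather_prompt + "What's the weather in Lima?",  # reuses the cached prefix
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
for out in outputs:
    print(out.outputs[0].text)
```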