Prerequisites
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
For function calling, it would be extremely useful if the prompt cache could hold multiple prompts, so that each call can select whichever cached prompt it needs.
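Roughly the usage pattern I have in mind (illustrative sketch only; `prefill`, `generate`, and the prompt names are hypothetical stand-ins, not existing APIs in this project):

```python
from typing import Dict

def prefill(prefix: str) -> Dict:
    # Hypothetical: run the model over `prefix` once and keep its KV state.
    return {"prefix": prefix, "kv": f"<kv state for {len(prefix)} chars>"}

def generate(message: str, cache: Dict) -> str:
    # Hypothetical: decode `message` on top of the cached prefix.
    return f"[reply to {message!r} using cache for {cache['prefix'][:24]}...]"

caches: Dict[str, Dict] = {}  # one cache per named prompt prefix

def get_cache(name: str, prefix: str) -> Dict:
    """Build the cache for a prompt prefix once, then reuse it on later calls."""
    if name not in caches:
        caches[name] = prefill(prefix)
    return caches[name]

# Function calling: each tool/system prompt keeps its own cache, and every
# request picks whichever cached prefix it needs.
weather = get_cache("weather_tools", "You can call get_weather(city)...")
search = get_cache("search_tools", "You can call web_search(query)...")

print(generate("What's the weather in Oslo?", cache=weather))
print(generate("Find recent papers on KV caching.", cache=search))
```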
Motivation
This would massively improve performance for function calling. I'm on an M4 Max and trying to use speculative decoding with a quantized model, which doesn't work (it's unclear whether it should; one of the pages here says speculative decoding doesn't work for quantized models). With a multi-prompt cache, I could still get a significant performance boost out of my current setup.
Possible Implementation
I don't think this would be hard to implement; vLLM already supports it.
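For reference, vLLM's automatic prefix caching can be enabled roughly like this (a sketch from memory; the model name is only an example, and the exact options may differ between vLLM versions):

```python
from vllm import LLM, SamplingParams

# Prefix caching: requests that share a prefix (e.g. the same tool/system
# prompt) reuse the cached KV blocks for that prefix automatically.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

weather_prompt = "You can call get_weather(city). "
prompts = [
    weather_prompt + "What's the weather in Oslo?",
    weather_prompt + "What's the weather in Lima?",  # reuses the cached prefix
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
for out in outputs:
    print(out.outputs[0].text)
```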