If you have a large prompt that you reuse across generation tasks, it's inefficient to recompute the KV cache for the entire prompt on every run. Llama.cpp currently has a feature where you can store the KV cache to disk, diff the cached prompt against the new prompt on each run, and only recompute the KV cache for the trailing part of the prompt that actually changed. Adding such a feature to TensorRT-LLM could significantly reduce latency in scenarios where only the end of the prompt changes between runs. The simplest solution would be in-memory caching; I think that would mainly require changes to handle_per_step. A more extensive solution could also support disk caching, like Llama.cpp does. I'm willing to help contribute this feature; first I just need to sort out my other issues getting a quantized model that fits on my rig 😄.
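To make the idea concrete, here is a rough Python sketch of the diff-and-reuse logic, assuming the cache is keyed per token. Names like `CachedPrompt` and `compute_kv` are hypothetical placeholders, not TensorRT-LLM (or Llama.cpp) APIs; the actual integration point would be whatever manages the KV cache inside the runtime.

```python
# Sketch only: `CachedPrompt` and `compute_kv` are hypothetical placeholders,
# not TensorRT-LLM APIs.
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class CachedPrompt:
    tokens: List[int]    # tokens the stored KV cache was computed for
    kv_cache: List[Any]  # one KV entry per token (opaque here)


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Length of the longest shared token prefix between two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuse_kv(cached: CachedPrompt,
             new_tokens: List[int],
             compute_kv: Callable[[List[int], int], List[Any]]) -> CachedPrompt:
    """Reuse cached KV for the unchanged prefix; recompute only the suffix.

    `compute_kv(tokens, start)` stands in for whatever actually runs prefill
    over `tokens[start:]` on top of the existing cache.
    """
    keep = common_prefix_len(cached.tokens, new_tokens)
    reused = cached.kv_cache[:keep]
    recomputed = compute_kv(new_tokens, keep)  # only tokens[keep:] get processed
    return CachedPrompt(tokens=new_tokens, kv_cache=reused + recomputed)
```

Disk caching would then mostly be a matter of serializing something like `CachedPrompt` between runs, so the prefix diff can be applied on the next invocation of the process.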