If you have a large prompt that you reuse across generation tasks, it's inefficient to recompute the KV cache for the entire prompt on every run. Llama.cpp currently has a feature where you can store the KV cache to disk, diff the cached prompt against the new prompt on each run, and only recompute the KV cache for the trailing part of the prompt that actually changed. Adding such a feature to TensorRT-LLM could significantly reduce latency in scenarios where only the end of the prompt changes between runs. The simplest solution would be in-memory caching; I think that would mainly require changes to handle_per_step. A more extensive solution could also support disk caching, like Llama.cpp does. I'm willing to help contribute this feature; first I just need to sort out my other issues getting a quantized model that fits on my rig 😄.
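To make the idea concrete, here is a rough Python sketch of the diff-and-reuse logic, assuming the cache is keyed per token. Names like `CachedPrompt` and `compute_kv` are hypothetical placeholders, not TensorRT-LLM (or Llama.cpp) APIs; the actual integration point would be whatever manages the KV cache inside the runtime.

```python
# Sketch only: `CachedPrompt` and `compute_kv` are hypothetical placeholders,
# not TensorRT-LLM APIs.
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class CachedPrompt:
    tokens: List[int]    # tokens the stored KV cache was computed for
    kv_cache: List[Any]  # one KV entry per token (opaque here)


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Length of the longest shared token prefix between two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuse_kv(cached: CachedPrompt,
             new_tokens: List[int],
             compute_kv: Callable[[List[int], int], List[Any]]) -> CachedPrompt:
    """Reuse cached KV for the unchanged prefix; recompute only the suffix.

    `compute_kv(tokens, start)` stands in for whatever actually runs prefill
    over `tokens[start:]` on top of the existing cache.
    """
    keep = common_prefix_len(cached.tokens, new_tokens)
    reused = cached.kv_cache[:keep]
    recomputed = compute_kv(new_tokens, keep)  # only tokens[keep:] get processed
    return CachedPrompt(tokens=new_tokens, kv_cache=reused + recomputed)
```

Disk caching would then mostly be a matter of serializing something like `CachedPrompt` between runs, so the prefix diff can be applied on the next invocation of the process.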