In this scenario, I want to use a system prompt cache together with a continuous KV cache in the context phase.
I have 14 precomputed key-value pairs stored in the continuous KV cache, which has shape [batch_size, 2, num_kv_heads, max_seq_len, head_size]. When I first compute attention with the GPT attention plugin, I pass this cache as the past_key_value parameter and set context_lengths, sequence_length, and host_past_key_value_lengths to 14.
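To make the setup concrete, here is a minimal sketch of how the buffers described above could be laid out. This is not a complete TensorRT-LLM example; the sizes (batch_size, num_kv_heads, max_seq_len, head_size) are illustrative assumptions, and only the tensor shapes and the length values of 14 come from the description above.

```python
# Illustrative sketch only: builds the buffers described above with
# hypothetical sizes so the shapes and lengths are explicit.
import torch

batch_size   = 1      # assumed
num_kv_heads = 8      # assumed
max_seq_len  = 128    # assumed
head_size    = 64     # assumed
prefix_len   = 14     # number of precomputed system-prompt tokens

# Continuous KV cache: [batch_size, 2, num_kv_heads, max_seq_len, head_size].
# Slots [:, :, :, :prefix_len, :] are assumed to already hold the
# precomputed system-prompt key-value pairs.
past_key_value = torch.zeros(
    batch_size, 2, num_kv_heads, max_seq_len, head_size, dtype=torch.float16
)

# Length tensors passed to the GPT attention plugin, all set to 14 as described.
context_lengths = torch.full((batch_size,), prefix_len, dtype=torch.int32)
sequence_length = torch.full((batch_size,), prefix_len, dtype=torch.int32)
host_past_key_value_lengths = torch.full(
    (batch_size,), prefix_len, dtype=torch.int32
)

print(past_key_value.shape, context_lengths, sequence_length,
      host_past_key_value_lengths)
```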
However, in the context phase I find that the precomputed values in the KV cache are not used during the attention computation: the attention results show that the GPT attention plugin ignores them. In addition, the present KV values computed by the plugin are written to kv_cache[0:14] instead of kv_cache[14:28].
So I wonder: how can I use a continuous KV cache with prefix prompt caching in the GPT attention plugin during the context phase?