
Question: is key_state_compressed used for inference? #24

Open
jq-wei opened this issue Nov 20, 2024 · 1 comment

Comments


jq-wei commented Nov 20, 2024

Hi,

Thanks for the great contribution!

I have a question about the usage of key_states_compress. If I understand correctly, key_states_compress holds the top-k tokens (clusters) selected from the prompt during the prefilling stage. Then, during inference, a new query should only compute attention against key_states_compress plus the newly generated key states. However, I see that flash-attn uses the full prompt's key_states, and key_states_compress is not used. Is this intended, or am I missing something?
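
For concreteness, here is a minimal single-head sketch of the decoding step I would expect (my own illustration with made-up names like decode_step, not the repo's actual code):

```python
import torch
import torch.nn.functional as F

def decode_step(query, key_cache, value_cache, new_key, new_value):
    """One decoding step: attend over the compressed prompt KV plus
    the keys/values generated so far, then append the new KV pair.

    Shapes (single head for simplicity):
      query:                  (1, d)
      key_cache/value_cache:  (t, d)  -- starts as key_states_compress
      new_key/new_value:      (1, d)
    """
    # Append the newly generated KV pair to the (compressed) cache.
    key_cache = torch.cat([key_cache, new_key], dim=0)
    value_cache = torch.cat([value_cache, new_value], dim=0)

    # Attention over compressed-prompt keys + newly generated keys only;
    # the full prompt key_states never enters this computation.
    scores = query @ key_cache.T / key_cache.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)
    out = attn @ value_cache
    return out, key_cache, value_cache
```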

Thank you!


jq-wei commented Nov 20, 2024

In particular, after prefilling there is one attention loop over seq_len - (self.max_capacity_prompt) + 1 tokens. What is this loop for?

After that, decoding starts, but it seems to use the full KV cache.
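
(For example, with seq_len = 4096 and self.max_capacity_prompt = 1024, that loop would run 4096 - 1024 + 1 = 3073 times, which looks like one attention step per prompt token beyond the cache budget, if I'm reading it right.)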
