
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale #6081

Merged: 13 commits into vllm-project:main on Jul 16, 2024

Conversation

@mgoin (Member) commented on Jul 3, 2024:

Since we already quantize key_cache and value_cache separately in PagedAttention, there is "free accuracy on the table" for FP8 KV cache quantization: we can use a separate per-tensor scale for each.
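To make the accuracy argument concrete, here is a minimal sketch (not the PR's kernel code) of per-tensor FP8 quantization with separate key and value scales. The helper name, tensor shapes, and dynamic ranges are made up purely for illustration:

```python
# Sketch only: key and value activations can have very different dynamic
# ranges, so sharing one kv_scale wastes FP8 range on one of the two tensors.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def per_tensor_fp8_quant(x: torch.Tensor):
    """Quantize a tensor to FP8 with its own per-tensor scale (illustrative)."""
    scale = x.abs().max().float() / FP8_MAX
    x_q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_q, scale

key = torch.randn(16, 8, 128) * 0.05   # hypothetical: keys with a small range
value = torch.randn(16, 8, 128) * 4.0  # hypothetical: values with a large range

k_q, k_scale = per_tensor_fp8_quant(key)
v_q, v_scale = per_tensor_fp8_quant(value)
# With a single shared kv_scale = max(k_scale, v_scale), the key tensor here
# would be quantized with roughly 80x coarser resolution than necessary.
print(k_scale.item(), v_scale.item())
```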

The FlashInfer FP8 attention kernel also takes separate k_scale and v_scale values, so this PR is preparation for enabling that usage. Source: https://github.com/flashinfer-ai/flashinfer/blob/dc2c76f8577d8695112b61d1fd43ef88569272ef/python/flashinfer/decode.py#L98-L101

This PR maintains backwards compatibility with FP8 model checkpoints that currently use a single kv_scale: if that is all that is available, the scale is duplicated for both key and value. However, if a checkpoint has k_scale and v_scale present on the attention module, those values are preferred.
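A rough illustration of that fallback follows. This is a sketch only; the function name and checkpoint-dict layout are hypothetical, not the exact code added to weight_utils.py:

```python
# Hypothetical sketch of the backwards-compatibility rule described above.
from typing import Dict, Tuple
import torch

def resolve_kv_scales(attn_weights: Dict[str, torch.Tensor]) -> Tuple[float, float]:
    """Return (k_scale, v_scale) for one attention layer's serialized weights."""
    if "k_scale" in attn_weights and "v_scale" in attn_weights:
        # New-style checkpoints: prefer the separate per-tensor scales.
        return attn_weights["k_scale"].item(), attn_weights["v_scale"].item()
    if "kv_scale" in attn_weights:
        # Old-style checkpoints: duplicate the single kv_scale for key and value.
        kv_scale = attn_weights["kv_scale"].item()
        return kv_scale, kv_scale
    # No scales serialized: fall back to a neutral scale of 1.0.
    return 1.0, 1.0
```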

@comaniac self-assigned this on Jul 3, 2024
@mgoin changed the title from "[Kernel][Attention] Separate Attention.kv_scale into key_scale and value_scale" to "[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale" on Jul 15, 2024
@mgoin marked this pull request as ready for review on Jul 15, 2024
@mgoin added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Jul 15, 2024
Files with resolved review threads:
- vllm/model_executor/model_loader/weight_utils.py (3 threads, outdated)
- vllm/model_executor/models/llama.py
- vllm/model_executor/models/mixtral.py
- vllm/model_executor/models/qwen2.py
@comaniac (Collaborator) left a review comment:

LGTM

@simon-mo merged commit 978aed5 into vllm-project:main on Jul 16, 2024
69 of 73 checks passed
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 17, 2024
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 19, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024