
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale #6081

Merged: 13 commits into vllm-project:main on Jul 16, 2024

Conversation

@mgoin (Member) commented on Jul 3, 2024:

Since we already quantize key_cache and value_cache separately in PagedAttention, there is "free accuracy on the table" for FP8 KV cache quantization: we can use a separate per-tensor scale for each.
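To make the accuracy argument concrete, here is a minimal sketch (not the PR's kernel code) of per-tensor FP8 quantization with separate key and value scales. The helper name, tensor shapes, and dynamic ranges are made up purely for illustration:

```python
# Sketch only: key and value activations can have very different dynamic
# ranges, so sharing one kv_scale wastes FP8 range on one of the two tensors.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def per_tensor_fp8_quant(x: torch.Tensor):
    """Quantize a tensor to FP8 with its own per-tensor scale (illustrative)."""
    scale = x.abs().max().float() / FP8_MAX
    x_q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_q, scale

key = torch.randn(16, 8, 128) * 0.05   # hypothetical: keys with a small range
value = torch.randn(16, 8, 128) * 4.0  # hypothetical: values with a large range

k_q, k_scale = per_tensor_fp8_quant(key)
v_q, v_scale = per_tensor_fp8_quant(value)
# With a single shared kv_scale = max(k_scale, v_scale), the key tensor here
# would be quantized with roughly 80x coarser resolution than necessary.
print(k_scale.item(), v_scale.item())
```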

The FlashInfer FP8 attention kernel also takes separate k_scale and v_scale values, so this PR is preparation for enabling that usage. Source: https://github.com/flashinfer-ai/flashinfer/blob/dc2c76f8577d8695112b61d1fd43ef88569272ef/python/flashinfer/decode.py#L98-L101

This PR maintains backwards compatibility with FP8 model checkpoints that currently use a single kv_scale: if that is all that is available, the scale is duplicated for both key and value. However, if a checkpoint has k_scale and v_scale present on the attention module, those values are preferred.
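A rough illustration of that fallback follows. This is a sketch only; the function name and checkpoint-dict layout are hypothetical, not the exact code added to weight_utils.py:

```python
# Hypothetical sketch of the backwards-compatibility rule described above.
from typing import Dict, Tuple
import torch

def resolve_kv_scales(attn_weights: Dict[str, torch.Tensor]) -> Tuple[float, float]:
    """Return (k_scale, v_scale) for one attention layer's serialized weights."""
    if "k_scale" in attn_weights and "v_scale" in attn_weights:
        # New-style checkpoints: prefer the separate per-tensor scales.
        return attn_weights["k_scale"].item(), attn_weights["v_scale"].item()
    if "kv_scale" in attn_weights:
        # Old-style checkpoints: duplicate the single kv_scale for key and value.
        kv_scale = attn_weights["kv_scale"].item()
        return kv_scale, kv_scale
    # No scales serialized: fall back to a neutral scale of 1.0.
    return 1.0, 1.0
```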

@comaniac self-assigned this on Jul 3, 2024
@mgoin changed the title from "[Kernel][Attention] Separate Attention.kv_scale into key_scale and value_scale" to "[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale" on Jul 15, 2024
@mgoin marked this pull request as ready for review on Jul 15, 2024
@mgoin added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Jul 15, 2024
Files with resolved review threads:
- vllm/model_executor/model_loader/weight_utils.py (3 threads, outdated)
- vllm/model_executor/models/llama.py
- vllm/model_executor/models/mixtral.py
- vllm/model_executor/models/qwen2.py
@comaniac (Collaborator) left a review comment:

LGTM

@simon-mo merged commit 978aed5 into vllm-project:main on Jul 16, 2024
69 of 73 checks passed
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 17, 2024
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 19, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024