Ingest FP8 attn scales and use them in ROCm FlashAttention #338
Conversation
Force-pushed from 4e42946 to 9ba2fab
@@ -428,7 +434,9 @@ def load_weights(self, weights: Iterable[Tuple[str,
                 param = params_dict[scale_name]
                 weight_loader = getattr(param, "weight_loader",
                                         default_weight_loader)
-                loaded_weight = loaded_weight[0]
+                if loaded_weight.shape:
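For context, the added guard distinguishes 0-dim scalar tensors from 1-element tensors when ingesting scales; indexing a 0-dim tensor with `[0]` raises an error. A minimal sketch of the behavior (hypothetical tensor values, not the actual checkpoint contents):

```python
import torch

# A serialized scale can arrive either as a 0-dim scalar tensor or as a
# 1-element tensor; only the latter can be indexed with [0].
scalar_scale = torch.tensor(0.02)    # shape == torch.Size([])  -> falsy
vector_scale = torch.tensor([0.02])  # shape == torch.Size([1]) -> truthy

for loaded_weight in (scalar_scale, vector_scale):
    if loaded_weight.shape:
        # Unwrap only when there is a dimension to index into.
        loaded_weight = loaded_weight[0]
    print(loaded_weight.item())  # 0.02 in both cases
```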
Same for mllama.py?
I targeted only the models that apply this logic unconditionally.
Without explicitly disabling VLLM_USE_ROCM_FP8_ATTN, Quark-quantized models (amd/Meta-Llama-3.1-70B-Instruct-FP8-KV) now fail with a Triton exception:
python: /root/.triton/llvm/llvm-c08c6a71-ubuntu-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From&) [with To = mlir::detail::TypedValue<mlir::RankedTensorType>; From = mlir::OpResult]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.
EDIT: @ilia-cher identified the issue and provided a simple fix that works on older Triton. Upgrading to the latest Triton is still recommended.
Force-pushed from 9639307 to 1ed1389 (…ntion for dynamic quantization)
Thanks to the work of @ilia-cher in #301, Triton FA supports per-tensor quantized FP8 for almost everything: the first and second GEMMs and the attention output are quantized, which requires per-tensor quantized Q, K, V, and softmax(QK^T), along with their corresponding scales.
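As a rough illustration of what "per-tensor quantized with a corresponding scale" means here, a minimal sketch assuming the usual amax-based scale; the helper names and the `float8_e4m3fnuz` dtype are illustrative, not vLLM internals:

```python
import torch

# Per-tensor FP8 quantization sketch: one scale per tensor, computed from the
# tensor's max magnitude, applied on quantize and undone on dequantize.
FP8 = torch.float8_e4m3fnuz  # illustrative FP8 dtype; the exact format is hardware-dependent

def quantize_per_tensor(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    scale = x.abs().max() / torch.finfo(FP8).max   # amax / fp8_max
    return (x / scale).to(FP8), scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) * scale

q = torch.randn(4, 8)
q_fp8, q_scale = quantize_per_tensor(q)
# Round-trip error stays small because the whole tensor shares one scale.
print((dequantize(q_fp8, q_scale) - q).abs().max())
```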
This PR enables the aforementioned quantization routines in Triton FA and ROCm PA when a quantized (text-only) Llama model contains attention output scales and the environment variable VLLM_USE_ROCM_FP8_FLASH_ATTN is set to True or 1 (off by default). Extending this to other model architectures is straightforward but is not done for now. Accuracy might dip for Triton FA if not all scales are present in the model.
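A hypothetical end-to-end usage sketch under this PR's description: the environment variable name and the model ID come from this thread, while the vLLM Python API call and the `kv_cache_dtype` choice are generic assumptions rather than something this PR prescribes.

```python
import os

# Opt in before the engine starts; the flag is off by default per the description.
os.environ["VLLM_USE_ROCM_FP8_FLASH_ATTN"] = "1"

from vllm import LLM, SamplingParams

# Quark-quantized Llama checkpoint with attention output scales,
# as discussed earlier in this thread.
llm = LLM(model="amd/Meta-Llama-3.1-70B-Instruct-FP8-KV",
          kv_cache_dtype="fp8")  # assumption: an FP8 KV cache is also desired

out = llm.generate(["The capital of France is"],
                   SamplingParams(max_tokens=8, temperature=0.0))
print(out[0].outputs[0].text)
```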