[vllm] Add support for FP8 in Triton FA kernel #301

Merged: 2 commits into develop from attn_fp8 on Dec 4, 2024

Conversation

@ilia-cher commented on Dec 4, 2024

Adds support for FP8 (E4M3) in the Triton FA kernel, including per-tensor scaling factors.
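
For context, here is a minimal PyTorch sketch of the per-tensor scaling idea (illustrative only, not the Triton kernel itself; the helper names are made up for this example):

    import torch

    FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

    def quantize_per_tensor_fp8(x: torch.Tensor):
        # One scale for the whole tensor, chosen so its largest value maps
        # onto the FP8 E4M3 range.
        scale = x.abs().max().float().clamp(min=1e-12) / FP8_E4M3_MAX
        x_fp8 = (x.float() / scale).to(torch.float8_e4m3fn)
        return x_fp8, scale

    def attention_fp8_reference(q8, k8, v8, q_scale, k_scale, v_scale):
        # The fused Triton kernel keeps q/k/v in FP8 and folds the per-tensor
        # scales into the matmul results; this reference simply dequantizes
        # up front and runs plain attention in float32.
        q = q8.to(torch.float32) * q_scale
        k = k8.to(torch.float32) * k_scale
        v = v8.to(torch.float32) * v_scale
        p = torch.softmax((q @ k.transpose(-1, -2)) * q.shape[-1] ** -0.5, dim=-1)
        return p @ v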

Test:

1. Patched rocm_flash_attn.py to call the FA kernel with scaling factors
   (https://gist.github.com/ilia-cher/216762889331cefeb158634a651b2fac)

2. Ran the benchmark:
   python3 benchmark_latency.py --model \
      /data/models/Llama-3.1-8B-Instruct-FP8-KV \
      --input-len 8192 \
      --output-len 1 \
      --batch-size 32 \
      --enforce-eager \
      --num-iters 10 \
      --num-iters-warmup 2 \
      --enable-chunked-prefill False \
      --dtype float16
Before:
Avg latency: 6.418297152221203 seconds
10% percentile latency: 6.380122036673129 seconds
25% percentile latency: 6.390297698322684 seconds
50% percentile latency: 6.404989298898727 seconds
75% percentile latency: 6.421127524343319 seconds
90% percentile latency: 6.4394324975088235 seconds
99% percentile latency: 6.562963163470849 seconds

After (average latency down ~20%, from 6.42 s to 5.16 s):
Avg latency: 5.162057781498879 seconds
10% percentile latency: 5.1219399653142315 seconds
25% percentile latency: 5.135780334530864 seconds
50% percentile latency: 5.151887209853157 seconds
75% percentile latency: 5.158517300733365 seconds
90% percentile latency: 5.184290232090279 seconds
99% percentile latency: 5.314461483638734 seconds

3. Sanity check using
   https://gist.github.com/ilia-cher/951a3d011a8bafa7c5180fbc3a151a57
   (see the illustrative comparison sketch after this list)

4. P3L perplexity check (follow-up in the scaling-factors loading PR)
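
The sanity check in step 3 boils down to comparing the FP8 path against a full-precision reference. A rough, self-contained illustration of that kind of comparison (reusing the helper functions from the sketch above; this is not the linked gist and does not call the Triton kernel):

    import torch

    torch.manual_seed(0)
    B, H, S, D = 2, 8, 128, 64
    q = torch.randn(B, H, S, D)
    k = torch.randn(B, H, S, D)
    v = torch.randn(B, H, S, D)

    # Full-precision reference: scales of 1.0 turn the helper into plain
    # float32 attention.
    ref = attention_fp8_reference(q, k, v, 1.0, 1.0, 1.0)

    # FP8 (E4M3) path with per-tensor scales.
    q8, qs = quantize_per_tensor_fp8(q)
    k8, ks = quantize_per_tensor_fp8(k)
    v8, vs = quantize_per_tensor_fp8(v)
    out = attention_fp8_reference(q8, k8, v8, qs, ks, vs)

    # FP8 introduces quantization error, so compare magnitudes rather than
    # expecting exact equality.
    print("max abs diff:", (out - ref).abs().max().item())
    print("mean abs diff:", (out - ref).abs().mean().item())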

@shajrawi (Collaborator) left a comment:

Brilliant work Ilia, Kudos!

ilia-cher merged commit 97fd542 into develop on Dec 4, 2024
7 of 8 checks passed
gshtras deleted the attn_fp8 branch on December 7, 2024