forked from vllm-project/vllm
Commit
[vllm] Add support for FP8 in Triton FA kernel
Adding support for FP8 (E4M3) in Triton FA kernel, including per-tensor scaling factors.

Test:

1. Patched rocm_flash_attn.py to call FA kernel with scaling factors (https://gist.github.com/ilia-cher/216762889331cefeb158634a651b2fac)

2. Run the benchmark:

    python3 benchmark_latency.py --model \
        /data/models/Llama-3.1-8B-Instruct-FP8-KV \
        --input-len 8192 \
        --output-len 1 \
        --batch-size 32 \
        --enforce-eager \
        --num-iters 10 \
        --num-iters-warmup 2 \
        --enable-chunked-prefill False \
        --dtype float16

   Before:

    Avg latency: 6.418297152221203 seconds
    10% percentile latency: 6.380122036673129 seconds
    25% percentile latency: 6.390297698322684 seconds
    50% percentile latency: 6.404989298898727 seconds
    75% percentile latency: 6.421127524343319 seconds
    90% percentile latency: 6.4394324975088235 seconds
    99% percentile latency: 6.562963163470849 seconds

   After:

    Avg latency: 5.162057781498879 seconds
    10% percentile latency: 5.1219399653142315 seconds
    25% percentile latency: 5.135780334530864 seconds
    50% percentile latency: 5.151887209853157 seconds
    75% percentile latency: 5.158517300733365 seconds
    90% percentile latency: 5.184290232090279 seconds
    99% percentile latency: 5.314461483638734 seconds

3. (Sanity) check using https://gist.github.com/ilia-cher/951a3d011a8bafa7c5180fbc3a151a57

4. (follow up in scaling factors loading PR) P3L perplexity check
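For readers unfamiliar with per-tensor FP8 scaling, the idea behind the scaling factors mentioned in the commit message can be sketched in plain Python. This is an illustration only, not the Triton kernel code from this commit. The helper names (`round_to_e4m3`, `per_tensor_scale`, `quantize`, `dequantize`) are hypothetical, and the maximum of 448 assumes the OCP E4M3 encoding. Some hardware uses an E4M3 variant with a different maximum, so treat that constant as an assumption.

```python
import math

# Assumption: OCP E4M3 (largest finite value 448). Other E4M3 variants differ.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_E4M3_MAX)
    _, e = math.frexp(a)          # a = m * 2**e with 0.5 <= m < 1
    e = max(e, -5)                # below 2**-6 the format is subnormal
    step = 2.0 ** (e - 4)         # 3 mantissa bits -> 8 steps per binade
    # Python's round() rounds halves to even, like round-to-nearest-even.
    return sign * min(round(a / step) * step, FP8_E4M3_MAX)


def per_tensor_scale(values):
    """One scale for the whole tensor, mapping its max magnitude to 448."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale, then round into the E4M3 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized, scale):
    """Multiply back by the scale to recover approximate original values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    tensor = [0.0, 0.5, -1.25, 3.0, -7.0]
    scale = per_tensor_scale(tensor)
    q = quantize(tensor, scale)
    print("scale:", scale)
    print("quantized:", q)
    print("dequantized:", dequantize(q, scale))
```

A single scale per tensor keeps the bookkeeping cheap: a kernel consuming FP8 inputs only needs one extra scalar per tensor to undo the scaling, at the cost of coarser resolution when a tensor has a few large outliers.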
Showing 2 changed files with 76 additions and 12 deletions.