[FEAT] Improved PagedAttention FP8 (faster kvcache dequant v2) #347

tjtanaa · 2024-12-27T16:32:02Z

Description

This PR merges the optimized attention.cu kernel from https://github.com/ROCm/vllm/blob/shsanyal_develop_cpa_fp8 into the llama_fp8_12062024 branch.

CAVEAT

Currently the attention.cu kernel does not support a block size of 32 and a head size of 64.
The vLLM model unit tests are failing because they use small models (e.g. Gemma, Llama) that have a head size of 64.
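One way to keep unsupported configurations out of a test matrix is a small guard like the one below. This is only a sketch: the helper name `custom_kernel_supports` and the two sets are hypothetical, and it assumes that a block size of 32 and a head size of 64 are each unsupported on their own, which is an interpretation of the caveat above rather than something read from attention.cu.

```python
# Hypothetical limits taken from the caveat above; not read from attention.cu.
UNSUPPORTED_BLOCK_SIZES = {32}
UNSUPPORTED_HEAD_SIZES = {64}


def custom_kernel_supports(block_size: int, head_size: int) -> bool:
    """Return True if neither value is in the assumed unsupported sets."""
    return (
        block_size not in UNSUPPORTED_BLOCK_SIZES
        and head_size not in UNSUPPORTED_HEAD_SIZES
    )


# Example (block_size, head_size) combinations, chosen for illustration only.
cases = [(16, 128), (32, 128), (16, 64)]
runnable = [case for case in cases if custom_kernel_supports(*case)]
print(runnable)
```

Under these assumptions only the `(16, 128)` case remains; models with a head size of 64 would be filtered out instead of failing.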

Performance compared with Feature PR #346, which is another implementation of faster kvcache dequant

The following are benchmark_throughput results for Llama-3.1-70B with fp8 dynamic quantization and a kv-cache-dtype of fp8_e4m3, using an input length of 2048 tokens and an output length of 2048 tokens:

Branch of vllm-rocmfork	Req/s	Total Tokens/s	Output Tokens/s
main	0.29	1196.2	598.1
llama-fp8-12062024	0.28	1152.46	576.23
paged-attn-fp8 #346	0.47	1932.74	966.37
this PR	0.62	2537.03	1268.51
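The relative gains can be worked out directly from the table above. The snippet below is a small sketch that only restates the reported total tokens/s figures and divides them; the dictionary and the `speedup` helper are illustrative names, not part of the benchmark script.

```python
# Total tokens/s copied from the benchmark table above.
total_tokens_per_s = {
    "main": 1196.2,
    "llama-fp8-12062024": 1152.46,
    "paged-attn-fp8 #346": 1932.74,
    "this PR": 2537.03,
}


def speedup(candidate: str, baseline: str) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return total_tokens_per_s[candidate] / total_tokens_per_s[baseline]


for baseline in ("main", "llama-fp8-12062024", "paged-attn-fp8 #346"):
    print(f"this PR vs {baseline}: {speedup('this PR', baseline):.2f}x")
```

On these numbers, this PR delivers roughly 1.31x the total token throughput of #346 and roughly 2.12x that of main.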

tjtanaa · 2025-01-24T01:24:29Z

This PR has been dropped in favor of #385 and #372.

vllmellm added 3 commits December 20, 2024 06:06

merged paged attention fp8

cb224e8

updated unit-test and benchmark scripts

e31e05f

updated paged attention kernel

0aad2af

tjtanaa changed the title ~~[FEAT] Improved PagedAttention FP8 (faster kvcache dequant)~~ [FEAT] Improved PagedAttention FP8 (faster kvcache dequant v2) Dec 27, 2024

tjtanaa marked this pull request as draft December 27, 2024 16:34

vllmellm added 2 commits December 29, 2024 12:42

clean up deadcodes

0d11160

fix compilation bug

73f257d

tjtanaa force-pushed the paged-attn-updated branch from 3b816e7 to 73f257d January 2, 2025 08:09

sanyalington mentioned this pull request Jan 21, 2025

Faster Custom Paged Attention kernels #372

Merged

tjtanaa closed this Jan 24, 2025
