
[FEAT] Improved PagedAttention FP8 (faster kvcache dequant v2) #347

Draft · wants to merge 5 commits into base branch llama_fp8_12062024 from paged-attn-updated

Conversation

@tjtanaa commented on Dec 27, 2024

Description

This PR merges the optimized attention.cu kernel from https://github.com/ROCm/vllm/blob/shsanyal_develop_cpa_fp8 into the llama_fp8_12062024 branch.

CAVEAT

Currently the attention.cu kernel does not support a block size of 32 or a head size of 64.
The vLLM model unit tests fail because they use small models (e.g. Gemma, Llama) that have a head size of 64.
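A minimal Python sketch of how such a shape guard could look is given below. This is not the PR's actual dispatch logic; the supported-shape sets and all names are assumptions for illustration only.

```python
# Hypothetical sketch: gate the optimized FP8 paged-attention path on the
# shapes it can handle, falling back to the generic kernel otherwise.
# The supported-shape sets below are assumptions, not the kernel's real limits.
_SUPPORTED_BLOCK_SIZES = {16}    # block size 32 is not supported yet
_SUPPORTED_HEAD_SIZES = {128}    # head size 64 is not supported yet


def can_use_fp8_paged_attention(block_size: int, head_size: int,
                                kv_cache_dtype: str) -> bool:
    """Return True only when the optimized attention.cu kernel applies."""
    return (kv_cache_dtype == "fp8_e4m3"
            and block_size in _SUPPORTED_BLOCK_SIZES
            and head_size in _SUPPORTED_HEAD_SIZES)
```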

Performance compared to feature PR #346, which is another implementation of faster kv-cache dequantization

The following are benchmark_throughput results for Llama-3.1-70B with FP8 dynamic quantization and kv-cache-dtype fp8_e4m3, using an input length of 2048 tokens and an output length of 2048 tokens:

| Branch of vLLM ROCm fork | Req/s | Total tokens/s | Output tokens/s |
| --- | --- | --- | --- |
| main | 0.29 | 1196.2 | 598.1 |
| llama-fp8-12062024 | 0.28 | 1152.46 | 576.23 |
| paged-attn-fp8 (#346) | 0.47 | 1932.74 | 966.37 |
| this PR | 0.62 | 2537.03 | 1268.51 |
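For context, the numbers above could be approximated with vLLM's offline Python API rather than the benchmark_throughput script; the sketch below assumes a model checkpoint name, tensor-parallel degree, request count, and prompt construction that are not stated in this PR.

```python
# Rough throughput sketch (assumptions noted inline), not the exact
# benchmark_throughput invocation used for the table above.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed checkpoint
    quantization="fp8",                          # FP8 dynamic quantization
    kv_cache_dtype="fp8_e4m3",                   # FP8 KV cache, as benchmarked
    tensor_parallel_size=8,                      # assumed TP degree
)

# 2048 output tokens per request; ignore_eos keeps the length fixed.
params = SamplingParams(max_tokens=2048, ignore_eos=True)

# Placeholder prompts; the real benchmark builds 2048-token inputs.
prompts = ["Hello " * 1024] * 64

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Requests/s:      {len(outputs) / elapsed:.2f}")
print(f"Output tokens/s: {generated / elapsed:.2f}")
```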

@tjtanaa changed the title from "[FEAT] Improved PagedAttention FP8 (faster kvcache dequant)" to "[FEAT] Improved PagedAttention FP8 (faster kvcache dequant v2)" on Dec 27, 2024
@tjtanaa marked this pull request as a draft on December 27, 2024 16:34
@tjtanaa force-pushed the paged-attn-updated branch from 3b816e7 to 73f257d on January 2, 2025 08:09