[Feature][Hardware][AMD] Enable Scaled FP8 GEMM on ROCm #6006
Conversation
[Feature][AMD] Adding fp8 Gemm Computation
Hi @HaiShaw, thanks for pushing up this chunk of work. Is there a reason you haven't tried enabling AMD explicitly through the existing "fp8" quantization backend with the current checkpoint format? It seems that within your "Fp8Fnuz" method torch._scaled_mm is actually a valid else case, so could you take advantage of its usage already in the "fp8" backend for an easier starting point?
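For illustration, here is a minimal sketch (not the PR's actual code) of the dispatch being suggested: keep torch._scaled_mm as the common else case in the existing "fp8" backend, and handle ROCm's fnuz encoding as the special case. The helper name `fp8_linear` and the per-tensor scale arguments are assumptions for this sketch.

```python
import torch

def fp8_linear(x: torch.Tensor,
               weight: torch.Tensor,        # FP8 weight in OCP e4m3fn, shape (out, in)
               weight_scale: torch.Tensor,  # per-tensor dequant scale (float32 scalar tensor)
               input_scale: torch.Tensor) -> torch.Tensor:
    if torch.version.hip is not None:
        # MI300 hardware natively uses e4m3fnuz, whose exponent bias is one
        # larger than e4m3fn's, so the same bit pattern represents half the
        # value: reinterpret the bits and double the scale to compensate.
        # (A real implementation would also remap the 0x80 pattern, which is
        # -0.0 in e4m3fn but NaN in e4m3fnuz.)
        weight = weight.view(torch.int8).view(torch.float8_e4m3fnuz)
        weight_scale = weight_scale * 2.0
        act_dtype = torch.float8_e4m3fnuz
    else:
        act_dtype = torch.float8_e4m3fn
    # Quantize the activations (clamping to the format's finite range is
    # omitted for brevity).
    x_fp8 = (x / input_scale).to(act_dtype)
    # torch._scaled_mm runs x_fp8 @ weight.t() on the FP8 units and folds the
    # two per-tensor scales back in. Note: releases before PyTorch 2.4
    # returned an (output, amax) tuple rather than a single tensor.
    return torch._scaled_mm(x_fp8, weight.t(),
                            scale_a=input_scale, scale_b=weight_scale,
                            out_dtype=x.dtype)
```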
@mgoin thanks for your question! There were a couple of reasons we did not reuse the exact same backend: beyond the different internal (HW) format and GEMM implementations, not to consider …
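To ground the "different internal (HW) format" point: the OCP e4m3fn encoding used by other FP8 hardware and the e4m3fnuz encoding used by MI300 differ in exponent bias and special values, which PyTorch exposes directly (a quick check, assuming a PyTorch build with the FP8 dtypes):

```python
import torch

# e4m3fn (OCP): exponent bias 7, NaN but no infinities, max finite value 448.
print(torch.finfo(torch.float8_e4m3fn).max)    # 448.0
# e4m3fnuz (MI300): bias 8, no negative zero, a single NaN encoding (0x80),
# and half the dynamic range at the top end.
print(torch.finfo(torch.float8_e4m3fnuz).max)  # 240.0
```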
Enable Scaled FP8 GEMM on ROCm (AMD GPU)
As part of a series of FP8 developments in vLLM, this pull request introduces the latest acceleration with FP8 computation on newer AMD hardware (MI30x and later).
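For readers new to the series, "scaled FP8 GEMM" here refers to the standard per-tensor scaled-quantization formulation (not specific to this PR): with activation and weight scales $s_x$ and $s_w$,

$$
Y \;\approx\; s_x\, s_w \left(\hat{X}\,\hat{W}^{\top}\right),
\qquad
\hat{X} = \operatorname{quant}_{\mathrm{fp8}}(X / s_x),
\quad
\hat{W} = \operatorname{quant}_{\mathrm{fp8}}(W / s_w),
$$

so the matrix product runs on the FP8 units and the two scalar multiplies restore the original magnitude.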
Design Reference:
Introducing Quark - AMD Quantizer:
Please refer to: AMD Quark landing page
Performance Tuning:
Please refer to: AMD vLLM performance tuning guide
Usage and Examples:
To get started, please refer to:
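In the meantime, a minimal usage sketch, assuming the standard vLLM Python API; the model name, parallelism, and sampling settings here are illustrative, not taken from this PR:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # example model
    quantization="fp8",        # select the FP8 backend (this PR's path on ROCm)
    kv_cache_dtype="fp8",      # optional FP8 KV cache, as benchmarked below
    tensor_parallel_size=8,    # e.g. 8x MI300X, matching the setup below
)
outputs = llm.generate(["What does FP8 change for inference?"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```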
Performance and Accuracy:
Together with the FP8 KV cache, we observed up to ~50% performance gains over the FP16 Llama2 baseline, with larger batch sizes and sequence lengths benefiting most, even with the quantized 70B model served on a single MI300X.
LLM Q&A, Llama2-70B, dataset: Open Orca, on 8 MI300X GPUs (TP=8)