[Feature][Hardware][AMD] Enable Scaled FP8 GEMM on ROCm #6006

Open · wants to merge 94 commits into base: main
Conversation

HaiShaw
Contributor

@HaiShaw commented Jun 30, 2024

Enable Scaled FP8 GEMM on ROCm (AMD GPU)

As part of a series of FP8 developments in vLLM, this pull request introduces the latest acceleration via FP8 computation on newer AMD hardware (MI30x and later).

  • Uses the OCP FP8 inference data type (float8_e4m3fn) at the interface and file-exchange level, compatible with OCP FP8 quantized model checkpoints.
  • In addition to PTQ weights, static scaling factors are used for activations (and KV caches), obtained via a calibration process with Quark (the AMD quantizer) or Nvidia's AMMO.
  • This is a ROCm/hipBLASLt-based implementation of scaled FP8 GEMM, adding to the previously implemented scaled FP8 KV cache. When multiple weight matrices are concatenated into one larger GEMM for performance, this implementation suboptimally uses a single scaling factor rather than one per matrix (to be addressed later). Note: AMD Quark can be configured so that certain matrices are virtually merged prior to quantization.
  • Largely follows the vLLM FP8 RFC: FP8 in vLLM #2461. Specifically, linear and projection layers are covered, while FP8 computation within self-attention itself is left for future extension. The current GEMM takes FP8 inputs and defaults to float16/bfloat16 output (see the sketch after this list). Further optimizations are work in progress, including but not limited to float16/bfloat16 ingress (in-kernel conversion), direct FP8 egress to the KV cache, etc.
  • Note: this feature will not work on MI2xx or older GPUs, which lack FP8 MFMA instructions.
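
For illustration only, below is a minimal sketch of a per-tensor, static-scale FP8 GEMM in plain PyTorch. This is not this PR's hipBLASLt path: it uses the private torch._scaled_mm API discussed later in this thread, assumes PyTorch 2.4+ semantics for that call, and the helper names and calibration stand-in are hypothetical.

```python
# Sketch of per-tensor static-scale FP8 GEMM (illustrative, not the PR's
# hipBLASLt kernel). Assumes PyTorch >= 2.4 semantics for torch._scaled_mm:
# scale_a/scale_b are dequantization scales and a single tensor is returned.
import torch

FP8_DTYPE = torch.float8_e4m3fn  # OCP FP8, as used at the checkpoint interface
FP8_MAX = torch.finfo(FP8_DTYPE).max


def quantize_static(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize with a pre-calibrated (static) per-tensor scale."""
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)


def scaled_fp8_gemm(a: torch.Tensor, w: torch.Tensor,
                    a_scale: torch.Tensor, w_scale: torch.Tensor,
                    out_dtype: torch.dtype = torch.float16) -> torch.Tensor:
    # a: [M, K] activations, w: [N, K] weights; K should be a multiple of 16.
    a_fp8 = quantize_static(a, a_scale)
    w_fp8 = quantize_static(w, w_scale)
    # torch._scaled_mm expects the second operand in column-major layout,
    # hence the transpose of the row-major weight matrix.
    return torch._scaled_mm(a_fp8, w_fp8.t(),
                            scale_a=a_scale, scale_b=w_scale,
                            out_dtype=out_dtype)


if __name__ == "__main__":
    m, k, n = 16, 4096, 4096
    a = torch.randn(m, k, dtype=torch.float16, device="cuda")
    w = torch.randn(n, k, dtype=torch.float16, device="cuda")
    # Static scales would normally come from calibration (Quark/AMMO);
    # amax / FP8_MAX here merely stands in for that step.
    a_scale = (a.abs().max() / FP8_MAX).float()
    w_scale = (w.abs().max() / FP8_MAX).float()
    out = scaled_fp8_gemm(a, w, a_scale, w_scale)
    print(out.shape, out.dtype)
```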

Design Reference:

  • Note: Quark may add AutoFP8-compatible export; when it does, we will extend the support accordingly.
  • RFC: FP8 Quantization Schema in vLLM update #5802
  • RFC: FP8 Quantization Schema in vLLM #3218
  • RFC: FP8 in vLLM #2461

Introducing Quark - AMD Quantizer:

Please refer to: AMD Quark landing page

Performance Tuning:

Please refer to: AMD vLLM performance tuning guide

Usage and Examples:

To get started, please refer to:

./examples/fp8/quantizer/README.md
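
For orientation, here is a hypothetical serving sketch, not taken from the README above: it assumes a Quark- or AMMO-exported FP8 checkpoint directory and vLLM's offline LLM API with the "fp8" quantization backend and FP8 KV cache enabled. The exact flags for this ROCm path may differ; the README remains the supported flow.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/llama2-70b-fp8-checkpoint",  # placeholder checkpoint path
    quantization="fp8",          # scaled FP8 GEMM for linear/projection layers
    kv_cache_dtype="fp8",        # scaled FP8 KV cache, as in the earlier work
    tensor_parallel_size=8,      # e.g. TP=8 across MI300X GPUs
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["What is FP8 quantization?"], params)
print(outputs[0].outputs[0].text)
```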

Performance and Accuracy:

Together with the FP8 KV cache, we observed up to ~50% performance increase over the FP16 Llama2 baseline, with larger gains at larger batch sizes and sequence lengths, even with the quantized 70B model served on a single MI300X.

LLM-Q&A, Llama2-70b, dataset: OPEN ORCA on 8 MI300X GPUs (TP=8)

| GEMM Type  | Rouge-1 | Rouge-2 | Rouge-L |
|------------|---------|---------|---------|
| FP16       | 44.4860 | N/A     | 28.6992 |
| FP8 scaled | 44.5001 | 22.0853 | 28.7140 |

@mgoin
Collaborator

mgoin commented Jul 1, 2024

Hi @HaiShaw thanks for pushing up this chunk of work. Is there a reason you haven't tried enabling AMD explicitly through the existing "fp8" quantization backend with the current checkpoint format? It seems within your "Fp8Fnuz" method that torch._scaled_mm is actually a valid else case, so could you take advantage of its usage already in the "fp8" backend for an easier starting point?

@HaiShaw
Contributor Author

HaiShaw commented Jul 1, 2024

> Hi @HaiShaw thanks for pushing up this chunk of work. Is there a reason you haven't tried enabling AMD explicitly through the existing "fp8" quantization backend with the current checkpoint format? It seems within your "Fp8Fnuz" method that torch._scaled_mm is actually a valid else case, so could you take advantage of its usage already in the "fp8" backend for an easier starting point?

@mgoin thanks for your question! There were a couple of reasons we did not reuse exactly the same backend. Besides the different internal (HW) format and GEMM implementations, a main reason is that we do not consider dynamic scaling (and we prefer not to mix too much into the CUDA backend code). In terms of model loading, we started with AMMO support, now AMD Quark, and will extend to AutoFP8-compatible checkpoint support once RFC #5802 lands in Quark. Some of the discrepancy here comes from the moving nature, and varying completeness, of the several quantizers we deal with, which arise from different design ideas.
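
A small sketch of the hardware-format point, assuming "Fp8Fnuz" refers to the float8_e4m3fnuz dtype used natively on MI300-class hardware, as opposed to the OCP float8_e4m3fn used at the checkpoint interface; both dtypes are exposed by recent PyTorch builds.

```python
import torch

# Compare the OCP FP8 type (checkpoint/interface format) with the fnuz
# variant (MI300-class native compute format, per the discussion above).
for dtype in (torch.float8_e4m3fn, torch.float8_e4m3fnuz):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, min={info.min}, eps={info.eps}")

# The narrower fnuz range (max 240 vs. 448 for the OCP type) is one reason a
# backend consuming OCP-format checkpoints on this hardware must also adjust
# the per-tensor scales rather than reinterpret the bytes directly.
```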
