
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin #5975

Merged: 19 commits into vllm-project:main on Jul 3, 2024

Conversation


@mgoin (Collaborator) commented on Jun 28, 2024

This work expands FP8 support in vLLM from GPUs with native FP8 hardware (Hopper and Ada Lovelace) to GPUs without it (currently Ampere) by introducing FP8 Marlin, a fast matrix-multiplication kernel that fuses FP8-to-BF16/FP16 weight dequantization into the GEMM.

Key features:

  • Enables FP8 quantization on a wider range of GPUs (SM 8.0 and 8.6, i.e. Ampere)
  • Improves performance by up to 2x in memory-bound scenarios
  • Maintains accuracy comparable to FP16 baselines
  • Reduces weight memory usage by 2x, allowing larger batches
  • Simple to use: specify quantization="fp8" at runtime or use a pre-quantized FP8 checkpoint (see the usage sketch below)
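
For reference, a minimal usage sketch against vLLM's offline API; the model name is only an example, and any FP16 checkpoint (quantized on the fly) or pre-quantized FP8 checkpoint should work the same way on Ampere after this PR:

```python
from vllm import LLM, SamplingParams

# Dynamically quantize an FP16 model to FP8 weights at load time.
# With this PR the same flag also works on Ampere GPUs (e.g. A100, A10),
# not just Hopper / Ada Lovelace.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

outputs = llm.generate(
    ["The quick brown fox"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```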

Implementation details:

  • Based on existing 8-bit integer support in GPTQ Marlin kernel
  • Packs FP8 weights into int32 words (GPTQ format), then permutes them into the Marlin layout
  • Efficient 4xFP8 to 4xFP16/BF16 dequantization using bit arithmetic and SIMT operations (sketched below)
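
To make the two bullets above concrete, here is a NumPy sketch of the packing and of one common FP8 (E4M3) to FP16 bit trick. This is only an illustration of the idea, not the kernel code: the actual kernel performs the equivalent steps with fused PTX instructions inside the Marlin mainloop, also applies the Marlin-specific weight permutation that is omitted here, and the helper names below are hypothetical.

```python
import numpy as np

def pack_fp8_to_int32(fp8_bytes: np.ndarray) -> np.ndarray:
    """Pack raw FP8 E4M3 bytes (uint8, last dim divisible by 4) into 32-bit words."""
    b = fp8_bytes.astype(np.uint32).reshape(*fp8_bytes.shape[:-1], -1, 4)
    words = b[..., 0] | (b[..., 1] << 8) | (b[..., 2] << 16) | (b[..., 3] << 24)
    return words.view(np.int32)  # GPTQ-style int32 storage (Marlin permutation omitted)

def dequant_fp8_word_to_fp16(words: np.ndarray) -> np.ndarray:
    """Expand each int32 word (4 packed FP8 values) to 4 FP16 values with bit tricks."""
    words = words.view(np.uint32)
    halves = []
    for i in range(4):
        byte = ((words >> (8 * i)) & 0xFF).astype(np.uint16)
        # Move the sign to bit 15 and the 7 exponent/mantissa bits to bits 13..7.
        # Reinterpreted as FP16 this equals the FP8 value scaled by 2^-8
        # (E4M3 bias 7 vs FP16 bias 15), so one multiply by 256 (which a kernel
        # can fold into the weight scale) restores the magnitude.
        # Note: the E4M3 NaN encoding is not handled in this sketch.
        bits = ((byte & 0x80) << 8) | ((byte & 0x7F) << 7)
        halves.append(bits.view(np.float16) * np.float16(256))
    return np.stack(halves, axis=-1)
```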

End-to-end performance and accuracy results (graphs attached in the original PR):
  • FP8 Marlin A10 E2E Latency in vLLM
  • FP8 Marlin A100 E2E Latency in vLLM
  • GSM8k lm-eval with FP8 Marlin in vLLM

Individual layer sweeps (graphs attached in the original PR):
  • A10 Layer-wise Sweep: PyTorch FP16 vs FP8 Marlin MatMul
  • A100 Layer-wise Sweep: PyTorch FP16 vs FP8 Marlin MatMul

As shown in the graphs, FP8 Marlin can provide significant speedups with minimal accuracy impact. Performance gains are larger on GPUs with lower memory bandwidth (A10, RTX 3090) and for larger models.

Notes:

  • This weight-only approach differs slightly from the existing W8A8 FP8 quantization and offers higher accuracy, because the activations do not need to be quantized
  • The per-tensor weight scales are currently expanded to channelwise scales (see the sketch after this list); future work will revert to true per-tensor scales
  • This does not include support for MoE models.
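
As a minimal sketch of the scale-handling note above (helper name hypothetical): the single per-tensor weight scale is simply broadcast to one scale per output channel so the kernel's existing channelwise epilogue can be reused.

```python
import torch

def expand_scale_to_channelwise(per_tensor_scale: torch.Tensor,
                                out_features: int) -> torch.Tensor:
    """Broadcast one per-tensor weight scale to a per-output-channel scale vector."""
    # Hypothetical helper: trades a little extra scale storage for reuse of the
    # existing channelwise dequantization path in the Marlin kernel.
    return per_tensor_scale.reshape(1).expand(out_features).contiguous()
```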

Testing:

  • Tested on H100, A100, and A10 GPUs

This enhancement enables more users to benefit from FP8 quantization without hardware restrictions, improving vLLM's performance and efficiency across a broader range of setups!

@mgoin changed the title from "[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin #331" to "[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin" on Jun 28, 2024
@robertgshaw2-neuralmagic (Collaborator) commented:

This is an awesome feature!

@comaniac (Collaborator) left a comment:
Overall LGTM. Thanks!

tests/kernels/test_marlin_gemm.py (review thread resolved)
@mgoin enabled auto-merge (squash) on July 3, 2024 at 16:30
@mgoin merged commit 47f0954 into vllm-project:main on Jul 3, 2024
70 checks passed