-
Can you clarify more on the need for ...
-
Out of curiosity, how would you plan to handle an FP8 gate_proj? The current issue with FP8 gate_proj on NVIDIA GPUs is that the FP8 GEMM kernels require the shapes to be divisible by 16, while most MoE models have only 8 experts. On the other hand, padding to 16 experts is meaningless in terms of compute efficiency, so we currently keep gate_proj in FP16. I would like to know whether this is also the case for AMD GPUs.
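As a purely illustrative sketch (not vLLM's actual code) of the shape constraint described above, assuming gate_proj here is the MoE router weight of shape `(num_experts, hidden_size)`; `FP8_GEMM_ALIGNMENT` and `should_quantize_to_fp8` are hypothetical names used only for illustration:

```python
import torch

# FP8 GEMM kernels typically require both GEMM dimensions to be divisible by 16.
FP8_GEMM_ALIGNMENT = 16


def should_quantize_to_fp8(weight: torch.Tensor) -> bool:
    """Return True only if both GEMM dimensions satisfy the alignment constraint."""
    out_features, in_features = weight.shape[-2], weight.shape[-1]
    return (out_features % FP8_GEMM_ALIGNMENT == 0
            and in_features % FP8_GEMM_ALIGNMENT == 0)


hidden_size, num_experts = 4096, 8
router_weight = torch.empty(num_experts, hidden_size)  # (8, 4096): only 8 experts
mlp_weight = torch.empty(14336, hidden_size)           # a regular MLP projection

print(should_quantize_to_fp8(router_weight))  # False -> keep in FP16
print(should_quantize_to_fp8(mlp_weight))     # True  -> safe to quantize to FP8
```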
-
This RFC proposes and clarifies a default FP8 model interface between vLLM and FP8-quantized models built on top of HuggingFace's model definitions.
The approach in this proposal differs from known third-party quantizers (e.g. NVIDIA AMMO); future vendor-specific quantizers may or may not add support for the proposed format. This RFC also serves as documentation.
Disclaimer: many of these ideas were developed via AutoFP8 and discussions with its authors, but the proposed format below includes things not yet covered in AutoFP8; we may work towards convergence.
Scope:
- Interface content (and example of changes): the safetensors files (`model.safetensors.index.json`, etc.), to add/replace with FP8-quantized data: weights, and scaling factors for any tensor going to or coming from FP8_E4M3 format (weight scaling factors, activation scaling factors, output scaling factors, kv_cache scaling factors). A sketch of such a checkpoint is shown after the references below.

Reference:
- RFC: FP8 in vLLM #2461
- RFC: FP8 Quantization Schema in vLLM #3218
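To make the intended checkpoint content concrete, here is a minimal sketch (Python, using the `safetensors` library and PyTorch's `float8_e4m3fn` dtype) of an FP8 weight stored alongside its scaling factors; the scale tensor names (`weight_scale`, `act_scale`, `kv_scale`) are illustrative placeholders, not the finalized naming convention:

```python
import torch
from safetensors.torch import save_file, load_file

layer = "model.layers.0.self_attn.q_proj"
hidden = 4096

# Quantize a toy FP16 weight to FP8_E4M3 with a single per-tensor scale.
w_fp16 = torch.randn(hidden, hidden, dtype=torch.float16)
weight_scale = w_fp16.abs().max().float() / torch.finfo(torch.float8_e4m3fn).max
w_fp8 = (w_fp16.float() / weight_scale).to(torch.float8_e4m3fn)

tensors = {
    f"{layer}.weight": w_fp8,                 # FP8_E4M3 weight data
    f"{layer}.weight_scale": weight_scale,    # weight scaling factor
    f"{layer}.act_scale": torch.tensor(1.0),  # activation scaling factor (placeholder value)
    f"{layer}.kv_scale": torch.tensor(1.0),   # kv_cache scaling factor (placeholder value)
}
save_file(tensors, "model.safetensors")

# A consumer such as vLLM would dequantize as: weight ≈ fp8_weight * weight_scale.
loaded = load_file("model.safetensors")
w_restored = loaded[f"{layer}.weight"].float() * loaded[f"{layer}.weight_scale"]
```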