-
Can you clarify more on the need for ...
-
Out of curiosity, how would you plan to handle an FP8 gate_proj? The current issue with FP8 gate_proj on NVIDIA GPUs is that the FP8 GEMM kernels require the shapes to be divisible by 16, while most MoE models have only 8 experts. On the other hand, padding to 16 experts is meaningless in terms of compute efficiency, so we currently keep gate_proj in FP16. I would like to know whether this is also the case for AMD GPUs.
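As a purely illustrative sketch (not vLLM's actual code) of the shape constraint described above, assuming gate_proj here is the MoE router weight of shape `(num_experts, hidden_size)`; `FP8_GEMM_ALIGNMENT` and `should_quantize_to_fp8` are hypothetical names used only for illustration:

```python
import torch

# FP8 GEMM kernels typically require both GEMM dimensions to be divisible by 16.
FP8_GEMM_ALIGNMENT = 16


def should_quantize_to_fp8(weight: torch.Tensor) -> bool:
    """Return True only if both GEMM dimensions satisfy the alignment constraint."""
    out_features, in_features = weight.shape[-2], weight.shape[-1]
    return (out_features % FP8_GEMM_ALIGNMENT == 0
            and in_features % FP8_GEMM_ALIGNMENT == 0)


hidden_size, num_experts = 4096, 8
router_weight = torch.empty(num_experts, hidden_size)  # (8, 4096): only 8 experts
mlp_weight = torch.empty(14336, hidden_size)           # a regular MLP projection

print(should_quantize_to_fp8(router_weight))  # False -> keep in FP16
print(should_quantize_to_fp8(mlp_weight))     # True  -> safe to quantize to FP8
```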
-
This RFC proposes and clarifies a default FP8 model interface between vLLM and FP8-quantized models built on top of HuggingFace's model definitions.
The approach in this proposal differs from known third-party quantizers (e.g. NVIDIA AMMO); future vendor-specific quantizers may or may not add support for the proposed format. This RFC also serves as documentation.
Disclaimer: many of these ideas were developed via AutoFP8 and discussions with its authors, but the proposed format below includes things not yet covered in AutoFP8; we may work towards convergence.
Scope:
- Interface content (and example of changes): the safetensors files (`model.safetensors.index.json`, etc.), to add/replace with FP8-quantized data: weights, and scaling factors for any tensor going to or coming from FP8_E4M3 format (weight scaling factors, activation scaling factors, output scaling factors, kv_cache scaling factors). A sketch of such a checkpoint is shown after the references below.

Reference:
- RFC: FP8 in vLLM #2461
- RFC: FP8 Quantization Schema in vLLM #3218
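To make the intended checkpoint content concrete, here is a minimal sketch (Python, using the `safetensors` library and PyTorch's `float8_e4m3fn` dtype) of an FP8 weight stored alongside its scaling factors; the scale tensor names (`weight_scale`, `act_scale`, `kv_scale`) are illustrative placeholders, not the finalized naming convention:

```python
import torch
from safetensors.torch import save_file, load_file

layer = "model.layers.0.self_attn.q_proj"
hidden = 4096

# Quantize a toy FP16 weight to FP8_E4M3 with a single per-tensor scale.
w_fp16 = torch.randn(hidden, hidden, dtype=torch.float16)
weight_scale = w_fp16.abs().max().float() / torch.finfo(torch.float8_e4m3fn).max
w_fp8 = (w_fp16.float() / weight_scale).to(torch.float8_e4m3fn)

tensors = {
    f"{layer}.weight": w_fp8,                 # FP8_E4M3 weight data
    f"{layer}.weight_scale": weight_scale,    # weight scaling factor
    f"{layer}.act_scale": torch.tensor(1.0),  # activation scaling factor (placeholder value)
    f"{layer}.kv_scale": torch.tensor(1.0),   # kv_cache scaling factor (placeholder value)
}
save_file(tensors, "model.safetensors")

# A consumer such as vLLM would dequantize as: weight ≈ fp8_weight * weight_scale.
loaded = load_file("model.safetensors")
w_restored = loaded[f"{layer}.weight"].float() * loaded[f"{layer}.weight_scale"]
```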