DeepSeek-Coder-V2-Lite-Instruct not working when quantized to FP8 using AutoFP8 #29

Syst3m1cAn0maly · 2024-07-10T13:25:41Z

Hi !

I quantized DeepSeek-Coder-V2-Lite-Instruct to FP8 using AutoFP8 but when I try to run it with vLLM I get the following error :

RuntimeError: "cat_cuda" not implemented for 'Float8_e4m3fn'

I ran the quantization using this script :

from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/path-to-models/DeepSeek-Coder-V2-Lite-Instruct"
quantized_model_dir = "/path-to-models/DeepSeek-Coder-V2-Lite-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Load and tokenize 512 dataset samples for calibration of activation scales
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Define quantization config with static activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)

and I got the following output :

Quantizing weights: 100%|██████████| 8809/8809 [00:00<00:00, 11539.48it/s]
Calibrating activation scales: 100%|██████████| 512/512 [11:04<00:00,  1.30s/it]
DeepseekV2ForCausalLM(
  (model): DeepseekV2Model(
    (embed_tokens): Embedding(102400, 2048)
    (layers): ModuleList(
      (0): DeepseekV2DecoderLayer(
        (self_attn): DeepseekV2Attention(
          (q_proj): FP8StaticLinear()
          (kv_a_proj_with_mqa): FP8StaticLinear()
          (kv_a_layernorm): DeepseekV2RMSNorm()
          (kv_b_proj): FP8StaticLinear()
          (o_proj): FP8StaticLinear()
          (rotary_emb): DeepseekV2YarnRotaryEmbedding()
        )
        (mlp): DeepseekV2MLP(
          (gate_proj): FP8StaticLinear()
          (up_proj): FP8StaticLinear()
          (down_proj): FP8StaticLinear()
          (act_fn): SiLU()
        )
        (input_layernorm): DeepseekV2RMSNorm()
        (post_attention_layernorm): DeepseekV2RMSNorm()
      )
      (1-26): 26 x DeepseekV2DecoderLayer(
        (self_attn): DeepseekV2Attention(
          (q_proj): FP8StaticLinear()
          (kv_a_proj_with_mqa): FP8StaticLinear()
          (kv_a_layernorm): DeepseekV2RMSNorm()
          (kv_b_proj): FP8StaticLinear()
          (o_proj): FP8StaticLinear()
          (rotary_emb): DeepseekV2YarnRotaryEmbedding()
        )
        (mlp): DeepseekV2MoE(
          (experts): ModuleList(
            (0-63): 64 x DeepseekV2MLP(
              (gate_proj): FP8StaticLinear()
              (up_proj): FP8StaticLinear()
              (down_proj): FP8StaticLinear()
              (act_fn): SiLU()
            )
          )
          (gate): MoEGate()
          (shared_experts): DeepseekV2MLP(
            (gate_proj): FP8StaticLinear()
            (up_proj): FP8StaticLinear()
            (down_proj): FP8StaticLinear()
            (act_fn): SiLU()
          )
        )
        (input_layernorm): DeepseekV2RMSNorm()
        (post_attention_layernorm): DeepseekV2RMSNorm()
      )
    )
    (norm): DeepseekV2RMSNorm()
  )
  (lm_head): Linear(in_features=2048, out_features=102400, bias=False)
)
Saving the model to /path-to-models/DeepSeek-Coder-V2-Lite-Instruct-FP8

What can I do to quantize correctly this kind of model ?

The text was updated successfully, but these errors were encountered:

mgoin · 2024-07-10T16:28:15Z

Hey @Syst3m1cAn0maly we don't support quantization in vLLM for non-Mixtral MoEs yet. We are currently undergoing a refactor to support Qwen2 and DeepSeek-V2 vllm-project/vllm#6088

Jiayi-Pan · 2024-07-10T17:07:50Z

Thank you for the efforts. Looking forward to FP8 support for DSv2❤️

robertgshaw2-neuralmagic · 2024-07-13T11:41:33Z

Thank you for the efforts. Looking forward to FP8 support for DSv2❤️

Working on this today

robertgshaw2-neuralmagic · 2024-07-14T11:02:34Z

This should do it for you:

[ Misc ] Apply MoE Refactor to Qwen2 + Deepseekv2 To Support Fp8 vllm-project/vllm#6417

Syst3m1cAn0maly · 2024-07-14T11:19:14Z

Thanks a lot.
I will try as soon as possible.
Do I need to change the settings to quantize this model properly with AutoFP8 or should it work as-is ? (I saw there was a specific setting for Mixtral models regarding MoE gates)

robertgshaw2-neuralmagic · 2024-07-14T11:21:58Z

You need to skip the routing gate:

# Define quantization config with static activation scales
quantize_config = BaseQuantizeConfig(
    quant_method="fp8", 
    activation_scheme="static",
    # skip the lm head and expert gate
    ignore_patterns=["re:.*lm_head", "re:.*gate.weight"],)

The other thing I'm not sure about is the following layers:

self_attn.kv_a_proj_with_mqa
self_attn.kv_b_proj

Im working on seeing how sensitive they are now

Syst3m1cAn0maly · 2024-07-14T11:27:03Z

Thanks, I will try with these settings.

robertgshaw2-neuralmagic · 2024-07-14T13:44:37Z

FYI - config above is good. But needed one more tweak on vllm side.

[ Misc ] Enable Quantizing All Layers of DeekSeekv2 vllm-project/vllm#6423

Syst3m1cAn0maly · 2024-07-14T14:08:40Z

@robertgshaw2-neuralmagic thanks a lot for the work

Syst3m1cAn0maly · 2024-07-15T14:33:00Z

I tested today and it now works as expected, thanks !

robertgshaw2-neuralmagic mentioned this issue Jul 14, 2024

[ Misc ] Apply MoE Refactor to Qwen2 + Deepseekv2 To Support Fp8 vllm-project/vllm#6417

Merged

Syst3m1cAn0maly closed this as completed Jul 15, 2024

Xu-Chen mentioned this issue Jul 18, 2024

[Feature] ADD Support for DeepSeek-V2-Chat #32

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DeepSeek-Coder-V2-Lite-Instruct not working when quantized to FP8 using AutoFP8 #29

DeepSeek-Coder-V2-Lite-Instruct not working when quantized to FP8 using AutoFP8 #29

Syst3m1cAn0maly commented Jul 10, 2024

mgoin commented Jul 10, 2024

Jiayi-Pan commented Jul 10, 2024

robertgshaw2-neuralmagic commented Jul 13, 2024

robertgshaw2-neuralmagic commented Jul 14, 2024 •

edited

Loading

Syst3m1cAn0maly commented Jul 14, 2024

robertgshaw2-neuralmagic commented Jul 14, 2024 •

edited

Loading

Syst3m1cAn0maly commented Jul 14, 2024

robertgshaw2-neuralmagic commented Jul 14, 2024

Syst3m1cAn0maly commented Jul 14, 2024

Syst3m1cAn0maly commented Jul 15, 2024

DeepSeek-Coder-V2-Lite-Instruct not working when quantized to FP8 using AutoFP8 #29

DeepSeek-Coder-V2-Lite-Instruct not working when quantized to FP8 using AutoFP8 #29

Comments

Syst3m1cAn0maly commented Jul 10, 2024

mgoin commented Jul 10, 2024

Jiayi-Pan commented Jul 10, 2024

robertgshaw2-neuralmagic commented Jul 13, 2024

robertgshaw2-neuralmagic commented Jul 14, 2024 • edited Loading

Syst3m1cAn0maly commented Jul 14, 2024

robertgshaw2-neuralmagic commented Jul 14, 2024 • edited Loading

Syst3m1cAn0maly commented Jul 14, 2024

robertgshaw2-neuralmagic commented Jul 14, 2024

Syst3m1cAn0maly commented Jul 14, 2024

Syst3m1cAn0maly commented Jul 15, 2024

robertgshaw2-neuralmagic commented Jul 14, 2024 •

edited

Loading

robertgshaw2-neuralmagic commented Jul 14, 2024 •

edited

Loading