[Feature] ADD Support for DeepSeek-V2-Chat #32

Closed
Xu-Chen opened this issue Jul 18, 2024 · 1 comment
Comments

Xu-Chen commented Jul 18, 2024

OOM occurs when quantizing the DeepSeek model on 8×A800 GPUs.
The code used comes from #29.

from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/path-to-models/DeepSeek-Coder-V2-Lite-Instruct"
quantized_model_dir = "/path-to-models/DeepSeek-Coder-V2-Lite-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Load and tokenize 512 dataset samples for calibration of activation scales
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Define quantization config with static activation scales
quantize_config = BaseQuantizeConfig(
    quant_method="fp8", 
    activation_scheme="static",
    # skip the lm head and expert gate
    ignore_patterns=["re:.*lm_head", "re:.*gate.weight"])

# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)

Is there any way to quantize such a large model?

Xu-Chen commented Jul 18, 2024

Try the following code; it worked for me:

import torch

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="dynamic",
    # skip the lm head and expert gate
    ignore_patterns=["re:.*lm_head", "re:.*gate.weight"])

# Cap per-GPU usage so the model shards sequentially across all 8 GPUs
max_memory = {i: "75GB" for i in range(8)}
model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir,
    quantize_config,
    device_map="sequential",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    max_memory=max_memory,
    attn_implementation="flash_attention_2",
)

# The dynamic activation scheme needs no calibration samples
model.quantize([])
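
For reference, the saved FP8 checkpoint can then be served with vLLM. A minimal sketch, assuming a vLLM build with FP8 support is installed; the prompt, sampling parameters, and tensor_parallel_size=8 are illustrative, and the model path reuses the placeholder from the snippet above:

from vllm import LLM, SamplingParams

# Illustrative serving example: load the FP8 checkpoint produced above
llm = LLM(
    model="/path-to-models/DeepSeek-Coder-V2-Lite-Instruct-FP8",
    quantization="fp8",
    tensor_parallel_size=8,
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Write a quicksort function in Python."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)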

Xu-Chen closed this as completed Jul 18, 2024