Supports W8A8 quantization for more models #2850

Merged (2 commits, Dec 4, 2024)
docs/en/supported_models/supported_models.md (48 changes: 24 additions & 24 deletions)

@@ -51,47 +51,47 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
| Llama | 7B - 65B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
-| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | No | - |
-| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | No | - |
-| Llama3.2-VL | 11B, 90B | MLLM | Yes | Yes | Yes | No | - |
-| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | - |
+| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
+| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes | Yes |
+| Llama3.2-VL | 11B, 90B | MLLM | Yes | Yes | Yes | - | - |
+| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes | No |
| Baichuan2 | 13B | LLM | Yes | Yes | Yes | No | No |
| ChatGLM2 | 6B | LLM | Yes | Yes | Yes | No | No |
| Falcon | 7B - 180B | LLM | Yes | Yes | Yes | No | No |
-| YI | 6B - 34B | LLM | Yes | Yes | Yes | No | Yes |
-| Mistral | 7B | LLM | Yes | Yes | Yes | No | No |
+| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes | Yes |
+| Mistral | 7B | LLM | Yes | Yes | Yes | Yes | Yes |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | No | No |
-| QWen | 1.8B - 72B | LLM | Yes | Yes | Yes | No | Yes |
-| QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | No | Yes |
+| QWen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes | Yes |
+| QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | Yes | Yes |
| QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
-| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | No | Yes |
Review comment (Collaborator): where is qwen2.5

Reply (Collaborator, PR author): Qwen2.5 shares the same structure as Qwen2.

Reply (Collaborator): Qwen2.5 is mentioned in #2849 @zhulinJulia24
+| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | No |
| DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | No | No | No | No |
| MiniCPM3 | 4B | LLM | Yes | Yes | Yes | No | No |
-| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | No | Yes |
+| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | Yes | Yes |
| Gemma | 2B-7B | LLM | Yes | Yes | Yes | No | No |
| Dbrx | 132B | LLM | Yes | Yes | Yes | No | No |
| StarCoder2 | 3B-15B | LLM | Yes | Yes | Yes | No | No |
-| Phi-3-mini | 3.8B | LLM | Yes | Yes | Yes | No | Yes |
-| Phi-3-vision | 4.2B | MLLM | Yes | Yes | Yes | No | - |
-| CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | No | - |
-| CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | No | - |
-| LLaVA(1.5,1.6) | 7B-34B | MLLM | Yes | Yes | Yes | No | - |
-| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
-| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | No | - |
-| Mono-InternVL | 2B | MLLM | Yes\* | Yes | Yes | No | - |
-| ChemVLM | 8B-26B | MLLM | Yes | Yes | No | No | - |
-| Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | No | - |
+| Phi-3-mini | 3.8B | LLM | Yes | Yes | Yes | Yes | Yes |
+| Phi-3-vision | 4.2B | MLLM | Yes | Yes | Yes | - | - |
+| CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | - | - |
+| CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | - | - |
+| LLaVA(1.5,1.6) | 7B-34B | MLLM | Yes | Yes | Yes | - | - |
+| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | Yes | Yes |
+| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | - | - |
+| Mono-InternVL | 2B | MLLM | Yes\* | Yes | Yes | - | - |
+| ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
+| Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | - | - |
| GLM4 | 9B | LLM | Yes | Yes | Yes | No | No |
| GLM-4V | 9B | MLLM | Yes | Yes | Yes | No | No |
-| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | No | - |
-| Phi-3.5-mini | 3.8B | LLM | Yes | Yes | No | No | - |
-| Phi-3.5-MoE | 16x3.8B | LLM | Yes | Yes | No | No | - |
-| Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | No | - |
+| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | - | - |
+| Phi-3.5-mini | 3.8B | LLM | Yes | Yes | No | - | - |
+| Phi-3.5-MoE | 16x3.8B | LLM | Yes | Yes | No | - | - |
+| Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | - | - |

```{note}
* Currently Mono-InternVL does not support FP16 due to numerical instability. Please use BF16 instead.
```

(remaining diff lines collapsed)
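The column that flips from "No" to "Yes" in these rows is presumably the W8A8 one (the subject of this PR), which is served by the PyTorch engine. As a minimal, hedged sketch of what using such a checkpoint looks like (the path below is a placeholder for a directory produced by `lmdeploy lite smooth_quant`, not something from this PR):

```python
# Hedged sketch, not part of this PR: running a W8A8 checkpoint with the
# PyTorch engine. './mistral-7b-w8a8' is a placeholder for a directory
# produced by `lmdeploy lite smooth_quant`.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline('./mistral-7b-w8a8', backend_config=PytorchEngineConfig())
print(pipe(['Hello, introduce yourself.']))
```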
docs/zh_cn/supported_models/supported_models.md (48 changes: 24 additions & 24 deletions)

@@ -51,47 +51,47 @@ The turbomind engine does not support window attention. Therefore, for models that apply window att
| Llama | 7B - 65B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
-| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | No | - |
-| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | No | - |
-| Llama3.2-VL | 11B, 90B | MLLM | Yes | Yes | Yes | No | - |
-| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | - |
+| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
+| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes | Yes |
+| Llama3.2-VL | 11B, 90B | MLLM | Yes | Yes | Yes | - | - |
+| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes | No |
| Baichuan2 | 13B | LLM | Yes | Yes | Yes | No | No |
| ChatGLM2 | 6B | LLM | Yes | Yes | Yes | No | No |
| Falcon | 7B - 180B | LLM | Yes | Yes | Yes | No | No |
-| YI | 6B - 34B | LLM | Yes | Yes | Yes | No | Yes |
-| Mistral | 7B | LLM | Yes | Yes | Yes | No | No |
+| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes | Yes |
+| Mistral | 7B | LLM | Yes | Yes | Yes | Yes | Yes |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | No | No |
Review comment (Collaborator): Mixtral cannot?

Reply (Collaborator, PR author): The PyTorch engine does not support it yet.
-| QWen | 1.8B - 72B | LLM | Yes | Yes | Yes | No | Yes |
-| QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | No | Yes |
+| QWen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes | Yes |
+| QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | Yes | Yes |
| QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
-| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | No | Yes |
+| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | No |
| DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | No | No | No | No |
| MiniCPM3 | 4B | LLM | Yes | Yes | Yes | No | No |
-| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | No | Yes |
+| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | Yes | Yes |
| Gemma | 2B-7B | LLM | Yes | Yes | Yes | No | No |
| Dbrx | 132B | LLM | Yes | Yes | Yes | No | No |
| StarCoder2 | 3B-15B | LLM | Yes | Yes | Yes | No | No |
-| Phi-3-mini | 3.8B | LLM | Yes | Yes | Yes | No | Yes |
-| Phi-3-vision | 4.2B | MLLM | Yes | Yes | Yes | No | - |
-| CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | No | - |
-| CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | No | - |
-| LLaVA(1.5,1.6) | 7B-34B | MLLM | Yes | Yes | Yes | No | - |
-| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
-| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | No | - |
-| Mono-InternVL | 2B | MLLM | Yes\* | Yes | Yes | No | - |
-| ChemVLM | 8B-26B | MLLM | Yes | Yes | No | No | - |
-| Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | No | - |
+| Phi-3-mini | 3.8B | LLM | Yes | Yes | Yes | Yes | Yes |
+| Phi-3-vision | 4.2B | MLLM | Yes | Yes | Yes | - | - |
+| CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | - | - |
+| CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | - | - |
+| LLaVA(1.5,1.6) | 7B-34B | MLLM | Yes | Yes | Yes | - | - |
+| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | Yes | Yes |
+| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | - | - |
+| Mono-InternVL | 2B | MLLM | Yes\* | Yes | Yes | - | - |
+| ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
+| Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | - | - |
| GLM4 | 9B | LLM | Yes | Yes | Yes | No | No |
| GLM-4V | 9B | MLLM | Yes | Yes | Yes | No | No |
-| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | No | - |
-| Phi-3.5-mini | 3.8B | LLM | Yes | Yes | No | No | - |
-| Phi-3.5-MoE | 16x3.8B | LLM | Yes | Yes | No | No | - |
-| Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | No | - |
+| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | - | - |
+| Phi-3.5-mini | 3.8B | LLM | Yes | Yes | No | - | - |
+| Phi-3.5-MoE | 16x3.8B | LLM | Yes | Yes | No | - | - |
+| Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | - | - |

```{note}
* Currently Mono-InternVL does not support FP16 due to numerical instability. Please use BF16 instead.
```

(remaining diff lines collapsed)
lmdeploy/lite/apis/calibrate.py (3 changes: 3 additions & 0 deletions)

@@ -27,6 +27,7 @@
'ChatGLMForConditionalGeneration': 'GLMBlock',
'MixtralForCausalLM': 'MixtralDecoderLayer',
'Qwen2VLForConditionalGeneration': 'Qwen2VLDecoderLayer',
+'MistralForCausalLM': 'MistralDecoderLayer',
}

NORM_TYPE_MAP = {
@@ -44,6 +45,7 @@
'ChatGLMForConditionalGeneration': 'RMSNorm',
'MixtralForCausalLM': 'MixtralRMSNorm',
'Qwen2VLForConditionalGeneration': 'Qwen2RMSNorm',
+'MistralForCausalLM': 'MistralRMSNorm',
}

HEAD_NAME_MAP = {
@@ -61,6 +63,7 @@
'ChatGLMForConditionalGeneration': 'output_layer',
'MixtralForCausalLM': 'lm_head',
'Qwen2VLForConditionalGeneration': 'lm_head',
+'MistralForCausalLM': 'lm_head',
}


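These three maps key the calibration pass off the model's architecture class name, so adding `MistralForCausalLM` is what lets the calibrator find Mistral's decoder layers, RMSNorm modules, and output head. A rough illustration of how maps like these are typically consulted; the helper names below are illustrative, not lmdeploy's actual code:

```python
# Illustrative sketch only, not lmdeploy's actual calibration code.
from torch import nn

# Architecture name -> module class name, as extended by this PR.
LAYER_TYPE_MAP = {'MistralForCausalLM': 'MistralDecoderLayer'}
NORM_TYPE_MAP = {'MistralForCausalLM': 'MistralRMSNorm'}


def collect_by_class_name(model: nn.Module, class_name: str) -> dict:
    """Return all submodules whose class name matches `class_name`."""
    return {name: mod for name, mod in model.named_modules()
            if type(mod).__name__ == class_name}


def resolve_calibration_targets(model: nn.Module):
    arch = type(model).__name__  # e.g. 'MistralForCausalLM'
    layers = collect_by_class_name(model, LAYER_TYPE_MAP[arch])
    norms = collect_by_class_name(model, NORM_TYPE_MAP[arch])
    return layers, norms  # the modules hooked during calibration
```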
lmdeploy/lite/apis/smooth_quant.py (66 changes: 2 additions & 64 deletions)

@@ -1,70 +1,15 @@
# Copyright (c) OpenMMLab. All rights reserved.

-import os.path as osp
-import shutil

import fire
import torch
from torch import nn

-import lmdeploy
-from lmdeploy.lite.apis.calibrate import calibrate
+from lmdeploy.lite.apis.calibrate import (LAYER_TYPE_MAP, NORM_TYPE_MAP,
+calibrate)
from lmdeploy.lite.quantization.awq import (FC_FCS_MAP, NORM_FCS_MAP,
awq_layers, smooth_layers)
from lmdeploy.lite.utils import collect_target_modules
from lmdeploy.pytorch.models import QLinear, QRMSNorm

-LAYER_TYPE_MAP = {
-'InternLMForCausalLM': 'InternLMDecoderLayer',
-'InternLM2ForCausalLM': 'InternLM2DecoderLayer',
-'QWenLMHeadModel': 'QWenBlock',
-'BaiChuanForCausalLM': 'DecoderLayer',
-'LlamaForCausalLM': 'LlamaDecoderLayer',
-'ChatGLMForConditionalGeneration': 'GLMBlock',
-}
-NORM_TYPE_MAP = {
-'InternLMForCausalLM': 'InternLMRMSNorm',
-'InternLM2ForCausalLM': 'InternLM2RMSNorm',
-'QWenLMHeadModel': 'RMSNorm',
-'BaiChuanForCausalLM': 'RMSNorm',
-'LlamaForCausalLM': 'LlamaRMSNorm',
-'ChatGLMForConditionalGeneration': 'RMSNorm',
-}
-
-LMDEPLOY_ROOT = lmdeploy.__path__[0]
-
-MODEL_PATH_MAP = {
-'InternLMForCausalLM':
-osp.join(LMDEPLOY_ROOT, 'pytorch/modeling/modeling_internlm.py'),
-'InternLM2ForCausalLM':
-osp.join(LMDEPLOY_ROOT, 'pytorch/modeling/modeling_internlm2.py'),
-'LlamaForCausalLM':
-osp.join(LMDEPLOY_ROOT, 'pytorch/modeling/modeling_llama.py'),
-'BaiChuanForCausalLM':
-osp.join(LMDEPLOY_ROOT, 'pytorch/modeling/modeling_baichuan.py')
-}
-
-AUTO_MAP = {
-'InternLMForCausalLM': {
-'AutoConfig': 'configuration_internlm.InternLMConfig',
-'AutoModel': 'modeling_internlm.InternLMForCausalLM',
-'AutoModelForCausalLM': 'modeling_internlm.InternLMForCausalLM'
-},
-'InternLM2ForCausalLM': {
-'AutoConfig': 'configuration_internlm2.InternLMConfig',
-'AutoModelForCausalLM': 'modeling_internlm2.InternLM2ForCausalLM',
-'AutoModel': 'modeling_internlm2.InternLM2ForCausalLM'
-},
-'LlamaForCausalLM': {
-'AutoModel': 'modeling_llama.LlamaForCausalLM',
-'AutoModelForCausalLM': 'modeling_llama.LlamaForCausalLM'
-},
-'BaiChuanForCausalLM': {
-'AutoConfig': 'configuration_baichuan.BaiChuanConfig',
-'AutoModelForCausalLM': 'modeling_baichuan.BaiChuanForCausalLM'
-}
-}


def smooth_quant(model: str,
work_dir: str = './work_dir',
@@ -146,11 +91,6 @@ def smooth_quant(model: str,
setattr(parent, child_name, q_norm)
norm.to('cpu')

-if hasattr(model.config, 'auto_map'):
-model.config.auto_map.update(AUTO_MAP[type(model).__name__])
-else:
-model.config.auto_map = AUTO_MAP[type(model).__name__]

if vl_model:
from .auto_awq import save_vl_model
save_vl_model(vl_model, model_path, work_dir)
@@ -162,8 +102,6 @@ def smooth_quant(model: str,
safe_serialization=False)
tokenizer.save_pretrained(work_dir)

-shutil.copy(MODEL_PATH_MAP[type(model).__name__], work_dir)


if __name__ == '__main__':
fire.Fire(smooth_quant)
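With the per-model `MODEL_PATH_MAP`/`AUTO_MAP` machinery removed, the entry point reduces to the `smooth_quant` function driven by the shared maps imported from `calibrate.py`. A minimal usage sketch based on the two parameters visible in this diff; the model ID and output directory are examples, and every other argument is left at its default:

```python
# Hedged usage sketch: produce a W8A8 checkpoint for a newly supported model.
# Only `model` and `work_dir` (visible in this diff) are passed; the HF model
# ID and output path are illustrative.
from lmdeploy.lite.apis.smooth_quant import smooth_quant

smooth_quant('mistralai/Mistral-7B-Instruct-v0.3',
             work_dir='./mistral-7b-w8a8')
```

The equivalent CLI form would presumably be `lmdeploy lite smooth_quant <model> --work-dir <dir>`.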
lmdeploy/lite/quantization/awq.py (11 changes: 10 additions & 1 deletion)

@@ -50,7 +50,12 @@
'input_layernorm':
['self_attn.k_proj', 'self_attn.q_proj', 'self_attn.v_proj'],
'post_attention_layernorm': ['mlp.gate_proj', 'mlp.up_proj']
-}
+},
+'MistralDecoderLayer': {
+'input_layernorm':
+['self_attn.k_proj', 'self_attn.q_proj', 'self_attn.v_proj'],
+'post_attention_layernorm': ['mlp.gate_proj', 'mlp.up_proj']
+},
}

FC_FCS_MAP = {
@@ -92,6 +97,10 @@
'Qwen2VLDecoderLayer': {
'self_attn.v_proj': ['self_attn.o_proj'],
'mlp.up_proj': ['mlp.down_proj']
},
+'MistralDecoderLayer': {
+'self_attn.v_proj': ['self_attn.o_proj'],
+'mlp.up_proj': ['mlp.down_proj']
+}
}

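`NORM_FCS_MAP` and `FC_FCS_MAP` record, per decoder-layer class, which normalization layer (or preceding projection) feeds which fully connected layers, so the smoothing pass knows where to migrate activation outliers. Below is a compact sketch of the SmoothQuant-style rescaling that such a mapping enables; it illustrates the idea only and is not lmdeploy's `smooth_layers` implementation:

```python
# Illustrative SmoothQuant-style smoothing for one norm -> FC group, assuming
# per-channel activation maxima (`act_scales`) were gathered during calibration.
# This is a sketch of the idea, not lmdeploy's `smooth_layers` implementation.
import torch
from torch import nn


@torch.no_grad()
def smooth_norm_into_fcs(norm: nn.Module, fcs: list, act_scales: torch.Tensor,
                         alpha: float = 0.5) -> None:
    # Per-input-channel weight maxima across all FCs fed by this norm.
    weight_scales = torch.stack(
        [fc.weight.abs().amax(dim=0) for fc in fcs]).amax(dim=0)
    # SmoothQuant factor: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    scales = (act_scales.pow(alpha) /
              weight_scales.pow(1 - alpha)).clamp(min=1e-5)
    norm.weight.div_(scales)      # activations leaving the norm shrink by 1/s
    for fc in fcs:
        fc.weight.mul_(scales)    # following linears absorb s on input channels
```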
lmdeploy/pytorch/modeling/__init__.py (1 change: 0 additions & 1 deletion)

This file was deleted.

lmdeploy/pytorch/modeling/convert_to_qmodules.py (59 changes: 0 additions & 59 deletions)

This file was deleted.
