Supported Models

The following tables detail the models supported by LMDeploy's TurboMind engine and PyTorch engine across different platforms.

TurboMind on CUDA Platform

| Model | Size | Type | FP16/BF16 | KV INT8 | KV INT4 | W4A16 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama | 7B - 65B | LLM | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.2[2] | 1B, 3B | LLM | Yes | Yes* | Yes* | Yes |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
| InternLM-XComposer2 | 7B, 4khd-7B | MLLM | Yes | Yes | Yes | Yes |
| InternLM-XComposer2.5 | 7B | MLLM | Yes | Yes | Yes | Yes |
| Qwen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes |
| Qwen1.5[1] | 1.8B - 110B | LLM | Yes | Yes | Yes | Yes |
| Qwen2[2] | 0.5B - 72B | LLM | Yes | Yes* | Yes* | Yes |
| Qwen2-MoE | 57BA14B | LLM | Yes | Yes | Yes | Yes |
| Qwen2.5[2] | 0.5B - 72B | LLM | Yes | Yes* | Yes* | Yes |
| Mistral[1] | 7B | LLM | Yes | Yes | Yes | No |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
| DeepSeek-V2.5 | 236B | LLM | Yes | Yes | Yes | No |
| Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes |
| Code Llama | 7B - 34B | LLM | Yes | Yes | Yes | No |
| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes |
| LLaVA(1.5,1.6) | 7B - 34B | MLLM | Yes | Yes | Yes | Yes |
| InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
| InternVL2[2] | 1 - 2B, 8B - 76B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL2.5(MPO)[2] | 1 - 78B | MLLM | Yes | Yes* | Yes* | Yes |
| ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniGeminiLlama | 7B | MLLM | Yes | - | - | Yes |
| GLM4 | 9B | LLM | Yes | Yes | Yes | Yes |
| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | - |
| Molmo | 7B-D,72B | MLLM | Yes | Yes | Yes | No |

"-" means not verified yet.

* [1] The TurboMind engine doesn't support window attention. For models that apply window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral and Qwen1.5, please use the PyTorch engine for inference.
* [2] When a model's head_dim is not 128, as with llama3.2-1B, qwen2-0.5B and internvl2-1B, TurboMind doesn't support 4/8-bit KV cache quantization and inference for it.
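
The two notes above amount to a pair of checks on a model's configuration. The following is a minimal sketch, assuming a Hugging Face style config mapping with `hidden_size`, `num_attention_heads`, an optional `head_dim`, and the `use_sliding_window` switch. The helper names are hypothetical and not part of LMDeploy, and the sample values are illustrative rather than read from a real checkpoint.

```python
from typing import Any, Mapping

# Head dimension required for TurboMind KV cache quantization, per note [2].
TURBOMIND_KV_QUANT_HEAD_DIM = 128


def head_dim(config: Mapping[str, Any]) -> int:
    """Return the per-head dimension, preferring an explicit `head_dim` entry."""
    if config.get("head_dim") is not None:
        return int(config["head_dim"])
    return int(config["hidden_size"]) // int(config["num_attention_heads"])


def choose_engine(config: Mapping[str, Any]) -> str:
    """Return 'pytorch' when sliding-window attention is switched on (note [1])."""
    return "pytorch" if config.get("use_sliding_window", False) else "turbomind"


def turbomind_kv_quant_ok(config: Mapping[str, Any]) -> bool:
    """Return True when the head dimension allows 4/8-bit KV cache (note [2])."""
    return head_dim(config) == TURBOMIND_KV_QUANT_HEAD_DIM


# Illustrative configs only (hypothetical values).
small = {"hidden_size": 2048, "num_attention_heads": 32, "use_sliding_window": False}
windowed = {"hidden_size": 4096, "num_attention_heads": 32, "use_sliding_window": True}

print(choose_engine(small), turbomind_kv_quant_ok(small))
print(choose_engine(windowed), turbomind_kv_quant_ok(windowed))
```

In this sketch the first config has a head dimension of 64, so it would fall under the "Yes*" caveat for KV INT8 and KV INT4, while the second would be routed to the PyTorch engine because its sliding-window switch is on.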

PyTorchEngine on CUDA Platform

| Model | Size | Type | FP16/BF16 | KV INT8 | KV INT4 | W8A8 | W4A16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama | 7B - 65B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3.2-VL | 11B, 90B | MLLM | Yes | Yes | Yes | - | - |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes | No |
| Baichuan2 | 13B | LLM | Yes | Yes | Yes | No | No |
| ChatGLM2 | 6B | LLM | Yes | Yes | Yes | No | No |
| Falcon | 7B - 180B | LLM | Yes | Yes | Yes | No | No |
| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes | Yes |
| Mistral | 7B | LLM | Yes | Yes | Yes | Yes | Yes |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | No | No |
| QWen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes | Yes |
| QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | Yes | Yes |
| QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | Yes |
| DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | No | No | No | No |
| DeepSeek-V2.5 | 236B | LLM | Yes | No | No | No | No |
| MiniCPM3 | 4B | LLM | Yes | Yes | Yes | No | No |
| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | No | Yes |
| Gemma | 2B-7B | LLM | Yes | Yes | Yes | No | No |
| Dbrx | 132B | LLM | Yes | Yes | Yes | No | No |
| StarCoder2 | 3B-15B | LLM | Yes | Yes | Yes | No | No |
| Phi-3-mini | 3.8B | LLM | Yes | Yes | Yes | Yes | Yes |
| Phi-3-vision | 4.2B | MLLM | Yes | Yes | Yes | - | - |
| CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | - | - |
| CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | - | - |
| LLaVA(1.5,1.6)[2] | 7B-34B | MLLM | No | No | No | No | No |
| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
| InternVL2 | 1B-76B | MLLM | Yes | Yes | Yes | - | - |
| InternVL2.5(MPO) | 1B-78B | MLLM | Yes | Yes | Yes | - | - |
| Mono-InternVL[1] | 2B | MLLM | Yes | Yes | Yes | - | - |
| ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
| Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | - | - |
| GLM4 | 9B | LLM | Yes | Yes | Yes | No | No |
| GLM-4V | 9B | MLLM | Yes | Yes | Yes | No | Yes |
| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | - | - |
| Phi-3.5-mini | 3.8B | LLM | Yes | Yes | No | - | - |
| Phi-3.5-MoE | 16x3.8B | LLM | Yes | Yes | No | - | - |
| Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | - | - |

* [1] Mono-InternVL currently does not support FP16 because of numerical instability. Please use BF16 instead.
* [2] The PyTorch engine dropped support for the original llava models after v0.6.4. Please use the corresponding transformers models instead, which are available at https://huggingface.co/llava-hf
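
Note [1] can be applied as a small guard before an engine is configured. This is a minimal sketch: `resolve_dtype` is a hypothetical helper, not an LMDeploy function. The dtype strings and the name-matching rule are assumptions made for illustration, as are the model identifiers.

```python
def resolve_dtype(model_id: str, requested: str = "float16") -> str:
    """Return the dtype to use, switching Mono-InternVL from FP16 to BF16.

    Matching on the model name is an assumption of this sketch; a real
    deployment may prefer to inspect the model's architecture instead.
    """
    if "mono-internvl" in model_id.lower() and requested == "float16":
        return "bfloat16"
    return requested


# Example identifiers, used only to exercise the helper.
print(resolve_dtype("OpenGVLab/Mono-InternVL-2B"))
print(resolve_dtype("OpenGVLab/Mono-InternVL-2B", "bfloat16"))
print(resolve_dtype("some-org/other-model"))
```

The helper leaves every other model's requested dtype unchanged, so it can sit in front of whatever configuration step follows without affecting models that run in FP16.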

PyTorchEngine on Huawei Ascend Platform

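| Model | Size | Type | FP16/BF16(eager) | FP16/BF16(graph) | W4A16(eager) |
| --- | --- | --- | --- | --- | --- |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes |
| Llama3 | 8B | LLM | Yes | Yes | Yes |
| Llama3.1 | 8B | LLM | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes |
| InternLM2.5 | 7B - 20B | LLM | Yes | Yes | Yes |
| Mixtral | 8x7B | LLM | Yes | Yes | No |
| QWen1.5-MoE | A2.7B | LLM | Yes | - | No |
| QWen2(.5) | 7B | LLM | Yes | Yes | No |
| QWen2-MoE | A14.57B | LLM | Yes | - | No |
| InternVL(v1.5) | 2B-26B | MLLM | Yes | - | Yes |
| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes |
| CogVLM2-chat | 19B | MLLM | Yes | No | - |
| GLM4V | 9B | MLLM | Yes | No | - |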