The following tables detail the models supported by LMDeploy's TurboMind engine and PyTorch engine across different platforms.
TurboMind on CUDA Platform
| Model | Size | Type | FP16/BF16 | KV INT8 | KV INT4 | W4A16 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama | 7B - 65B | LLM | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.2[2] | 1B, 3B | LLM | Yes | Yes* | Yes* | Yes |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
| InternLM-XComposer2 | 7B, 4khd-7B | MLLM | Yes | Yes | Yes | Yes |
| InternLM-XComposer2.5 | 7B | MLLM | Yes | Yes | Yes | Yes |
| Qwen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes |
| Qwen1.5[1] | 1.8B - 110B | LLM | Yes | Yes | Yes | Yes |
| Qwen2[2] | 0.5B - 72B | LLM | Yes | Yes* | Yes* | Yes |
| Qwen2-MoE | 57BA14B | LLM | Yes | Yes | Yes | Yes |
| Qwen2.5[2] | 0.5B - 72B | LLM | Yes | Yes* | Yes* | Yes |
| Mistral[1] | 7B | LLM | Yes | Yes | Yes | No |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
| DeepSeek-V2.5 | 236B | LLM | Yes | Yes | Yes | No |
| Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes |
| Code Llama | 7B - 34B | LLM | Yes | Yes | Yes | No |
| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes |
| LLaVA(1.5,1.6) | 7B - 34B | MLLM | Yes | Yes | Yes | Yes |
| InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
| InternVL2[2] | 1 - 2B, 8B - 76B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL2.5(MPO)[2] | 1 - 78B | MLLM | Yes | Yes* | Yes* | Yes |
| ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniGeminiLlama | 7B | MLLM | Yes | - | - | Yes |
| GLM4 | 9B | LLM | Yes | Yes | Yes | Yes |
| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | - |
| Molmo | 7B-D,72B | MLLM | Yes | Yes | Yes | No |
"-" means not verified yet.
* [1] The TurboMind engine doesn't support window attention. For models that apply window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral and Qwen1.5, please choose the PyTorch engine for inference.
* [2] When the head_dim of a model is not 128, as with llama3.2-1B, qwen2-0.5B and internvl2-1B, TurboMind doesn't support 4-bit or 8-bit KV cache quantization and inference for that model.
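
The two notes above amount to a pair of checks on a model's configuration, which the following sketch illustrates. It is not part of LMDeploy: the helper names are hypothetical, and it assumes the configuration is a Hugging Face style dict exposing `hidden_size`, `num_attention_heads`, optionally `head_dim`, and the `use_sliding_window` switch named in note [1]. The numeric values in the usage example are illustrative only.

```python
from typing import Any, Dict

# Head dimension required for TurboMind KV cache quantization, per note [2].
REQUIRED_HEAD_DIM = 128


def head_dim(config: Dict[str, Any]) -> int:
    """Return the attention head dimension of a config dict.

    Assumption: if the config has no explicit "head_dim", it can be derived
    as hidden_size // num_attention_heads.
    """
    if config.get("head_dim") is not None:
        return int(config["head_dim"])
    return int(config["hidden_size"]) // int(config["num_attention_heads"])


def turbomind_limits(config: Dict[str, Any]) -> Dict[str, bool]:
    """Summarize notes [1] and [2] for one model config (hypothetical helper)."""
    return {
        # Note [1]: window attention enabled means the PyTorch engine is needed.
        "needs_pytorch_engine": bool(config.get("use_sliding_window", False)),
        # Note [2]: KV INT4/INT8 is only supported when head_dim is 128.
        "kv_quant_supported": head_dim(config) == REQUIRED_HEAD_DIM,
    }


# Illustrative configs, not read from any real checkpoint.
small_model = {"hidden_size": 2048, "num_attention_heads": 32}
windowed_model = {
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "use_sliding_window": True,
}

print(turbomind_limits(small_model))
print(turbomind_limits(windowed_model))
```

A check like this could run before choosing an engine or a KV quantization setting. The actual decision should still be confirmed against the table, since "-" entries are unverified.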
PyTorchEngine on CUDA Platform
| Model | Size | Type | FP16/BF16 | KV INT8 | KV INT4 | W8A8 | W4A16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama | 7B - 65B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3.2-VL | 11B, 90B | MLLM | Yes | Yes | Yes | - | - |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes | No |
| Baichuan2 | 13B | LLM | Yes | Yes | Yes | No | No |
| ChatGLM2 | 6B | LLM | Yes | Yes | Yes | No | No |
| Falcon | 7B - 180B | LLM | Yes | Yes | Yes | No | No |
| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes | Yes |
| Mistral | 7B | LLM | Yes | Yes | Yes | Yes | Yes |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | No | No |
| QWen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes | Yes |
| QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | Yes | Yes |
| QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | Yes |
| DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | No | No | No | No |
| DeepSeek-V2.5 | 236B | LLM | Yes | No | No | No | No |
| MiniCPM3 | 4B | LLM | Yes | Yes | Yes | No | No |
| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | No | Yes |
| Gemma | 2B-7B | LLM | Yes | Yes | Yes | No | No |
| Dbrx | 132B | LLM | Yes | Yes | Yes | No | No |
| StarCoder2 | 3B-15B | LLM | Yes | Yes | Yes | No | No |
| Phi-3-mini | 3.8B | LLM | Yes | Yes | Yes | Yes | Yes |
| Phi-3-vision | 4.2B | MLLM | Yes | Yes | Yes | - | - |
| CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | - | - |
| CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | - | - |
| LLaVA(1.5,1.6)[2] | 7B-34B | MLLM | No | No | No | No | No |
| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
| InternVL2 | 1B-76B | MLLM | Yes | Yes | Yes | - | - |
| InternVL2.5(MPO) | 1B-78B | MLLM | Yes | Yes | Yes | - | - |
| Mono-InternVL[1] | 2B | MLLM | Yes | Yes | Yes | - | - |
| ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
| Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | - | - |
| GLM4 | 9B | LLM | Yes | Yes | Yes | No | No |
| GLM-4V | 9B | MLLM | Yes | Yes | Yes | No | Yes |
| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | - | - |
| Phi-3.5-mini | 3.8B | LLM | Yes | Yes | No | - | - |
| Phi-3.5-MoE | 16x3.8B | LLM | Yes | Yes | No | - | - |
| Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | - | - |
* [1] Currently Mono-InternVL does not support FP16 due to numerical instability. Please use BF16 instead.
* [2] The PyTorch engine removed support for the original llava models after v0.6.4. Please use the corresponding transformers models instead, which can be found at https://huggingface.co/llava-hf
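
Note [1] above can be enforced with a small guard before an engine is configured, as in this minimal sketch. The function name is hypothetical, matching on the substring "mono-internvl" is an assumption about how the model is named, and the model identifiers in the usage lines are illustrative.

```python
def resolve_dtype(model_name: str, requested: str = "auto") -> str:
    """Pick a dtype string for the PyTorch engine (hypothetical helper).

    Per note [1], Mono-InternVL is numerically unstable in FP16, so any
    request that could end up as FP16 is replaced with BF16.
    """
    is_mono_internvl = "mono-internvl" in model_name.lower()
    if is_mono_internvl and requested in ("auto", "float16"):
        return "bfloat16"
    return requested


# Illustrative model identifiers.
print(resolve_dtype("Mono-InternVL-2B", "float16"))
print(resolve_dtype("internlm2_5-7b-chat", "float16"))
```

The returned string is meant to be passed on to whatever dtype option the engine configuration exposes. Check the LMDeploy documentation for the exact parameter name before relying on it.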