diff --git a/README.md b/README.md
index d160338aa..8ef7b7994 100644
--- a/README.md
+++ b/README.md
@@ -125,6 +125,8 @@ For detailed inference benchmarks in more devices and more settings, please refe
Qwen1.5 (0.5B - 110B)
Qwen1.5 - MoE (0.5B - 72B)
Qwen2 (0.5B - 72B)
+ Qwen2-MoE (57B-A14B)
+ Qwen2.5 (0.5B - 32B)
Baichuan (7B)
Baichuan2 (7B-13B)
Code Llama (7B - 34B)
@@ -136,6 +138,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
Mistral (7B)
DeepSeek-MoE (16B)
DeepSeek-V2 (16B, 236B)
+ DeepSeek-V2.5 (236B)
Mixtral (8x7B, 8x22B)
Gemma (2B - 7B)
Dbrx (132B)
diff --git a/README_ja.md b/README_ja.md
index fda176229..77badaac3 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -122,6 +122,8 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
Qwen1.5 (0.5B - 110B)
Qwen1.5 - MoE (0.5B - 72B)
Qwen2 (0.5B - 72B)
+ Qwen2-MoE (57B-A14B)
+ Qwen2.5 (0.5B - 32B)
Baichuan (7B)
Baichuan2 (7B-13B)
Code Llama (7B - 34B)
@@ -133,6 +135,7 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
Mistral (7B)
DeepSeek-MoE (16B)
DeepSeek-V2 (16B, 236B)
+ DeepSeek-V2.5 (236B)
Mixtral (8x7B, 8x22B)
Gemma (2B - 7B)
Dbrx (132B)
diff --git a/README_zh-CN.md b/README_zh-CN.md
index 6c24b2e50..9f3cd40a6 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -126,6 +126,8 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
Qwen1.5 (0.5B - 110B)
Qwen1.5 - MoE (0.5B - 72B)
Qwen2 (0.5B - 72B)
+ Qwen2-MoE (57B-A14B)
+ Qwen2.5 (0.5B - 32B)
Baichuan (7B)
Baichuan2 (7B-13B)
Code Llama (7B - 34B)
@@ -137,6 +139,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
Mistral (7B)
DeepSeek-MoE (16B)
DeepSeek-V2 (16B, 236B)
+ DeepSeek-V2.5 (236B)
Mixtral (8x7B, 8x22B)
Gemma (2B - 7B)
Dbrx (132B)
diff --git a/docs/en/supported_models/supported_models.md b/docs/en/supported_models/supported_models.md
index 469ece487..dd8ceb4ff 100644
--- a/docs/en/supported_models/supported_models.md
+++ b/docs/en/supported_models/supported_models.md
@@ -10,7 +10,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
-| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes |
+| Llama3.2 | 1B, 3B | LLM | Yes | Yes\* | Yes\* | Yes |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -18,9 +18,13 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| InternLM-XComposer2.5 | 7B | MLLM | Yes | Yes | Yes | Yes |
| Qwen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes |
| Qwen1.5 | 1.8B - 110B | LLM | Yes | Yes | Yes | Yes |
-| Qwen2 | 0.5B - 72B | LLM | Yes | Yes | Yes | Yes |
+| Qwen2 | 0.5B - 72B | LLM | Yes | Yes\* | Yes\* | Yes |
+| Qwen2-MoE | 57B-A14B | LLM | Yes | Yes | Yes | Yes |
+| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | Yes | Yes |
| Mistral | 7B | LLM | Yes | Yes | Yes | No |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
+| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
+| DeepSeek-V2.5 | 236B | LLM | Yes | Yes | Yes | No |
| Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -29,7 +33,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes |
| LLaVA(1.5,1.6) | 7B - 34B | MLLM | Yes | Yes | Yes | Yes |
| InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
-| InternVL2 | 1-2B, 8B - 76B | MLLM | Yes | Yes | Yes | Yes |
+| InternVL2 | 1-2B, 8B - 76B | MLLM | Yes | Yes\* | Yes\* | Yes |
| ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
@@ -41,7 +45,8 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
"-" means not verified yet.
```{note}
-The TurboMind engine doesn't support window attention. Therefore, for models that have applied window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral, Qwen1.5 and etc., please choose the PyTorch engine for inference.
+* The TurboMind engine doesn't support window attention. Therefore, for models that use window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral and Qwen1.5, please choose the PyTorch engine for inference.
+* When a model's head_dim is not 128, as in llama3.2-1B, qwen2-0.5B and internvl2-1B, the TurboMind engine does not support 4-bit/8-bit KV cache quantization or inference for it.
```
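+
+For the cases above, a minimal sketch of selecting the engine and KV cache settings explicitly, using the `lmdeploy` pipeline API (the Hugging Face model ids are illustrative):
+
+```python
+from lmdeploy import pipeline, PytorchEngineConfig, TurbomindEngineConfig
+
+# Sliding-window models such as Mistral must use the PyTorch engine.
+pipe = pipeline('mistralai/Mistral-7B-Instruct-v0.2',
+                backend_config=PytorchEngineConfig())
+
+# On TurboMind, quant_policy=4 enables 4-bit KV cache (8 means 8-bit).
+# Keep the default quant_policy=0 for head_dim != 128 models (e.g. qwen2-0.5B).
+pipe = pipeline('Qwen/Qwen2.5-7B-Instruct',
+                backend_config=TurbomindEngineConfig(quant_policy=4))
+```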
## PyTorchEngine on CUDA Platform
@@ -68,11 +73,13 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
| QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | Yes | Yes |
| QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
+| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | No |
| DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | No | No | No | No |
+| DeepSeek-V2.5 | 236B | LLM | Yes | No | No | No | No |
| MiniCPM3 | 4B | LLM | Yes | Yes | Yes | No | No |
-| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | Yes | Yes |
+| MiniCPM-V-2_6 | 8B | MLLM | Yes | No | No | No | Yes |
| Gemma | 2B-7B | LLM | Yes | Yes | Yes | No | No |
| Dbrx | 132B | LLM | Yes | Yes | Yes | No | No |
| StarCoder2 | 3B-15B | LLM | Yes | Yes | Yes | No | No |
@@ -81,7 +88,7 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
| CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | - | - |
| CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | - | - |
| LLaVA(1.5,1.6) | 7B-34B | MLLM | Yes | Yes | Yes | - | - |
-| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | Yes | Yes |
+| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | - | - |
| Mono-InternVL | 2B | MLLM | Yes\* | Yes | Yes | - | - |
| ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
diff --git a/docs/zh_cn/supported_models/supported_models.md b/docs/zh_cn/supported_models/supported_models.md
index d73452328..3ec3688e1 100644
--- a/docs/zh_cn/supported_models/supported_models.md
+++ b/docs/zh_cn/supported_models/supported_models.md
@@ -10,7 +10,7 @@
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
-| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes |
+| Llama3.2 | 1B, 3B | LLM | Yes | Yes\* | Yes\* | Yes |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -18,9 +18,13 @@
| InternLM-XComposer2.5 | 7B | MLLM | Yes | Yes | Yes | Yes |
| Qwen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes |
| Qwen1.5 | 1.8B - 110B | LLM | Yes | Yes | Yes | Yes |
-| Qwen2 | 0.5B - 72B | LLM | Yes | Yes | Yes | Yes |
+| Qwen2 | 0.5B - 72B | LLM | Yes | Yes\* | Yes\* | Yes |
+| Qwen2-MoE | 57B-A14B | LLM | Yes | Yes | Yes | Yes |
+| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | Yes | Yes |
| Mistral | 7B | LLM | Yes | Yes | Yes | No |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
+| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
+| DeepSeek-V2.5 | 236B | LLM | Yes | Yes | Yes | No |
| Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -29,7 +33,7 @@
| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes |
| LLaVA(1.5,1.6) | 7B - 34B | MLLM | Yes | Yes | Yes | Yes |
| InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
-| InternVL2 | 1-2B, 8B - 76B | MLLM | Yes | Yes | Yes | Yes |
+| InternVL2 | 1-2B, 8B - 76B | MLLM | Yes | Yes\* | Yes\* | Yes |
| ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
@@ -41,7 +45,8 @@
“-” 表示还没有验证。
```{note}
-turbomind 引擎不支持 window attention。所以,对于应用了 window attention,并开启了对应的开关"use_sliding_window"的模型,比如 Mistral、Qwen1.5 等,在推理时,请选择 pytorch engine
+* turbomind 引擎不支持 window attention。所以,对于应用了 window attention,并开启了对应的开关"use_sliding_window"的模型,比如 Mistral、Qwen1.5 等,在推理时,请选择 pytorch engine
+* 当模型的 head_dim 非 128 时,turbomind 不支持它的 kv cache 4/8 bit 量化和推理。比如,llama3.2-1B,qwen2-0.5B,internvl2-1B 等等。
```
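+
+A minimal sketch of the same engine and KV cache choices, using the `lmdeploy` pipeline API (the Hugging Face model ids are illustrative):
+
+```python
+from lmdeploy import pipeline, PytorchEngineConfig, TurbomindEngineConfig
+
+# Sliding-window models such as Mistral must use the PyTorch engine.
+pipe = pipeline('mistralai/Mistral-7B-Instruct-v0.2',
+                backend_config=PytorchEngineConfig())
+
+# On TurboMind, quant_policy=4 enables 4-bit KV cache (8 means 8-bit).
+# Keep the default quant_policy=0 for head_dim != 128 models (e.g. qwen2-0.5B).
+pipe = pipeline('Qwen/Qwen2.5-7B-Instruct',
+                backend_config=TurbomindEngineConfig(quant_policy=4))
+```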
## PyTorchEngine CUDA 平台
@@ -68,11 +73,13 @@ turbomind 引擎不支持 window attention。所以,对于应用了 window att
| QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | Yes | Yes |
| QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
+| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | No |
| DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | No | No | No | No |
+| DeepSeek-V2.5 | 236B | LLM | Yes | No | No | No | No |
| MiniCPM3 | 4B | LLM | Yes | Yes | Yes | No | No |
-| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | Yes | Yes |
+| MiniCPM-V-2_6 | 8B | MLLM | Yes | No | No | No | Yes |
| Gemma | 2B-7B | LLM | Yes | Yes | Yes | No | No |
| Dbrx | 132B | LLM | Yes | Yes | Yes | No | No |
| StarCoder2 | 3B-15B | LLM | Yes | Yes | Yes | No | No |
@@ -81,7 +88,7 @@ turbomind 引擎不支持 window attention。所以,对于应用了 window att
| CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | - | - |
| CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | - | - |
| LLaVA(1.5,1.6) | 7B-34B | MLLM | Yes | Yes | Yes | - | - |
-| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | Yes | Yes |
+| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | - | - |
| Mono-InternVL | 2B | MLLM | Yes\* | Yes | Yes | - | - |
| ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
@@ -94,7 +101,7 @@ turbomind 引擎不支持 window attention。所以,对于应用了 window att
| Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | - | - |
```{note}
-* Currently Mono-InternVL does not support FP16 due to numerical instability. Please use BF16 instead.
+* 目前,Mono-InternVL不支持FP16,因为数值不稳定。请改用BF16。
```
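+
+A minimal sketch of forcing BF16 for Mono-InternVL as the note advises, assuming the `dtype` option of `PytorchEngineConfig` (the model id is illustrative):
+
+```python
+from lmdeploy import pipeline, PytorchEngineConfig
+
+# Mono-InternVL is numerically unstable in FP16; request BF16 explicitly.
+pipe = pipeline('OpenGVLab/Mono-InternVL-2B',
+                backend_config=PytorchEngineConfig(dtype='bfloat16'))
+```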
## PyTorchEngine 华为昇腾平台