diff --git a/README.md b/README.md
index d160338aa..8ef7b7994 100644
--- a/README.md
+++ b/README.md
@@ -125,6 +125,8 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>Qwen1.5 (0.5B - 110B)</li>
 <li>Qwen1.5 - MoE (0.5B - 72B)</li>
 <li>Qwen2 (0.5B - 72B)</li>
+<li>Qwen2-MoE (57BA14B)</li>
+<li>Qwen2.5 (0.5B - 32B)</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -136,6 +138,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>Mistral (7B)</li>
 <li>DeepSeek-MoE (16B)</li>
 <li>DeepSeek-V2 (16B, 236B)</li>
+<li>DeepSeek-V2.5 (236B)</li>
 <li>Mixtral (8x7B, 8x22B)</li>
 <li>Gemma (2B - 7B)</li>
 <li>Dbrx (132B)</li>
diff --git a/README_ja.md b/README_ja.md
index fda176229..77badaac3 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -122,6 +122,8 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
 <li>Qwen1.5 (0.5B - 110B)</li>
 <li>Qwen1.5 - MoE (0.5B - 72B)</li>
 <li>Qwen2 (0.5B - 72B)</li>
+<li>Qwen2-MoE (57BA14B)</li>
+<li>Qwen2.5 (0.5B - 32B)</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -133,6 +135,7 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
 <li>Mistral (7B)</li>
 <li>DeepSeek-MoE (16B)</li>
 <li>DeepSeek-V2 (16B, 236B)</li>
+<li>DeepSeek-V2.5 (236B)</li>
 <li>Mixtral (8x7B, 8x22B)</li>
 <li>Gemma (2B - 7B)</li>
 <li>Dbrx (132B)</li>
diff --git a/README_zh-CN.md b/README_zh-CN.md
index 6c24b2e50..9f3cd40a6 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -126,6 +126,8 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
 <li>Qwen1.5 (0.5B - 110B)</li>
 <li>Qwen1.5 - MoE (0.5B - 72B)</li>
 <li>Qwen2 (0.5B - 72B)</li>
+<li>Qwen2-MoE (57BA14B)</li>
+<li>Qwen2.5 (0.5B - 32B)</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -137,6 +139,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
 <li>Mistral (7B)</li>
 <li>DeepSeek-MoE (16B)</li>
 <li>DeepSeek-V2 (16B, 236B)</li>
+<li>DeepSeek-V2.5 (236B)</li>
 <li>Mixtral (8x7B, 8x22B)</li>
 <li>Gemma (2B - 7B)</li>
 <li>Dbrx (132B)</li>
diff --git a/docs/en/supported_models/supported_models.md b/docs/en/supported_models/supported_models.md
index 469ece487..dd8ceb4ff 100644
--- a/docs/en/supported_models/supported_models.md
+++ b/docs/en/supported_models/supported_models.md
@@ -10,7 +10,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 | Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
 | Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
 | Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
-| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes |
+| Llama3.2 | 1B, 3B | LLM | Yes | Yes\* | Yes\* | Yes |
 | InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
 | InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
 | InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -18,9 +18,13 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 | InternLM-XComposer2.5 | 7B | MLLM | Yes | Yes | Yes | Yes |
 | Qwen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes |
 | Qwen1.5 | 1.8B - 110B | LLM | Yes | Yes | Yes | Yes |
-| Qwen2 | 0.5B - 72B | LLM | Yes | Yes | Yes | Yes |
+| Qwen2 | 0.5B - 72B | LLM | Yes | Yes\* | Yes\* | Yes |
+| Qwen2-MoE | 57BA14B | LLM | Yes | Yes | Yes | Yes |
+| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | Yes | Yes |
 | Mistral | 7B | LLM | Yes | Yes | Yes | No |
 | Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
+| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
+| DeepSeek-V2.5 | 236B | LLM | Yes | Yes | Yes | No |
 | Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
 | DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
 | Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -29,7 +33,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 | YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes |
 | LLaVA(1.5,1.6) | 7B - 34B | MLLM | Yes | Yes | Yes | Yes |
 | InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
-| InternVL2 | 1-2B, 8B - 76B | MLLM | Yes | Yes | Yes | Yes |
+| InternVL2 | 1-2B, 8B - 76B | MLLM | Yes | Yes\* | Yes\* | Yes |
 | ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
 | MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
 | MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
@@ -41,7 +45,8 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 "-" means not verified yet.
 
 ```{note}
-The TurboMind engine doesn't support window attention. Therefore, for models that have applied window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral, Qwen1.5 and etc., please choose the PyTorch engine for inference.
+* The TurboMind engine doesn't support window attention. Therefore, for models that have applied window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral, Qwen1.5, etc., please choose the PyTorch engine for inference.
+* When the head_dim of a model is not 128, such as llama3.2-1B, qwen2-0.5B and internvl2-1B, the TurboMind engine doesn't support its kv cache 4/8 bit quantization and inference.
 ```
 
 ## PyTorchEngine on CUDA Platform
@@ -68,11 +73,13 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
 | QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | Yes | Yes |
 | QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
 | QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
+| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
 | QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | No |
 | DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
 | DeepSeek-V2 | 16B, 236B | LLM | Yes | No | No | No | No |
+| DeepSeek-V2.5 | 236B | LLM | Yes | No | No | No | No |
 | MiniCPM3 | 4B | LLM | Yes | Yes | Yes | No | No |
-| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | Yes | Yes |
+| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | No | Yes |
 | Gemma | 2B-7B | LLM | Yes | Yes | Yes | No | No |
 | Dbrx | 132B | LLM | Yes | Yes | Yes | No | No |
 | StarCoder2 | 3B-15B | LLM | Yes | Yes | Yes | No | No |
@@ -81,7 +88,7 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
 | CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | - | - |
 | CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | - | - |
 | LLaVA(1.5,1.6) | 7B-34B | MLLM | Yes | Yes | Yes | - | - |
-| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | Yes | Yes |
+| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
 | InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | - | - |
 | Mono-InternVL | 2B | MLLM | Yes\* | Yes | Yes | - | - |
 | ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
diff --git a/docs/zh_cn/supported_models/supported_models.md b/docs/zh_cn/supported_models/supported_models.md
index d73452328..3ec3688e1 100644
--- a/docs/zh_cn/supported_models/supported_models.md
+++ b/docs/zh_cn/supported_models/supported_models.md
@@ -10,7 +10,7 @@
 | Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
 | Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
 | Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
-| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes |
+| Llama3.2 | 1B, 3B | LLM | Yes | Yes\* | Yes\* | Yes |
 | InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
 | InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
 | InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -18,9 +18,13 @@
 | InternLM-XComposer2.5 | 7B | MLLM | Yes | Yes | Yes | Yes |
 | Qwen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes |
 | Qwen1.5 | 1.8B - 110B | LLM | Yes | Yes | Yes | Yes |
-| Qwen2 | 0.5B - 72B | LLM | Yes | Yes | Yes | Yes |
+| Qwen2 | 0.5B - 72B | LLM | Yes | Yes\* | Yes\* | Yes |
+| Qwen2-MoE | 57BA14B | LLM | Yes | Yes | Yes | Yes |
+| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | Yes | Yes |
 | Mistral | 7B | LLM | Yes | Yes | Yes | No |
 | Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
+| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
+| DeepSeek-V2.5 | 236B | LLM | Yes | Yes | Yes | No |
 | Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
 | DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
 | Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -29,7 +33,7 @@
 | YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes |
 | LLaVA(1.5,1.6) | 7B - 34B | MLLM | Yes | Yes | Yes | Yes |
 | InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
-| InternVL2 | 1-2B, 8B - 76B | MLLM | Yes | Yes | Yes | Yes |
+| InternVL2 | 1-2B, 8B - 76B | MLLM | Yes | Yes\* | Yes\* | Yes |
 | ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
 | MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
 | MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
@@ -41,7 +45,8 @@
 “-” 表示还没有验证。
 
 ```{note}
-turbomind 引擎不支持 window attention。所以,对于应用了 window attention,并开启了对应的开关"use_sliding_window"的模型,比如 Mistral、Qwen1.5 等,在推理时,请选择 pytorch engine
+* turbomind 引擎不支持 window attention。所以,对于应用了 window attention,并开启了对应的开关"use_sliding_window"的模型,比如 Mistral、Qwen1.5 等,在推理时,请选择 pytorch engine
+* 当模型的 head_dim 非 128 时,turbomind 不支持它的 kv cache 4/8 bit 量化和推理。比如,llama3.2-1B,qwen2-0.5B,internvl2-1B 等等
 ```
 
 ## PyTorchEngine CUDA 平台
@@ -68,11 +73,13 @@ turbomind 引擎不支持 window att
 | QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | Yes | Yes |
 | QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
 | QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
+| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
 | QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | No |
 | DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
 | DeepSeek-V2 | 16B, 236B | LLM | Yes | No | No | No | No |
+| DeepSeek-V2.5 | 236B | LLM | Yes | No | No | No | No |
 | MiniCPM3 | 4B | LLM | Yes | Yes | Yes | No | No |
-| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | Yes | Yes |
+| MiniCPM-V-2_6 | 8B | LLM | Yes | No | No | No | Yes |
 | Gemma | 2B-7B | LLM | Yes | Yes | Yes | No | No |
 | Dbrx | 132B | LLM | Yes | Yes | Yes | No | No |
 | StarCoder2 | 3B-15B | LLM | Yes | Yes | Yes | No | No |
@@ -81,7 +88,7 @@ turbomind 引擎不支持 window att
 | CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | - | - |
 | CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | - | - |
 | LLaVA(1.5,1.6) | 7B-34B | MLLM | Yes | Yes | Yes | - | - |
-| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | Yes | Yes |
+| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
 | InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | - | - |
 | Mono-InternVL | 2B | MLLM | Yes\* | Yes | Yes | - | - |
 | ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
@@ -94,7 +101,7 @@ turbomind 引擎不支持 window att
 | Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | - | - |
 
 ```{note}
-* Currently Mono-InternVL does not support FP16 due to numerical instability. Please use BF16 instead.
+* 目前,Mono-InternVL不支持FP16,因为数值不稳定。请改用BF16。
 ```
 
 ## PyTorchEngine 华为昇腾平台
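The window-attention note added in both docs determines which engine to pick at runtime. As a minimal sketch, assuming LMDeploy's Python API (`pipeline` with `PytorchEngineConfig`) and an illustrative model ID, routing a "use_sliding_window" model to the PyTorch engine looks like this:

```python
# Minimal sketch: models with "use_sliding_window" enabled (e.g. Mistral,
# Qwen1.5) must run on the PyTorch engine, since TurboMind does not support
# window attention. The model ID below is illustrative.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    'mistralai/Mistral-7B-Instruct-v0.2',
    backend_config=PytorchEngineConfig(tp=1),  # pin the PyTorch backend
)
print(pipe(['Summarize sliding-window attention in one sentence.']))
```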
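The head_dim restriction can also be checked programmatically before enabling kv cache quantization. The sketch below assumes that `quant_policy` in `TurbomindEngineConfig` selects the kv cache bit width (0 for none, 4 and 8 for 4/8 bit, as in LMDeploy's docs) and derives head_dim from the Hugging Face config; the model ID is illustrative.

```python
# Sketch: enable kv cache int8 only when head_dim == 128. Models such as
# llama3.2-1B or qwen2-0.5B have head_dim 64, so they fall back to an
# unquantized kv cache per the note above.
from transformers import AutoConfig
from lmdeploy import pipeline, TurbomindEngineConfig

model_id = 'Qwen/Qwen2.5-7B-Instruct'  # illustrative model ID
cfg = AutoConfig.from_pretrained(model_id)
head_dim = getattr(cfg, 'head_dim', None) or cfg.hidden_size // cfg.num_attention_heads

# quant_policy: 0 = fp16/bf16 kv cache, 4 = 4-bit, 8 = 8-bit (TurboMind)
policy = 8 if head_dim == 128 else 0
pipe = pipeline(model_id, backend_config=TurbomindEngineConfig(quant_policy=policy))
```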
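Finally, the Mono-InternVL note (FP16 is numerically unstable, use BF16) reduces to a dtype override. This sketch assumes the engine config exposes a `dtype` field, as recent LMDeploy releases do; the model ID is illustrative.

```python
# Sketch: force BF16 for Mono-InternVL, which is numerically unstable in
# FP16 per the note above. The `dtype` field is assumed to be available.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    'OpenGVLab/Mono-InternVL-2B',  # illustrative model ID
    backend_config=PytorchEngineConfig(dtype='bfloat16'),
)
```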