Support Mono-InternVL with PyTorch backend (#2727)
* support Mono-InternVL; fix typos

* update readme

* add assertion for FP16

* add assertion for FP16

* update _SUPPORTED_ARCHS
wzk1015 authored Nov 11, 2024
1 parent 78ab485 commit 06aea5d
Showing 33 changed files with 458 additions and 54 deletions.
14 changes: 7 additions & 7 deletions .github/CONTRIBUTING.md
@@ -1,6 +1,6 @@
- ## Contributing to InternLM
+ ## Contributing to LMDeploy

- Welcome to the InternLM community, all kinds of contributions are welcomed, including but not limited to
+ Welcome to the LMDeploy community, all kinds of contributions are welcomed, including but not limited to

**Fix bug**

@@ -56,7 +56,7 @@ upstream git@github.com:InternLM/lmdeploy.git (push)
#### 2. Configure pre-commit

- You should configure [pre-commit](https://pre-commit.com/#intro) in the local development environment to make sure the code style matches that of InternLM. **Note**: The following code should be executed under the lmdeploy directory.
+ You should configure [pre-commit](https://pre-commit.com/#intro) in the local development environment to make sure the code style matches that of LMDeploy. **Note**: The following code should be executed under the lmdeploy directory.

```shell
pip install -U pre-commit
@@ -96,7 +96,7 @@ git checkout -b yhc/refactor_contributing_doc
In subsequent development, if the master branch of the local repository is behind the master branch of "upstream", we need to pull the upstream for synchronization, and then execute the above command:

```shell
- git pull upstream master
+ git pull upstream main
```

#### 4. Commit the code and pass the unit test
@@ -151,7 +151,7 @@ Find more details about Pull Request description in [pull request guidelines](#p

<img src="https://user-images.githubusercontent.com/57566630/167307490-f9ebf9fa-63c0-4d83-8ba1-081ea169eb3a.png" width="1200">

- IternLM will run unit test for the posted Pull Request on different platforms (Linux, Window, Mac), based on different versions of Python, PyTorch, CUDA to make sure the code is correct. We can see the specific test information by clicking `Details` in the above image so that we can modify the code.
+ LMDeploy will run unit test for the posted Pull Request on different platforms (Linux, Window, Mac), based on different versions of Python, PyTorch, CUDA to make sure the code is correct. We can see the specific test information by clicking `Details` in the above image so that we can modify the code.

(3) If the Pull Request passes the CI, then you can wait for the review from other developers. You'll modify the code based on the reviewer's comments, and repeat the steps [4](#4-commit-the-code-and-pass-the-unit-test)-[5](#5-push-the-code-to-remote) until all reviewers approve it. Then, we will merge it ASAP.

@@ -163,14 +163,14 @@ If your local branch conflicts with the latest master branch of "upstream", you'

```shell
git fetch --all --prune
- git rebase upstream/master
+ git rebase upstream/main
```

or

```shell
git fetch --all --prune
- git merge upstream/master
+ git merge upstream/main
```

If you are very good at handling conflicts, then you can use rebase to resolve conflicts, as this will keep your commit logs tidy. If you are not familiar with `rebase`, then you can use `merge` to resolve conflicts.
2 changes: 2 additions & 0 deletions README.md
@@ -26,6 +26,7 @@ ______________________________________________________________________
<details open>
<summary><b>2024</b></summary>

+ - \[2024/11\] Support Mono-InternVL with PyTorch engine
- \[2024/10\] PyTorchEngine supports graph mode on ascend platform, doubling the inference speed
- \[2024/09\] LMDeploy PyTorchEngine adds support for [Huawei Ascend](./docs/en/get_started/ascend/get_started.md). See supported models [here](docs/en/supported_models/supported_models.md)
- \[2024/09\] LMDeploy PyTorchEngine achieves 1.3x faster on Llama3-8B inference by introducing CUDA graph
@@ -155,6 +156,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
<li>DeepSeek-VL (7B)</li>
<li>InternVL-Chat (v1.1-v1.5)</li>
<li>InternVL2 (1B-76B)</li>
+ <li>Mono-InternVL (2B)</li>
<li>MiniGeminiLlama (7B)</li>
<li>CogVLM-Chat (17B)</li>
<li>CogVLM2-Chat (19B)</li>
2 changes: 2 additions & 0 deletions README_zh-CN.md
@@ -26,6 +26,7 @@ ______________________________________________________________________
<details open>
<summary><b>2024</b></summary>

+ - \[2024/11\] The PyTorch engine supports the Mono-InternVL model
- \[2024/10\] PyTorchEngine supports graph mode on the ascend platform, doubling the inference speed
- \[2024/09\] LMDeploy PyTorchEngine adds support for [Huawei Ascend](docs/zh_cn/get_started/ascend/get_started.md). See the supported models [here](docs/zh_cn/supported_models/supported_models.md)
- \[2024/09\] By introducing CUDA Graph, LMDeploy PyTorchEngine achieves a 1.3x speedup on Llama3-8B inference
@@ -156,6 +157,7 @@ The LMDeploy TurboMind engine has excellent inference capabilities, and on models of all scales
<li>DeepSeek-VL (7B)</li>
<li>InternVL-Chat (v1.1-v1.5)</li>
<li>InternVL2 (1B-76B)</li>
+ <li>Mono-InternVL (2B)</li>
<li>MiniGeminiLlama (7B)</li>
<li>CogVLM-Chat (17B)</li>
<li>CogVLM2-Chat (19B)</li>
13 changes: 7 additions & 6 deletions docs/en/multi_modal/internvl.md
@@ -2,12 +2,13 @@

LMDeploy supports the following InternVL series of models, which are detailed in the table below:

- | Model | Size | Supported Inference Engine |
- | :---------: | :--------: | :------------------------: |
- | InternVL | 13B-19B | TurboMind |
- | InternVL1.5 | 2B-26B | TurboMind, PyTorch |
- | InternVL2 | 1B, 4B | PyTorch |
- | InternVL2 | 2B, 8B-76B | TurboMind, PyTorch |
+ | Model | Size | Supported Inference Engine |
+ | :-----------: | :--------: | :------------------------: |
+ | InternVL | 13B-19B | TurboMind |
+ | InternVL1.5 | 2B-26B | TurboMind, PyTorch |
+ | InternVL2 | 1B, 4B | PyTorch |
+ | InternVL2 | 2B, 8B-76B | TurboMind, PyTorch |
+ | Mono-InternVL | 2B | PyTorch |

The next chapter demonstrates how to deploy an InternVL model using LMDeploy, with [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) as an example.
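
For quick orientation on the new entry, here is a minimal usage sketch (assuming the standard `lmdeploy.pipeline` API and the `OpenGVLab/Mono-InternVL-2B` checkpoint linked in these docs; since Mono-InternVL is PyTorch-only per the table above, the backend is pinned explicitly):

```python
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

# Mono-InternVL is served by the PyTorch engine only (see the table above).
pipe = pipeline('OpenGVLab/Mono-InternVL-2B',
                backend_config=PytorchEngineConfig())

# Any local path or reachable URL works here; this image is just an example.
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response.text)
```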

1 change: 1 addition & 0 deletions docs/en/multi_modal/vl_pipeline.md
@@ -9,6 +9,7 @@ Currently, it supports the following models.
- [Yi-VL](https://huggingface.co/01-ai/Yi-VL-6B)
- [DeepSeek-VL](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)
- [InternVL](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)
+ - [Mono-InternVL](https://huggingface.co/OpenGVLab/Mono-InternVL-2B)
- [MGM](https://huggingface.co/YanweiLi/MGM-7B)
- [XComposer](https://huggingface.co/internlm/internlm-xcomposer2-vl-7b)
- [CogVLM](https://github.com/InternLM/lmdeploy/tree/main/docs/en/multi_modal/cogvlm.md)
5 changes: 5 additions & 0 deletions docs/en/supported_models/supported_models.md
@@ -80,6 +80,7 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
| LLaVA(1.5,1.6) | 7B-34B | MLLM | Yes | Yes | Yes | No | - |
| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | No | - |
+ | Mono-InternVL | 2B | MLLM | Yes\* | Yes | Yes | No | - |
| Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | No | - |
| GLM4 | 9B | LLM | Yes | Yes | Yes | No | No |
| GLM-4V | 9B | MLLM | Yes | Yes | Yes | No | No |
@@ -88,6 +89,10 @@
| Phi-3.5-MoE | 16x3.8B | LLM | Yes | Yes | No | No | - |
| Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | No | - |

+ ```{note}
+ * Currently Mono-InternVL does not support FP16 due to numerical instability. Please use BF16 instead.
+ ```
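
The FP16 restriction above is enforced by the assertion this commit adds ("add assertion for FP16" in the commit log). A hedged sketch of the guard's intent, using a hypothetical helper name (the real check lives inside the PyTorch engine's model-loading code):

```python
import torch

def check_mono_internvl_dtype(dtype: torch.dtype) -> None:
    # Hypothetical mirror of the new guard: reject float16 for Mono-InternVL,
    # which is numerically unstable in that format; bfloat16 is the supported path.
    assert dtype != torch.float16, (
        'Mono-InternVL does not support FP16 due to numerical instability; '
        'please use BF16 instead.')

check_mono_internvl_dtype(torch.bfloat16)  # passes; float16 would raise
```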

## PyTorchEngine on Huawei Ascend Platform

| Model | Size | Type | FP16/BF16 | W4A16 |
17 changes: 9 additions & 8 deletions docs/zh_cn/multi_modal/internvl.md
@@ -2,14 +2,15 @@

LMDeploy supports the InternVL series of models, as detailed below:

- | Model | Size | Supported Inference Engine |
- | :---------: | :--------: | :------------------------: |
- | InternVL | 13B-19B | TurboMind |
- | InternVL1.5 | 2B-26B | TurboMind, PyTorch |
- | InternVL2 | 1B, 4B | PyTorch |
- | InternVL2 | 2B, 8B-76B | TurboMind, PyTorch |
- This article takes [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) as an example to demonstrate how to deploy the InternVL series of models with LMDeploy

+ | Model | Size | Supported Inference Engine |
+ | :-----------: | :--------: | :------------------------: |
+ | InternVL | 13B-19B | TurboMind |
+ | InternVL1.5 | 2B-26B | TurboMind, PyTorch |
+ | InternVL2 | 1B, 4B | PyTorch |
+ | InternVL2 | 2B, 8B-76B | TurboMind, PyTorch |
+ | Mono-InternVL | 2B | PyTorch |

+ This article takes [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) as an example to demonstrate how to deploy the InternVL series of models with LMDeploy.

## Installation

1 change: 1 addition & 0 deletions docs/zh_cn/multi_modal/vl_pipeline.md
@@ -9,6 +9,7 @@ LMDeploy abstracts the complex inference process of vision-language models (VLM) into a simple
- [Yi-VL](https://huggingface.co/01-ai/Yi-VL-6B)
- [DeepSeek-VL](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)
- [InternVL](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)
+ - [Mono-InternVL](https://huggingface.co/OpenGVLab/Mono-InternVL-2B)
- [MGM](https://huggingface.co/YanweiLi/MGM-7B)
- [XComposer](https://huggingface.co/internlm/internlm-xcomposer2-vl-7b)
- [CogVLM](https://github.com/InternLM/lmdeploy/tree/main/docs/zh_cn/multi_modal/cogvlm.md)
5 changes: 5 additions & 0 deletions docs/zh_cn/supported_models/supported_models.md
@@ -80,6 +80,7 @@ The turbomind engine does not support window attention. Therefore, for models that use window att
| LLaVA(1.5,1.6) | 7B-34B | MLLM | Yes | Yes | Yes | No | - |
| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | No | - |
+ | Mono-InternVL | 2B | MLLM | Yes\* | Yes | Yes | No | - |
| Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | No | - |
| GLM4 | 9B | LLM | Yes | Yes | Yes | No | No |
| GLM-4V | 9B | MLLM | Yes | Yes | Yes | No | No |
@@ -88,6 +89,10 @@
| Phi-3.5-MoE | 16x3.8B | LLM | Yes | Yes | No | No | - |
| Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | No | - |

+ ```{note}
+ * Currently Mono-InternVL does not support FP16 due to numerical instability. Please use BF16 instead.
+ ```

## PyTorchEngine on Huawei Ascend Platform

| Model | Size | Type | FP16/BF16 | W4A16 |
3 changes: 2 additions & 1 deletion lmdeploy/model.py
@@ -578,7 +578,8 @@ def match(cls, model_path: str) -> Optional[str]:
model_path (str): the model path used for matching.
"""
path = model_path.lower()
- if 'internvl2' in path and 'internvl2-4b' not in path:
+ if ('internvl2' in path
+         and 'internvl2-4b' not in path) or 'mono-internvl' in path:
return 'internvl2-internlm2'
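
Read on its own, the widened condition simply routes Mono-InternVL checkpoints to the chat template already used by the InternLM2-based InternVL2 models (the 4B variant is excluded because it is Phi-3-based). A standalone sketch of the behavior, as a hypothetical helper mirroring the classmethod above:

```python
from typing import Optional

def match_internvl2_internlm2(model_path: str) -> Optional[str]:
    # Mirrors the rule above: InternVL2 checkpoints (except the Phi-3-based
    # 4B variant) and Mono-InternVL share the 'internvl2-internlm2' template.
    path = model_path.lower()
    if ('internvl2' in path
            and 'internvl2-4b' not in path) or 'mono-internvl' in path:
        return 'internvl2-internlm2'
    return None

assert match_internvl2_internlm2('OpenGVLab/Mono-InternVL-2B') == 'internvl2-internlm2'
assert match_internvl2_internlm2('OpenGVLab/InternVL2-4B') is None
```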


2 changes: 1 addition & 1 deletion lmdeploy/pytorch/models/baichuan.py
@@ -167,7 +167,7 @@ def __init__(self,
# build attention layer
self.self_attn = BaichuanAttention(config, dtype=dtype, device=device)

- # builf MLP
+ # build MLP
self.mlp = MLP(config, dtype=dtype, device=device)

# build input layer norm
2 changes: 1 addition & 1 deletion lmdeploy/pytorch/models/chatglm2.py
@@ -279,7 +279,7 @@ def __init__(self,
# build attention layer
self.self_attention = SelfAttention(config, dtype=dtype, device=device)

- # builf MLP
+ # build MLP
self.mlp = MLP(config, dtype=dtype, device=device)

# build input layer norm
2 changes: 1 addition & 1 deletion lmdeploy/pytorch/models/cogvlm.py
@@ -263,7 +263,7 @@ def __init__(self,
dtype=dtype,
device=device)

- # builf MLP
+ # build MLP
self.mlp = VisionExpertMLP(config, dtype=dtype, device=device)

# build input layer norm
2 changes: 1 addition & 1 deletion lmdeploy/pytorch/models/dbrx.py
@@ -301,7 +301,7 @@ def __init__(self,
dtype=dtype,
device=device)

- # builf MLP
+ # build MLP
self.ffn = DbrxFFN(config, dtype=dtype, device=device)

def forward(
2 changes: 1 addition & 1 deletion lmdeploy/pytorch/models/deepseek.py
@@ -250,7 +250,7 @@ def __init__(self,
# build attention layer
self.self_attn = DeepseekAttention(config, dtype=dtype, device=device)

- # builf MLP
+ # build MLP
self.mlp = (DeepseekMoE(config, dtype=dtype, device=device) if
(config.n_routed_experts is not None
and layer_idx >= config.first_k_dense_replace
2 changes: 1 addition & 1 deletion lmdeploy/pytorch/models/falcon.py
@@ -179,7 +179,7 @@ def __init__(self,
dtype=dtype,
device=device)

- # builf MLP
+ # build MLP
self.mlp = FalconMLP(config, dtype=dtype, device=device)

if not hasattr(config, 'num_ln_in_parallel_attn'):
2 changes: 1 addition & 1 deletion lmdeploy/pytorch/models/gemma.py
@@ -177,7 +177,7 @@ def __init__(self,
dtype=dtype,
device=device)

- # builf MLP
+ # build MLP
self.mlp = GemmaMLP(config, dtype=dtype, device=device)

# build input layer norm
2 changes: 1 addition & 1 deletion lmdeploy/pytorch/models/internlm.py
@@ -161,7 +161,7 @@ def __init__(self,
# build attention layer
self.self_attn = InternLMAttention(config, dtype=dtype, device=device)

- # builf MLP
+ # build MLP
self.mlp = InternLMMLP(config, dtype=dtype, device=device)

# build input layer norm
2 changes: 1 addition & 1 deletion lmdeploy/pytorch/models/internlm2.py
@@ -160,7 +160,7 @@ def __init__(self,
# build attention layer
self.attention = InternLM2Attention(config, dtype=dtype, device=device)

- # builf MLP
+ # build MLP
self.feed_forward = InternLM2MLP(config, dtype=dtype, device=device)

# build input layer norm