Support loading hf model directly #685

Merged · 33 commits · Nov 22, 2023

Commits
0f7a35b  turbomind support export model params (irexyc, Nov 10, 2023)
542882a  fix overflow (irexyc, Nov 10, 2023)
9f30d4f  support turbomind.from_pretrained (irexyc, Nov 10, 2023)
8605521  fix tp (irexyc, Nov 10, 2023)
a050907  support AutoModel (irexyc, Nov 13, 2023)
a3f5fc5  support load kv qparams (irexyc, Nov 13, 2023)
4426c0a  update auto_awq (irexyc, Nov 13, 2023)
f24c905  udpate docstring (irexyc, Nov 14, 2023)
371320a  export lmdeploy version (irexyc, Nov 14, 2023)
fe81ce9  update doc (irexyc, Nov 14, 2023)
1c4db1e  remove download_hf_repo (irexyc, Nov 15, 2023)
85f0ed0  LmdeployForCausalLM -> LmdeployForCausalLM (irexyc, Nov 15, 2023)
5619b44  refactor turbomind.py (irexyc, Nov 15, 2023)
00b21d4  update comment (irexyc, Nov 15, 2023)
d868694  Merge remote-tracking branch 'origin/main' into from_pretrained2 (irexyc, Nov 15, 2023)
a827412  add bfloat16 convert back (irexyc, Nov 15, 2023)
197133d  support gradio run_locl load hf (irexyc, Nov 15, 2023)
47fd6a8  support resuful api server load hf (irexyc, Nov 15, 2023)
8dd4876  add docs (irexyc, Nov 15, 2023)
ae67e87  support loading previous quantized model (irexyc, Nov 15, 2023)
db51a06  adapt pr 690 (irexyc, Nov 15, 2023)
68962ce  udpate docs (irexyc, Nov 16, 2023)
c6176f3  resolve conflict in auto_awq.py (irexyc, Nov 16, 2023)
4c4ae26  not export turbomind config when quantize a model (irexyc, Nov 17, 2023)
2562724  check model_name when can not get it from config.json (irexyc, Nov 17, 2023)
f41dce4  update readme (irexyc, Nov 17, 2023)
7fa302c  remove model_name in auto_awq (irexyc, Nov 20, 2023)
4db0e25  Merge remote-tracking branch 'origin/main' into from_pretrained (irexyc, Nov 20, 2023)
4e82cdf  update (irexyc, Nov 21, 2023)
0f9c6f0  update (irexyc, Nov 21, 2023)
b470f06  udpate (irexyc, Nov 22, 2023)
d3c5d01  fix build (irexyc, Nov 22, 2023)
6ce951f  absolute import (irexyc, Nov 22, 2023)
1 change: 1 addition & 0 deletions .gitignore
@@ -58,6 +58,7 @@ work_dir*/
*.bin
*config.json
*generate_config.json
!lmdeploy/turbomind/hf_repo/config.json

# Pytorch
*.pth
52 changes: 9 additions & 43 deletions README.md
@@ -20,6 +20,7 @@ ______________________________________________________________________

## News 🎉

- \[2023/11\] TurboMind supports loading Hugging Face models directly. Click [here](./docs/en/load_hf.md) for details.
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
@@ -114,30 +115,18 @@ pip install lmdeploy

### Deploy InternLM

#### Get InternLM model
To use the TurboMind inference engine, the model must first be converted into TurboMind format. Currently, both online and offline conversion are supported. With online conversion, TurboMind can load a Hugging Face model directly, while with offline conversion you need to save the converted model before using it.

```shell
# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how to use TurboMind with online conversion. You can refer to [load_hf.md](docs/en/load_hf.md) for the other methods.

#### Inference by TurboMind

```shell
lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

> **Note**<br /> The internlm/internlm-chat-7b-v1_1 model will be downloaded to the `.cache` folder. You can also use a local path here.
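
For example, if the model has already been downloaded, you can point TurboMind at the local directory instead (the path below is illustrative):

```shell
lmdeploy chat turbomind /path/to/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```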

> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind. <br />
> It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
@@ -152,7 +141,7 @@ lmdeploy chat turbomind ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio ./workspace
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -165,13 +154,13 @@ Launch inference server by:
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```
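
Besides `api_client`, you can also talk to the server over plain HTTP. The request below is only an illustrative sketch: the `/v1/chat/completions` route and the payload fields are assumptions, so check [restful_api.md](docs/en/restful_api.md) for the endpoints your version actually exposes.

```shell
# Hypothetical request; verify the endpoint and fields against docs/en/restful_api.md.
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```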

@@ -186,29 +175,6 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port

Refer to [restful_api.md](docs/en/restful_api.md) for more details.

#### Serving with Triton Inference Server

Launch inference server by:

```shell
bash workspace/service_docker_up.sh
```

Then, you can communicate with the inference server by command line,

```shell
python3 -m pip install tritonclient[grpc]
lmdeploy serve triton_client {server_ip_addresss}:33337
```

or webui,

```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```

For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna and so on, you can find the guide [here](docs/en/serving.md)

### Inference with PyTorch

For detailed instructions on inference with PyTorch models, see [here](docs/en/pytorch.md).
52 changes: 9 additions & 43 deletions README_zh-CN.md
@@ -20,6 +20,7 @@ ______________________________________________________________________

## Updates 🎉

- \[2023/11\] TurboMind supports loading Hugging Face models directly. Click [here](./docs/en/load_hf.md) for usage instructions.
- \[2023/11\] TurboMind major upgrade, including: Paged Attention, faster attention kernels without a maximum sequence length limit, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports the InternLM-20B model
@@ -114,30 +115,18 @@ pip install lmdeploy

### Deploy InternLM

#### Get the InternLM model
To run inference with TurboMind, the model must first be converted into TurboMind format. Both online and offline conversion are currently supported. Online conversion loads the Hugging Face model directly, while offline conversion requires saving the converted model before loading it.

```shell
# 1. Download the InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert to the format required by TurboMind. The default output path is ./workspace
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how online conversion is used. For the other methods, refer to [load_hf.md](docs/zh_cn/load_hf.md).

#### Inference with TurboMind

```shell
lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

> **Note**<br /> The internlm/internlm-chat-7b-v1_1 model is downloaded automatically to the `.cache` folder. A local path to an already downloaded model can also be passed here.

> **Note**<br />
> When inferring the InternLM-7B model with FP16 precision, TurboMind requires at least 15.7 GB of GPU memory. GPUs such as the 3090, V100 and A100 are recommended.<br />
> Disabling ECC on the GPU can free up about 10% of the memory; run `sudo nvidia-smi --ecc-config=0` and reboot the system for it to take effect.
@@ -151,7 +140,7 @@ lmdeploy chat turbomind ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio ./workspace
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -164,13 +153,13 @@ lmdeploy serve gradio ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
```

You can talk to the inference server from the command line:

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```

@@ -185,29 +174,6 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port

See [restful_api.md](docs/zh_cn/restful_api.md) for more details.

#### Deploy the inference service via a container

Launch the inference server with the following command:

```shell
bash workspace/service_docker_up.sh
```

You can talk to the inference server from the command line:

```shell
python3 -m pip install tritonclient[grpc]
lmdeploy serve triton_client {server_ip_addresss}:33337
```

You can also chat through the WebUI:

```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```

For deploying other supported models, such as LLaMA, LLaMA-2, Vicuna and so on, please refer to [here](docs/zh_cn/serving.md)

### Inference with PyTorch

Make sure deepspeed is installed in your environment:
72 changes: 72 additions & 0 deletions docs/en/load_hf.md
@@ -0,0 +1,72 @@
# Load Hugging Face models directly

Before v0.0.14, if you wanted to run serving or inference with TurboMind, you had to convert the model to TurboMind format first. With offline conversion the model loads faster, but the workflow is not user-friendly. Therefore, LMDeploy adds online conversion and supports loading Hugging Face models directly.
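
In practice, the difference between the two modes looks like this (a minimal sketch assembled from the commands shown in the Usage section below; the InternLM repo id is just an example):

```
# offline conversion: convert first, then point TurboMind at the workspace
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b-v1_1
lmdeploy chat turbomind ./workspace

# online conversion: load the Hugging Face model directly
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```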

## Supported model types

Currently, TurboMind supports loading three types of models:

1. An lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.
2. Other popular LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
3. A model converted by `lmdeploy convert` (the legacy format)

## Usage

### 1) A quantized model managed by lmdeploy / internlm

For quantized models managed by lmdeploy or internlm, the parameters required for online conversion already exist in config.json, so you only need to pass the repo_id or a local path when using it.

> If config.json has not been updated yet, you also need to pass the `--model-name` parameter; please refer to 2).

```
repo_id=lmdeploy/qwen-chat-7b-4bit
# or
# repo_id=/path/to/managed_model

# Inference by TurboMind
lmdeploy chat turbomind $repo_id

# Serving with gradio
lmdeploy serve gradio $repo_id

# Serving with Restful API
lmdeploy serve api_server $repo_id --instance_num 32 --tp 1
```

### 2) Other popular LM models

For other popular models such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the model name needs to be passed in as well. The models supported by LMDeploy can be listed with `lmdeploy list`.

```
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```
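
If you are not sure which `--model-name` to pass, the supported model names can be printed first with the `lmdeploy list` command mentioned above:

```
lmdeploy list
```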

### 3) A model converted by `lmdeploy convert`

The usage is the same as before:

```
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME

# Inference by TurboMind
lmdeploy chat turbomind ./workspace

# Serving with gradio
lmdeploy serve gradio ./workspace

# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```
72 changes: 72 additions & 0 deletions docs/zh_cn/load_hf.md
@@ -0,0 +1,72 @@
# Load Hugging Face models directly

Before v0.0.14, to run inference or serving with LMDeploy you first had to use the `lmdeploy convert` command to convert the model offline into the format supported by the TurboMind inference engine. The converted model loads faster, but the workflow is not user-friendly. Therefore, LMDeploy adds online conversion and supports loading Hugging Face models directly.

## Supported model types

Currently, TurboMind supports loading three types of models:

1. Models quantized with lmdeploy and hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit)
2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
3. Models converted with the `lmdeploy convert` command (the legacy format)

## Usage

### 1) Quantized models managed by lmdeploy / internlm

For models managed by lmdeploy / internlm, config.json already contains the parameters needed for online conversion, so you only need to pass the repo_id or a local path when using them.

> If config.json has not been updated yet, you also need to pass the `--model-name` parameter; see 2).

```
repo_id=lmdeploy/qwen-chat-7b-4bit
# or
# repo_id=/path/to/managed_model

# Inference by TurboMind
lmdeploy chat turbomind $repo_id

# Serving with gradio
lmdeploy serve gradio $repo_id

# Serving with Restful API
lmdeploy serve api_server $repo_id --instance_num 32 --tp 1
```

### 2) Other LM models

For other popular models such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the model name needs to be passed in. The models supported by LMDeploy can be listed with `lmdeploy list`.

```
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```

### 3) Models converted with `lmdeploy convert`

The usage is the same as before:

```
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME

# Inference by TurboMind
lmdeploy chat turbomind ./workspace

# Serving with gradio
lmdeploy serve gradio ./workspace

# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```