Support loading hf model directly #685

Merged · 33 commits · Nov 22, 2023
Commits (33)
0f7a35b
turbomind support export model params
irexyc Nov 10, 2023
542882a
fix overflow
irexyc Nov 10, 2023
9f30d4f
support turbomind.from_pretrained
irexyc Nov 10, 2023
8605521
fix tp
irexyc Nov 10, 2023
a050907
support AutoModel
irexyc Nov 13, 2023
a3f5fc5
support load kv qparams
irexyc Nov 13, 2023
4426c0a
update auto_awq
irexyc Nov 13, 2023
f24c905
update docstring
irexyc Nov 14, 2023
371320a
export lmdeploy version
irexyc Nov 14, 2023
fe81ce9
update doc
irexyc Nov 14, 2023
1c4db1e
remove download_hf_repo
irexyc Nov 15, 2023
85f0ed0
LmdeployForCausalLM -> LmdeployForCausalLM
irexyc Nov 15, 2023
5619b44
refactor turbomind.py
irexyc Nov 15, 2023
00b21d4
update comment
irexyc Nov 15, 2023
d868694
Merge remote-tracking branch 'origin/main' into from_pretrained2
irexyc Nov 15, 2023
a827412
add bfloat16 convert back
irexyc Nov 15, 2023
197133d
support gradio run_local load hf
irexyc Nov 15, 2023
47fd6a8
support restful api server load hf
irexyc Nov 15, 2023
8dd4876
add docs
irexyc Nov 15, 2023
ae67e87
support loading previous quantized model
irexyc Nov 15, 2023
db51a06
adapt pr 690
irexyc Nov 15, 2023
68962ce
update docs
irexyc Nov 16, 2023
c6176f3
resolve conflict in auto_awq.py
irexyc Nov 16, 2023
4c4ae26
not export turbomind config when quantize a model
irexyc Nov 17, 2023
2562724
check model_name when can not get it from config.json
irexyc Nov 17, 2023
f41dce4
update readme
irexyc Nov 17, 2023
7fa302c
remove model_name in auto_awq
irexyc Nov 20, 2023
4db0e25
Merge remote-tracking branch 'origin/main' into from_pretrained
irexyc Nov 20, 2023
4e82cdf
update
irexyc Nov 21, 2023
0f9c6f0
update
irexyc Nov 21, 2023
b470f06
update
irexyc Nov 22, 2023
d3c5d01
fix build
irexyc Nov 22, 2023
6ce951f
absolute import
irexyc Nov 22, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -58,6 +58,7 @@ work_dir*/
*.bin
*config.json
*generate_config.json
!lmdeploy/turbomind/hf_repo/config.json

# Pytorch
*.pth
52 changes: 9 additions & 43 deletions README.md
@@ -20,6 +20,7 @@ ______________________________________________________________________

## News 🎉

- \[2023/11\] TurboMind supports loading HF models directly. Click [here](./docs/en/load_hf.md) for details.
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
@@ -114,30 +115,18 @@ pip install lmdeploy

### Deploy InternLM

#### Get InternLM model
To use the TurboMind inference engine, the model must first be converted into TurboMind format. Both online and offline conversion are currently supported: with online conversion, TurboMind loads the Hugging Face model directly, whereas with offline conversion, you save the converted model first and then load it.

```shell
# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how to use TurboMind with online conversion. Refer to [load_hf.md](docs/en/load_hf.md) for other methods.
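
In short, the two paths differ only in what you pass to the CLI. A minimal comparison, built from the commands on this page with placeholder paths:

```shell
# online conversion: pass a huggingface repo id (or a local huggingface folder) directly
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b

# offline conversion: convert once (written to ./workspace by default), then load the workspace
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy chat turbomind ./workspace
```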

#### Inference by TurboMind

```shell
lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

> **Note**<br /> The internlm/internlm-chat-7b-v1_1 model will be downloaded to the `.cache` folder. You can also use a local path here.
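
For example, if the weights were already downloaded (e.g. via `git clone`), you can pass the local directory instead of the repo id; the download location can also be redirected with the standard Hugging Face `HF_HOME` cache variable. A sketch with hypothetical paths:

```shell
# load from a local copy instead of downloading again
lmdeploy chat turbomind /path/to/internlm-chat-7b-v1_1 --model-name internlm-chat-7b

# or keep the automatic download but point the huggingface cache somewhere else
export HF_HOME=/data/hf_cache
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```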

> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind. <br />
> It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
@@ -152,7 +141,7 @@ lmdeploy chat turbomind ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio ./workspace
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -165,13 +154,13 @@ Launch inference server by:
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is the address printed by api_server, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```

@@ -186,29 +175,6 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port

Refer to [restful_api.md](docs/en/restful_api.md) for more details.
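
Besides the CLI client and the Gradio UI, the server can be queried over plain HTTP. The sketch below assumes an OpenAI-style `/v1/chat/completions` route on the default port 23333; the route and payload are assumptions, so check [restful_api.md](docs/en/restful_api.md) for the endpoints your version actually exposes.

```shell
# hypothetical request; verify the route against restful_api.md for your version
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm-chat-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```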

#### Serving with Triton Inference Server

Launch inference server by:

```shell
bash workspace/service_docker_up.sh
```

Then, you can communicate with the inference server by command line,

```shell
python3 -m pip install tritonclient[grpc]
lmdeploy serve triton_client {server_ip_addresss}:33337
```

or webui,

```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```

For deploying other supported models, such as LLaMA, LLaMA-2, Vicuna, and so on, you can find the guide [here](docs/en/serving.md)

### Inference with PyTorch

For detailed instructions on running inference with PyTorch models, see [here](docs/en/pytorch.md).
52 changes: 9 additions & 43 deletions README_zh-CN.md
@@ -20,6 +20,7 @@ ______________________________________________________________________

## News 🎉

- \[2023/11\] TurboMind supports loading Hugging Face models directly. Click [here](./docs/en/load_hf.md) for usage details.
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, more than 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference supporting the sm_75 architecture
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
@@ -114,30 +115,18 @@ pip install lmdeploy

### Deploy InternLM

#### Get the InternLM model
To run inference with the TurboMind engine, the model must first be converted into TurboMind format. Both online and offline conversion are supported: online conversion loads the Hugging Face model directly, while offline conversion requires saving the converted model before loading it.

```shell
# 1. Download the InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert to the format required by TurboMind. The default output path is ./workspace
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how to use online conversion. Refer to [load_hf.md](docs/zh_cn/load_hf.md) for other methods.

#### Inference with TurboMind

```shell
lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

> **Note**<br /> internlm/internlm-chat-7b-v1_1 will be downloaded automatically to the `.cache` folder. A path to already-downloaded weights can also be passed here.

> **Note**<br />
> When inferring the InternLM-7B model with FP16 precision, TurboMind requires at least 15.7 GB of GPU memory. GPUs such as the 3090, V100, and A100 are recommended.<br />
> Disabling the GPU's ECC frees up about 10% of memory; run `sudo nvidia-smi --ecc-config=0` and reboot for it to take effect.
@@ -151,7 +140,7 @@ lmdeploy chat turbomind ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio ./workspace
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -164,13 +153,13 @@ lmdeploy serve gradio ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
```

You can chat with the inference service from the command line:

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is the address printed by api_server, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```

@@ -185,29 +174,6 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port

Refer to [restful_api.md](docs/zh_cn/restful_api.md) for more details.

#### Serving via container

Launch the inference service with the following command:

```shell
bash workspace/service_docker_up.sh
```

You can chat with the inference service from the command line:

```shell
python3 -m pip install tritonclient[grpc]
lmdeploy serve triton_client {server_ip_addresss}:33337
```

or chat through the WebUI:

```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```

For deploying other models, such as LLaMA, LLaMA-2, Vicuna, and so on, refer to the guide [here](docs/zh_cn/serving.md)

### Inference with PyTorch

Make sure deepspeed is installed in your environment:
71 changes: 71 additions & 0 deletions docs/en/load_hf.md
@@ -0,0 +1,71 @@
# Load huggingface model directly

Starting from v0.1.0, TurboMind can pre-process model parameters on the fly while loading them from Hugging Face-style models.

## Supported model types

Currently, TurboMind supports loading three types of models:

1. An lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) or [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit)
2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
3. A model converted by `lmdeploy convert` (legacy format)

## Usage

### 1) An lmdeploy-quantized model

TurboMind can directly load models quantized by `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit).

```shell
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
# or
# repo_id=/path/to/downloaded_model

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```
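
To produce such a 4-bit model yourself, the quantization entry point is `lmdeploy lite auto_awq`. The flags below are an assumption about its interface, so consult the `lmdeploy.lite` documentation for your version before running it:

```shell
# hypothetical invocation of the AWQ quantizer in lmdeploy.lite; flag names may differ by version
lmdeploy lite auto_awq /path/to/internlm-chat-20b \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./internlm-chat-20b-4bit
```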

### 2) Other LM models

For other LM models, such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the usage is the same. The models supported by LMDeploy can be listed with `lmdeploy list`.
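
This is a quick way to check whether a model's chat template is built in; the names it prints are what `--model-name` expects:

```shell
# list the model names / chat templates that lmdeploy recognizes
lmdeploy list
```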

```shell
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```
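
For larger checkpoints, the same commands can be spread over several GPUs with the `--tp` flag. A sketch with an illustrative repo id and model name:

```shell
# serve a bigger chat model with 2-way tensor parallelism (repo id and name are illustrative)
lmdeploy serve api_server Qwen/Qwen-14B-Chat --model-name qwen-14b --instance_num 32 --tp 2
```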

### 3) A model converted by `lmdeploy convert`

Usage is the same as in the previous examples:

```shell
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME

# Inference by TurboMind
lmdeploy chat turbomind ./workspace

# Serving with gradio
lmdeploy serve gradio ./workspace

# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```
72 changes: 72 additions & 0 deletions docs/zh_cn/load_hf.md
@@ -0,0 +1,72 @@
# Load huggingface models directly

Starting from v0.1.0, TurboMind can directly load weights in Hugging Face format.

## Supported model types

Currently, TurboMind supports loading three types of models:

1. Models on huggingface.co quantized with lmdeploy, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit)
2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
3. Models converted with the `lmdeploy convert` command (compatible with the legacy format)

## Usage

### 1) Models quantized with lmdeploy

TurboMind can directly load models quantized with `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit).

```shell
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b

# or
# repo_id=/path/to/downloaded_model

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```

### 2) Other LM models

Other LM models, such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, can be loaded in the same way. The models supported by LMDeploy can be listed with `lmdeploy list`.

```shell
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```

### 3) Models converted with `lmdeploy convert`

Usage is the same as before:

```shell
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME

# Inference by TurboMind
lmdeploy chat turbomind ./workspace

# Serving with gradio
lmdeploy serve gradio ./workspace

# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```