Support loading hf model directly #685

Merged · 33 commits · Nov 22, 2023

Commits
0f7a35b  turbomind support export model params (irexyc, Nov 10, 2023)
542882a  fix overflow (irexyc, Nov 10, 2023)
9f30d4f  support turbomind.from_pretrained (irexyc, Nov 10, 2023)
8605521  fix tp (irexyc, Nov 10, 2023)
a050907  support AutoModel (irexyc, Nov 13, 2023)
a3f5fc5  support load kv qparams (irexyc, Nov 13, 2023)
4426c0a  update auto_awq (irexyc, Nov 13, 2023)
f24c905  udpate docstring (irexyc, Nov 14, 2023)
371320a  export lmdeploy version (irexyc, Nov 14, 2023)
fe81ce9  update doc (irexyc, Nov 14, 2023)
1c4db1e  remove download_hf_repo (irexyc, Nov 15, 2023)
85f0ed0  LmdeployForCausalLM -> LmdeployForCausalLM (irexyc, Nov 15, 2023)
5619b44  refactor turbomind.py (irexyc, Nov 15, 2023)
00b21d4  update comment (irexyc, Nov 15, 2023)
d868694  Merge remote-tracking branch 'origin/main' into from_pretrained2 (irexyc, Nov 15, 2023)
a827412  add bfloat16 convert back (irexyc, Nov 15, 2023)
197133d  support gradio run_locl load hf (irexyc, Nov 15, 2023)
47fd6a8  support resuful api server load hf (irexyc, Nov 15, 2023)
8dd4876  add docs (irexyc, Nov 15, 2023)
ae67e87  support loading previous quantized model (irexyc, Nov 15, 2023)
db51a06  adapt pr 690 (irexyc, Nov 15, 2023)
68962ce  udpate docs (irexyc, Nov 16, 2023)
c6176f3  resolve conflict in auto_awq.py (irexyc, Nov 16, 2023)
4c4ae26  not export turbomind config when quantize a model (irexyc, Nov 17, 2023)
2562724  check model_name when can not get it from config.json (irexyc, Nov 17, 2023)
f41dce4  update readme (irexyc, Nov 17, 2023)
7fa302c  remove model_name in auto_awq (irexyc, Nov 20, 2023)
4db0e25  Merge remote-tracking branch 'origin/main' into from_pretrained (irexyc, Nov 20, 2023)
4e82cdf  update (irexyc, Nov 21, 2023)
0f9c6f0  update (irexyc, Nov 21, 2023)
b470f06  udpate (irexyc, Nov 22, 2023)
d3c5d01  fix build (irexyc, Nov 22, 2023)
6ce951f  absolute import (irexyc, Nov 22, 2023)
1 change: 1 addition & 0 deletions .gitignore
@@ -58,6 +58,7 @@ work_dir*/
*.bin
*config.json
*generate_config.json
!lmdeploy/turbomind/hf_repo/config.json

# Pytorch
*.pth
52 changes: 9 additions & 43 deletions README.md
@@ -20,6 +20,7 @@ ______________________________________________________________________

## News 🎉

- \[2023/11\] TurboMind supports loading Hugging Face models directly. Click [here](./docs/en/load_hf.md) for details.
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
@@ -114,30 +115,18 @@ pip install lmdeploy

### Deploy InternLM

#### Get InternLM model
To use the TurboMind inference engine, the model must first be converted into TurboMind format. Currently, both online and offline conversion are supported. With online conversion, TurboMind can load a Hugging Face model directly, while with offline conversion you need to save the converted model before using it.

```shell
# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how to use TurboMind with online conversion. You can refer to [load_hf.md](docs/en/load_hf.md) for the other methods.

#### Inference by TurboMind

```shell
lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

> **Note**<br /> The internlm/internlm-chat-7b-v1_1 model will be downloaded to the `.cache` folder. You can also use a local path here.
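
For example, if the model has already been downloaded, you can point TurboMind at the local directory instead (the path below is illustrative):

```shell
lmdeploy chat turbomind /path/to/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```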

> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind. <br />
> It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
@@ -152,7 +141,7 @@ lmdeploy chat turbomind ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio ./workspace
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -165,13 +154,13 @@ Launch inference server by:
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```
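
Besides `api_client`, you can also talk to the server over plain HTTP. The request below is only an illustrative sketch: the `/v1/chat/completions` route and the payload fields are assumptions, so check [restful_api.md](docs/en/restful_api.md) for the endpoints your version actually exposes.

```shell
# Hypothetical request; verify the endpoint and fields against docs/en/restful_api.md.
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```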

@@ -186,29 +175,6 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port

Refer to [restful_api.md](docs/en/restful_api.md) for more details.

#### Serving with Triton Inference Server

Launch inference server by:

```shell
bash workspace/service_docker_up.sh
```

Then, you can communicate with the inference server by command line,

```shell
python3 -m pip install tritonclient[grpc]
lmdeploy serve triton_client {server_ip_addresss}:33337
```

or webui,

```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```

For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna and so on, you can find the guide [here](docs/en/serving.md)

### Inference with PyTorch

For detailed instructions on inference with PyTorch models, see [here](docs/en/pytorch.md).
52 changes: 9 additions & 43 deletions README_zh-CN.md
@@ -20,6 +20,7 @@ ______________________________________________________________________

## Updates 🎉

- \[2023/11\] TurboMind supports loading Hugging Face models directly. Click [here](./docs/en/load_hf.md) for usage instructions.
- \[2023/11\] TurboMind major upgrade, including: Paged Attention, faster attention kernels without a maximum sequence length limit, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports the InternLM-20B model
@@ -114,30 +115,18 @@ pip install lmdeploy

### Deploy InternLM

#### Get the InternLM model
To run inference with TurboMind, the model must first be converted into TurboMind format. Both online and offline conversion are currently supported. Online conversion loads the Hugging Face model directly, while offline conversion requires saving the converted model before loading it.

```shell
# 1. Download the InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert to the format required by TurboMind. The default output path is ./workspace
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how online conversion is used. For the other methods, refer to [load_hf.md](docs/zh_cn/load_hf.md).

#### Inference with TurboMind

```shell
lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

> **Note**<br /> The internlm/internlm-chat-7b-v1_1 model is downloaded automatically to the `.cache` folder. A local path to an already downloaded model can also be passed here.

> **Note**<br />
> When inferring the InternLM-7B model with FP16 precision, TurboMind requires at least 15.7 GB of GPU memory. GPUs such as the 3090, V100 and A100 are recommended.<br />
> Disabling ECC on the GPU can free up about 10% of the memory; run `sudo nvidia-smi --ecc-config=0` and reboot the system for it to take effect.
@@ -151,7 +140,7 @@ lmdeploy chat turbomind ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio ./workspace
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -164,13 +153,13 @@ lmdeploy serve gradio ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
```

You can talk to the inference server from the command line:

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```

@@ -185,29 +174,6 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port

See [restful_api.md](docs/zh_cn/restful_api.md) for more details.

#### Deploy the inference service via a container

Launch the inference server with the following command:

```shell
bash workspace/service_docker_up.sh
```

You can talk to the inference server from the command line:

```shell
python3 -m pip install tritonclient[grpc]
lmdeploy serve triton_client {server_ip_addresss}:33337
```

You can also chat through the WebUI:

```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```

For deploying other supported models, such as LLaMA, LLaMA-2, Vicuna and so on, please refer to [here](docs/zh_cn/serving.md)

### Inference with PyTorch

Make sure deepspeed is installed in your environment:
72 changes: 72 additions & 0 deletions docs/en/load_hf.md
@@ -0,0 +1,72 @@
# Load Hugging Face models directly

Before v0.0.14, if you wanted to run serving or inference with TurboMind, you had to convert the model to TurboMind format first. With offline conversion the model loads faster, but the workflow is not user-friendly. Therefore, LMDeploy adds online conversion and supports loading Hugging Face models directly.
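
In practice, the difference between the two modes looks like this (a minimal sketch assembled from the commands shown in the Usage section below; the InternLM repo id is just an example):

```
# offline conversion: convert first, then point TurboMind at the workspace
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b-v1_1
lmdeploy chat turbomind ./workspace

# online conversion: load the Hugging Face model directly
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```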

## Supported model types

Currently, TurboMind supports loading three types of models:

1. An lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.
2. Other popular LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
3. A model converted by `lmdeploy convert` (the legacy format)

## Usage

### 1) A quantized model managed by lmdeploy / internlm

For quantized models managed by lmdeploy or internlm, the parameters required for online conversion already exist in config.json, so you only need to pass the repo_id or a local path when using it.

> If config.json has not been updated yet, you also need to pass the `--model-name` parameter; please refer to 2).

```
repo_id=lmdeploy/qwen-chat-7b-4bit
# or
# repo_id=/path/to/managed_model

# Inference by TurboMind
lmdeploy chat turbomind $repo_id

# Serving with gradio
lmdeploy serve gradio $repo_id

# Serving with Restful API
lmdeploy serve api_server $repo_id --instance_num 32 --tp 1
```

### 2) Other popular LM models

For other popular models such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the model name needs to be passed in as well. The models supported by LMDeploy can be listed with `lmdeploy list`.

```
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```
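
If you are not sure which `--model-name` to pass, the supported model names can be printed first with the `lmdeploy list` command mentioned above:

```
lmdeploy list
```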

### 3) A model converted by `lmdeploy convert`

The usage is the same as before:

```
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME

# Inference by TurboMind
lmdeploy chat turbomind ./workspace

# Serving with gradio
lmdeploy serve gradio ./workspace

# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```
72 changes: 72 additions & 0 deletions docs/zh_cn/load_hf.md
@@ -0,0 +1,72 @@
# Load Hugging Face models directly

Before v0.0.14, to run inference or serving with LMDeploy you first had to use the `lmdeploy convert` command to convert the model offline into the format supported by the TurboMind inference engine. The converted model loads faster, but the workflow is not user-friendly. Therefore, LMDeploy adds online conversion and supports loading Hugging Face models directly.

## Supported model types

Currently, TurboMind supports loading three types of models:

1. Models quantized with lmdeploy and hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit)
2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
3. Models converted with the `lmdeploy convert` command (the legacy format)

## Usage

### 1) Quantized models managed by lmdeploy / internlm

For models managed by lmdeploy / internlm, config.json already contains the parameters needed for online conversion, so you only need to pass the repo_id or a local path when using them.

> If config.json has not been updated yet, you also need to pass the `--model-name` parameter; see 2).

```
repo_id=lmdeploy/qwen-chat-7b-4bit
# or
# repo_id=/path/to/managed_model

# Inference by TurboMind
lmdeploy chat turbomind $repo_id

# Serving with gradio
lmdeploy serve gradio $repo_id

# Serving with Restful API
lmdeploy serve api_server $repo_id --instance_num 32 --tp 1
```

### 2) Other LM models

For other popular models such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the model name needs to be passed in. The models supported by LMDeploy can be listed with `lmdeploy list`.

```
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```

### 3) Models converted with `lmdeploy convert`

The usage is the same as before:

```
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME

# Inference by TurboMind
lmdeploy chat turbomind ./workspace

# Serving with gradio
lmdeploy serve gradio ./workspace

# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```