Support loading hf model directly (#685)
* turbomind support export model params

* fix overflow

* support turbomind.from_pretrained

* fix tp

* support AutoModel

* support load kv qparams

* update auto_awq

* update docstring

* export lmdeploy version

* update doc

* remove download_hf_repo

* LmdeployForCausalLM -> LmdeployForCausalLM

* refactor turbomind.py

* update comment

* add bfloat16 convert back

* support gradio run_local load hf

* support restful api server load hf

* add docs

* support loading previous quantized model

* adapt pr 690

* update docs

* not export turbomind config when quantize a model

* check model_name when can not get it from config.json

* update readme

* remove model_name in auto_awq

* update

* update

* update

* fix build

* absolute import
irexyc authored Nov 22, 2023
1 parent 42e57c8 commit 6b00f62
Showing 29 changed files with 1,196 additions and 232 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -58,6 +58,7 @@ work_dir*/
*.bin
*config.json
*generate_config.json
!lmdeploy/turbomind/hf_repo/config.json

# Pytorch
*.pth
52 changes: 9 additions & 43 deletions README.md
@@ -20,6 +20,7 @@ ______________________________________________________________________

## News 🎉

- \[2023/11\] TurboMind supports loading Huggingface models directly. Click [here](./docs/en/load_hf.md) for details.
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
@@ -114,30 +115,18 @@ pip install lmdeploy
### Deploy InternLM

#### Get InternLM model
To use the TurboMind inference engine, the model must first be converted into TurboMind format. Both online and offline conversion are currently supported: with online conversion, TurboMind loads the Huggingface model directly, while with offline conversion you save the converted model first and then load it.

```shell
# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how to use TurboMind with online conversion. Refer to [load_hf.md](docs/en/load_hf.md) for other methods.

#### Inference by TurboMind

```shell
lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

> **Note**<br /> The internlm/internlm-chat-7b-v1_1 model will be downloaded to the `.cache` folder. You can also use a local path here.
> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind. <br />
> It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
@@ -152,7 +141,7 @@ lmdeploy chat turbomind ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio ./workspace
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -165,13 +154,13 @@ Launch inference server by:
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
# api_server_url is the URL printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```
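
You can also call the server's RESTful endpoints from your own code. The snippet below is only a sketch that assumes the OpenAI-style `/v1/chat/completions` route described in [restful_api.md](docs/en/restful_api.md); check that document for the exact routes and request schema.

```python
# Sketch only: the endpoint path and payload fields are assumptions taken from
# restful_api.md; consult that document for the exact schema.
import json
import urllib.request

api_server_url = 'http://localhost:23333'  # printed when api_server starts
payload = {
    'model': 'internlm-chat-7b',
    'messages': [{'role': 'user', 'content': 'Hi, please introduce yourself'}],
}
req = urllib.request.Request(
    f'{api_server_url}/v1/chat/completions',
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'})
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body['choices'][0]['message']['content'])
```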

@@ -186,29 +175,6 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port

Refer to [restful_api.md](docs/en/restful_api.md) for more details.

#### Serving with Triton Inference Server

Launch inference server by:

```shell
bash workspace/service_docker_up.sh
```

Then, you can communicate with the inference server by command line,

```shell
python3 -m pip install tritonclient[grpc]
lmdeploy serve triton_client {server_ip_addresss}:33337
```

or webui,

```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```

For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna and so on, you can find the guide [here](docs/en/serving.md)

### Inference with PyTorch

For detailed instructions on inference with PyTorch models, see [here](docs/en/pytorch.md).
52 changes: 9 additions & 43 deletions README_zh-CN.md
@@ -20,6 +20,7 @@ ______________________________________________________________________

## News 🎉

- \[2023/11\] TurboMind supports loading Huggingface models directly. Click [here](./docs/en/load_hf.md) for usage instructions.
- \[2023/11\] Major TurboMind upgrade, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference supporting the sm_75 architecture
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports the InternLM-20B model
@@ -114,30 +115,18 @@ pip install lmdeploy
### Deploy InternLM

#### Get the InternLM model
Inference with TurboMind requires the model to be converted into TurboMind format first. Both online and offline conversion are currently supported: online conversion loads the Huggingface model directly, while offline conversion requires saving the converted model before loading it.

```shell
# 1. Download the InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert to the format required by TurboMind. The default output path is ./workspace
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how to use online conversion. For other methods, refer to [load_hf.md](docs/zh_cn/load_hf.md).

#### Inference with TurboMind

```shell
lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

> **Note**<br /> The internlm/internlm-chat-7b-v1_1 model is downloaded automatically to the `.cache` folder; you can also pass a local path to an already-downloaded model.
> **Note**<br />
> When inferring the InternLM-7B model with FP16 precision, TurboMind requires at least 15.7 GB of GPU memory. GPUs such as the 3090, V100 and A100 are recommended.<br />
> Disabling the GPU's ECC frees up 10% of GPU memory; run `sudo nvidia-smi --ecc-config=0` and reboot the system for it to take effect.
@@ -151,7 +140,7 @@ lmdeploy chat turbomind ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio ./workspace
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -164,13 +153,13 @@ lmdeploy serve gradio ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
```

You can chat with the inference server from the command line:

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
# api_server_url is the URL printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```

@@ -185,29 +174,6 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port

See [restful_api.md](docs/zh_cn/restful_api.md) for more details.

#### Deploy the inference service in a container

Launch the inference server with the following command:

```shell
bash workspace/service_docker_up.sh
```

You can chat with the inference server from the command line:

```shell
python3 -m pip install tritonclient[grpc]
lmdeploy serve triton_client {server_ip_addresss}:33337
```

You can also chat through the WebUI:

```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```

For deploying other models, such as LLaMA, LLaMA-2, Vicuna and so on, please see [here](docs/zh_cn/serving.md)

### Inference with PyTorch

Make sure deepspeed is installed in your environment:
71 changes: 71 additions & 0 deletions docs/en/load_hf.md
@@ -0,0 +1,71 @@
# Load Huggingface models directly

Starting from v0.1.0, TurboMind can pre-process model parameters on the fly while loading them from Huggingface-style models.
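
Besides the CLI commands shown below, the same loading path can be driven from Python. The snippet is only a sketch: `turbomind.from_pretrained` is named in this change, but the module path, attribute names, and generator arguments used here are assumptions; check the installed lmdeploy version for the exact API.

```python
# Sketch only: attribute names and keyword arguments are assumptions based on
# the `turbomind.from_pretrained` support added in this change; check the
# installed lmdeploy version for the exact API.
from lmdeploy import turbomind as tm

# Pass a huggingface repo id (or a local path); the weights are converted to
# turbomind's layout on the fly instead of via an offline `lmdeploy convert`.
tm_model = tm.TurboMind.from_pretrained('internlm/internlm-chat-7b-v1_1',
                                        model_name='internlm-chat-7b')
generator = tm_model.create_instance()

# Tokenize a prompt and stream generated token ids back from the engine.
input_ids = tm_model.tokenizer.encode('Hi, please introduce yourself')
for outputs in generator.stream_infer(session_id=0,
                                      input_ids=[input_ids],
                                      request_output_len=512):
    output_ids, _ = outputs[0]
print(tm_model.tokenizer.decode(output_ids))
```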

## Supported model types

Currently, TurboMind supports loading three types of models:

1. An lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.
2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
3. A model converted by `lmdeploy convert` (the legacy format)

## Usage

### 1) An lmdeploy-quantized model

For models quantized by `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), pass the Huggingface repo id or a local path directly:

```shell
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
# or
# repo_id=/path/to/downloaded_model
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```

### 2) Other LM models

For other LM models, such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the usage is the same. The models supported by LMDeploy can be listed with `lmdeploy list`.

```shell
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```

### 3) A model converted by `lmdeploy convert`

The usage is the same as before:

```shell
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME
# Inference by TurboMind
lmdeploy chat turbomind ./workspace
# Serving with gradio
lmdeploy serve gradio ./workspace
# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```
72 changes: 72 additions & 0 deletions docs/zh_cn/load_hf.md
@@ -0,0 +1,72 @@
# Load Huggingface models directly

Starting from v0.1.0, TurboMind can load Huggingface-format weights directly.

## Supported types

Currently, TurboMind supports loading three types of models:

1. Models on huggingface.co quantized by lmdeploy, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit)
2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
3. Models converted with the `lmdeploy convert` command (the legacy format is still supported)

## Usage

### 1) A model quantized by lmdeploy

TurboMind can directly load models quantized by `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit).

```shell
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
# or
# repo_id=/path/to/downloaded_model
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```

### 2) Other LM models

For other LM models, such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the usage is the same. The models supported by LMDeploy can be listed with `lmdeploy list`.

```shell
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```

### 3) A model converted by `lmdeploy convert`

The usage is the same as before:

```shell
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME
# Inference by TurboMind
lmdeploy chat turbomind ./workspace
# Serving with gradio
lmdeploy serve gradio ./workspace
# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```