Support loading hf model directly #685

Merged · 33 commits · Nov 22, 2023
Commits (33)
0f7a35b
turbomind support export model params
irexyc Nov 10, 2023
542882a
fix overflow
irexyc Nov 10, 2023
9f30d4f
support turbomind.from_pretrained
irexyc Nov 10, 2023
8605521
fix tp
irexyc Nov 10, 2023
a050907
support AutoModel
irexyc Nov 13, 2023
a3f5fc5
support load kv qparams
irexyc Nov 13, 2023
4426c0a
update auto_awq
irexyc Nov 13, 2023
f24c905
update docstring
irexyc Nov 14, 2023
371320a
export lmdeploy version
irexyc Nov 14, 2023
fe81ce9
update doc
irexyc Nov 14, 2023
1c4db1e
remove download_hf_repo
irexyc Nov 15, 2023
85f0ed0
LmdeployForCausalLM -> LmdeployForCausalLM
irexyc Nov 15, 2023
5619b44
refactor turbomind.py
irexyc Nov 15, 2023
00b21d4
update comment
irexyc Nov 15, 2023
d868694
Merge remote-tracking branch 'origin/main' into from_pretrained2
irexyc Nov 15, 2023
a827412
add bfloat16 convert back
irexyc Nov 15, 2023
197133d
support gradio run_local load hf
irexyc Nov 15, 2023
47fd6a8
support restful api server load hf
irexyc Nov 15, 2023
8dd4876
add docs
irexyc Nov 15, 2023
ae67e87
support loading previous quantized model
irexyc Nov 15, 2023
db51a06
adapt pr 690
irexyc Nov 15, 2023
68962ce
update docs
irexyc Nov 16, 2023
c6176f3
resolve conflict in auto_awq.py
irexyc Nov 16, 2023
4c4ae26
not export turbomind config when quantize a model
irexyc Nov 17, 2023
2562724
check model_name when can not get it from config.json
irexyc Nov 17, 2023
f41dce4
update readme
irexyc Nov 17, 2023
7fa302c
remove model_name in auto_awq
irexyc Nov 20, 2023
4db0e25
Merge remote-tracking branch 'origin/main' into from_pretrained
irexyc Nov 20, 2023
4e82cdf
update
irexyc Nov 21, 2023
0f9c6f0
update
irexyc Nov 21, 2023
b470f06
update
irexyc Nov 22, 2023
d3c5d01
fix build
irexyc Nov 22, 2023
6ce951f
absolute import
irexyc Nov 22, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -58,6 +58,7 @@ work_dir*/
*.bin
*config.json
*generate_config.json
!lmdeploy/turbomind/hf_repo/config.json

# Pytorch
*.pth
52 changes: 9 additions & 43 deletions README.md
@@ -20,6 +20,7 @@ ______________________________________________________________________

## News 🎉

- \[2023/11\] TurboMind supports loading HF models directly. Click [here](./docs/en/load_hf.md) for details.
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
@@ -114,30 +115,18 @@ pip install lmdeploy

### Deploy InternLM

#### Get InternLM model
To use the TurboMind inference engine, the model must first be converted into TurboMind format. Both online and offline conversion are currently supported: with online conversion, TurboMind loads the Hugging Face model directly, whereas with offline conversion, you save the converted model first and then load it.

```shell
# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how to use TurboMind with online conversion. Refer to [load_hf.md](docs/en/load_hf.md) for other methods.
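
In short, the two paths differ only in what you pass to the CLI. A minimal comparison, built from the commands on this page with placeholder paths:

```shell
# online conversion: pass a huggingface repo id (or a local huggingface folder) directly
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b

# offline conversion: convert once (written to ./workspace by default), then load the workspace
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy chat turbomind ./workspace
```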

#### Inference by TurboMind

```shell
lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

> **Note**<br /> The internlm/internlm-chat-7b-v1_1 model will be downloaded to the `.cache` folder. You can also use a local path here.
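
For example, if the weights were already downloaded (e.g. via `git clone`), you can pass the local directory instead of the repo id; the download location can also be redirected with the standard Hugging Face `HF_HOME` cache variable. A sketch with hypothetical paths:

```shell
# load from a local copy instead of downloading again
lmdeploy chat turbomind /path/to/internlm-chat-7b-v1_1 --model-name internlm-chat-7b

# or keep the automatic download but point the huggingface cache somewhere else
export HF_HOME=/data/hf_cache
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```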

> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind. <br />
> It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
@@ -152,7 +141,7 @@ lmdeploy chat turbomind ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio ./workspace
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -165,13 +154,13 @@ Launch inference server by:
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is the address printed by api_server, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```

@@ -186,29 +175,6 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port

Refer to [restful_api.md](docs/en/restful_api.md) for more details.
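
Besides the CLI client and the Gradio UI, the server can be queried over plain HTTP. The sketch below assumes an OpenAI-style `/v1/chat/completions` route on the default port 23333; the route and payload are assumptions, so check [restful_api.md](docs/en/restful_api.md) for the endpoints your version actually exposes.

```shell
# hypothetical request; verify the route against restful_api.md for your version
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm-chat-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```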

#### Serving with Triton Inference Server

Launch inference server by:

```shell
bash workspace/service_docker_up.sh
```

Then, you can communicate with the inference server by command line,

```shell
python3 -m pip install tritonclient[grpc]
lmdeploy serve triton_client {server_ip_addresss}:33337
```

or webui,

```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```

For deploying other supported models, such as LLaMA, LLaMA-2, Vicuna, and so on, you can find the guide [here](docs/en/serving.md)

### Inference with PyTorch

For detailed instructions on running inference with PyTorch models, see [here](docs/en/pytorch.md).
52 changes: 9 additions & 43 deletions README_zh-CN.md
@@ -20,6 +20,7 @@ ______________________________________________________________________

## News 🎉

- \[2023/11\] TurboMind supports loading Hugging Face models directly. Click [here](./docs/en/load_hf.md) for usage details.
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, more than 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference supporting the sm_75 architecture
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
@@ -114,30 +115,18 @@ pip install lmdeploy

### Deploy InternLM

#### Get the InternLM model
To run inference with the TurboMind engine, the model must first be converted into TurboMind format. Both online and offline conversion are supported: online conversion loads the Hugging Face model directly, while offline conversion requires saving the converted model before loading it.

```shell
# 1. Download the InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert to the format required by TurboMind. The default output path is ./workspace
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how to use online conversion. Refer to [load_hf.md](docs/zh_cn/load_hf.md) for other methods.

#### Inference with TurboMind

```shell
lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

> **Note**<br /> internlm/internlm-chat-7b-v1_1 will be downloaded automatically to the `.cache` folder. A path to already-downloaded weights can also be passed here.

> **Note**<br />
> When inferring the InternLM-7B model with FP16 precision, TurboMind requires at least 15.7 GB of GPU memory. GPUs such as the 3090, V100, and A100 are recommended.<br />
> Disabling the GPU's ECC frees up about 10% of memory; run `sudo nvidia-smi --ecc-config=0` and reboot for it to take effect.
@@ -151,7 +140,7 @@ lmdeploy chat turbomind ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio ./workspace
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -164,13 +153,13 @@ lmdeploy serve gradio ./workspace
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
```

You can chat with the inference service from the command line:

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is the address printed by api_server, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```

@@ -185,29 +174,6 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port

Refer to [restful_api.md](docs/zh_cn/restful_api.md) for more details.

#### Serving via container

Launch the inference service with the following command:

```shell
bash workspace/service_docker_up.sh
```

You can chat with the inference service from the command line:

```shell
python3 -m pip install tritonclient[grpc]
lmdeploy serve triton_client {server_ip_addresss}:33337
```

or chat through the WebUI:

```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```

For deploying other models, such as LLaMA, LLaMA-2, Vicuna, and so on, refer to the guide [here](docs/zh_cn/serving.md)

### Inference with PyTorch

Make sure deepspeed is installed in your environment:
71 changes: 71 additions & 0 deletions docs/en/load_hf.md
@@ -0,0 +1,71 @@
# Load huggingface model directly

Starting from v0.1.0, TurboMind can pre-process model parameters on the fly while loading them from Hugging Face-style models.

## Supported model types

Currently, TurboMind supports loading three types of models:

1. An lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) or [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit)
2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
3. A model converted by `lmdeploy convert` (legacy format)

## Usage

### 1) An lmdeploy-quantized model

TurboMind can directly load models quantized by `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit).

```shell
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
# or
# repo_id=/path/to/downloaded_model

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```
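
To produce such a 4-bit model yourself, the quantization entry point is `lmdeploy lite auto_awq`. The flags below are an assumption about its interface, so consult the `lmdeploy.lite` documentation for your version before running it:

```shell
# hypothetical invocation of the AWQ quantizer in lmdeploy.lite; flag names may differ by version
lmdeploy lite auto_awq /path/to/internlm-chat-20b \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./internlm-chat-20b-4bit
```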

### 2) Other LM models

For other LM models, such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the usage is the same. The models supported by LMDeploy can be listed with `lmdeploy list`.
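
This is a quick way to check whether a model's chat template is built in; the names it prints are what `--model-name` expects:

```shell
# list the model names / chat templates that lmdeploy recognizes
lmdeploy list
```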

```shell
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```
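
For larger checkpoints, the same commands can be spread over several GPUs with the `--tp` flag. A sketch with an illustrative repo id and model name:

```shell
# serve a bigger chat model with 2-way tensor parallelism (repo id and name are illustrative)
lmdeploy serve api_server Qwen/Qwen-14B-Chat --model-name qwen-14b --instance_num 32 --tp 2
```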

### 3) A model converted by `lmdeploy convert`

Usage is the same as in the previous examples:

```shell
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME

# Inference by TurboMind
lmdeploy chat turbomind ./workspace

# Serving with gradio
lmdeploy serve gradio ./workspace

# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```
72 changes: 72 additions & 0 deletions docs/zh_cn/load_hf.md
@@ -0,0 +1,72 @@
# Load huggingface models directly

Starting from v0.1.0, TurboMind can directly load weights in Hugging Face format.

## Supported model types

Currently, TurboMind supports loading three types of models:

1. Models on huggingface.co quantized with lmdeploy, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit)
2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
3. Models converted with the `lmdeploy convert` command (compatible with the legacy format)

## Usage

### 1) Models quantized with lmdeploy

TurboMind can directly load models quantized with `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit) and [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit).

```shell
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b

# or
# repo_id=/path/to/downloaded_model

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```

### 2) Other LM models

Other LM models, such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, can be loaded in the same way. The models supported by LMDeploy can be listed with `lmdeploy list`.

```shell
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path

# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name

# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name

# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --instance_num 32 --tp 1
```

### 3) Models converted with `lmdeploy convert`

Usage is the same as before:

```shell
# Convert a model
lmdeploy convert /path/to/model ./workspace --model-name MODEL_NAME

# Inference by TurboMind
lmdeploy chat turbomind ./workspace

# Serving with gradio
lmdeploy serve gradio ./workspace

# Serving with Restful API
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```