Merge remote-tracking branch 'origin/main' into tm2
lzhangzz committed Nov 8, 2023
2 parents b7bf3d7 + 013000d · commit efe06ea
Showing 75 changed files with 4,338 additions and 2,333 deletions.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/1-bug-report.yml
@@ -25,6 +25,18 @@ body:
         A placeholder for the command.
     validations:
       required: true
+  - type: textarea
+    attributes:
+      label: Environment
+      description: |
+        1. Please run `lmdeploy check_env` to collect necessary environment information and paste it here.
+        2. You may add additional information that may be helpful for locating the problem, such as
+          - How you installed PyTorch \[e.g., pip, conda, source\]
+          - Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)
+      placeholder: Environment here.
+      render: Shell
+    validations:
+      required: true
   - type: textarea
     attributes:
       label: Error traceback
24 changes: 12 additions & 12 deletions README.md
@@ -52,7 +52,7 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by

 ## Supported Models

-`LMDeploy` has two inference backends, `Pytorch` and `TurboMind`.
+`LMDeploy` has two inference backends, `Pytorch` and `TurboMind`. You can run `lmdeploy list` to check the supported model names.

 ### TurboMind

@@ -119,14 +119,14 @@ git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internl
 GIT_LFS_SKIP_SMUDGE=1

 # 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
-python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
+lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

 ```

 #### Inference by TurboMind

 ```shell
-python -m lmdeploy.turbomind.chat ./workspace
+lmdeploy chat turbomind ./workspace
 ```

 > **Note**<br />
@@ -140,7 +140,7 @@ python -m lmdeploy.turbomind.chat ./workspace
 #### Serving with gradio

 ```shell
-python3 -m lmdeploy.serve.gradio.app ./workspace
+lmdeploy serve gradio ./workspace
 ```

 ![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -150,23 +150,23 @@ python3 -m lmdeploy.serve.gradio.app ./workspace
 Launch inference server by:

 ```shell
-python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
+lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
 ```

 Then, you can communicate with it by command line,

 ```shell
 # restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
-python -m lmdeploy.serve.openai.api_client restful_api_url
+lmdeploy serve api_client api_server_url
 ```

 or webui,

 ```shell
-# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
+# api_server_url is what printed in api_server.py, e.g. http://localhost:23333
 # server_ip and server_port here are for gradio ui
-# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
-python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
+# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
+lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
 ```

 Refer to [restful_api.md](docs/en/restful_api.md) for more details.
@@ -182,13 +182,13 @@ bash workspace/service_docker_up.sh
 Then, you can communicate with the inference server by command line,

 ```shell
-python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
+lmdeploy serve triton_client {server_ip_addresss}:33337
 ```

 or webui,

 ```shell
-python3 -m lmdeploy.serve.gradio.app {server_ip_addresss}:33337
+lmdeploy serve gradio {server_ip_addresss}:33337
 ```

 For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, you can find the guide from [here](docs/en/serving.md)
@@ -200,7 +200,7 @@ For detailed instructions on Inference pytorch models, see [here](docs/en/pytorc
 #### Single GPU

 ```shell
-python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
+lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
     --max_new_tokens 64 \
     --temperture 0.8 \
     --top_p 0.95 \
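For scripts that previously shelled out to `lmdeploy.serve.openai.api_client`, the RESTful server can also be called directly from Python. The sketch below is an illustration only, not part of this commit: it assumes the server started by `lmdeploy serve api_server` listens on the default http://localhost:23333 and exposes an OpenAI-style `/v1/chat/completions` route, and the model name is a placeholder; see docs/en/restful_api.md for the routes and fields the server actually accepts.

```python
# Minimal sketch of a Python client for the RESTful server (assumptions noted above).
import requests

api_server_url = 'http://localhost:23333'  # printed by api_server on startup

payload = {
    'model': 'internlm-chat-7b',  # hypothetical model name; use the one your server reports
    'messages': [{'role': 'user', 'content': 'Hi, how are you?'}],
    'temperature': 0.8,
    'max_tokens': 64,
}

# Assumes an OpenAI-compatible chat completions route; adjust to the documented API.
resp = requests.post(f'{api_server_url}/v1/chat/completions', json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()['choices'][0]['message']['content'])
```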
24 changes: 12 additions & 12 deletions README_zh-CN.md
@@ -53,7 +53,7 @@ LMDeploy 由 [MMDeploy](https://github.com/open-mmlab/mmdeploy) 和 [MMRazor](ht

 ## Supported Models

-`LMDeploy` supports two inference backends: `TurboMind` and `Pytorch`.
+`LMDeploy` supports two inference backends: `TurboMind` and `Pytorch`. Run `lmdeploy list` to see the supported models.

 ### TurboMind

@@ -120,14 +120,14 @@ git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internl
 GIT_LFS_SKIP_SMUDGE=1

 # 2. Convert the model to turbomind's required format. The default output path is ./workspace
-python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
+lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

 ```

 #### Inference with turbomind

 ```shell
-python3 -m lmdeploy.turbomind.chat ./workspace
+lmdeploy chat turbomind ./workspace
 ```

 > **Note**<br />
@@ -140,7 +140,7 @@ python3 -m lmdeploy.turbomind.chat ./workspace
 #### Launch the gradio server

 ```shell
-python3 -m lmdeploy.serve.gradio.app ./workspace
+lmdeploy serve gradio ./workspace
 ```

 ![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -150,23 +150,23 @@ python3 -m lmdeploy.serve.gradio.app ./workspace
 Launch the inference service with the following command:

 ```shell
-python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
+lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
 ```

 You can talk to the inference service from the command line:

 ```shell
 # restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
-python -m lmdeploy.serve.openai.api_client restful_api_url
+lmdeploy serve api_client api_server_url
 ```

 You can also chat through the WebUI:

 ```shell
-# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
+# api_server_url is what printed in api_server.py, e.g. http://localhost:23333
 # server_ip and server_port here are for gradio ui
-# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
-python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
+# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
+lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
 ```

 Refer to [restful_api.md](docs/zh_cn/restful_api.md) for more details.
@@ -182,13 +182,13 @@ bash workspace/service_docker_up.sh
 You can talk to the inference service from the command line:

 ```shell
-python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
+lmdeploy serve triton_client {server_ip_addresss}:33337
 ```

 You can also chat through the WebUI:

 ```shell
-python3 -m lmdeploy.serve.gradio.app {server_ip_addresss}:33337
+lmdeploy serve gradio {server_ip_addresss}:33337
 ```

 For deploying other models, such as LLaMA, LLaMA-2, vicuna and so on, please refer to [this guide](docs/zh_cn/serving.md)
@@ -204,7 +204,7 @@ pip install deepspeed
 #### Single GPU

 ```shell
-python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL\
+lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL\
     --max_new_tokens 64 \
     --temperture 0.8 \
     --top_p 0.95 \
2 changes: 1 addition & 1 deletion benchmark/README.md
@@ -30,7 +30,7 @@ pip install nvidia-ml-py
 ```bash
 python profile_generation.py \
     --model-path /path/to/your/model \
-    --concurrency 1 8 --prompt-tokens 0 512 --completion-tokens 2048 512
+    --concurrency 1 8 --prompt-tokens 1 512 --completion-tokens 2048 512
 ```

 ## profile serving
6 changes: 4 additions & 2 deletions benchmark/profile_generation.py
@@ -90,7 +90,7 @@ def _infer(model, session_id):

 def profile_throughput(model_path: str,
                        concurrency: int = 1,
-                       input_seqlen: int = 0,
+                       input_seqlen: int = 1,
                        output_seqlen: int = 512,
                        test_round: int = 10,
                        tp: int = 1):
@@ -99,8 +99,10 @@ def profile_throughput(model_path: str,
     tm_model = TurboMind(model_path=model_path, tp=tp)

     # make up a prompt that can be tokenized into {input_seqlen} tokens
-    prompt = '' if input_seqlen == 0 else 'hi' + ' hi' * (input_seqlen - 1)
+    assert input_seqlen > 0, 'input_seqlen should > 0'
+    prompt = 'hi'
     input_ids = tokenizer.encode(prompt)
+    input_ids = input_ids * input_seqlen

     warmup(tm_model, concurrency, input_ids, output_seqlen)

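The change above replaces the string-built prompt (`'hi' + ' hi' * (input_seqlen - 1)`) with a repeated token-id list, presumably so the request contains exactly `input_seqlen` input tokens regardless of how the tokenizer splits or prefixes the text; it also matches the `--prompt-tokens 1 ...` update in benchmark/README.md. Below is a minimal, self-contained sketch of the new length logic, using a stand-in tokenizer (an assumption for illustration, since the real script builds its tokenizer from the model path).

```python
class FakeTokenizer:
    """Stand-in tokenizer: every whitespace-separated word becomes one id."""

    def encode(self, text):
        return [42 for _ in text.split()]


def make_input_ids(tokenizer, input_seqlen):
    # Mirrors the updated profile_throughput logic: repeat the ids of a
    # one-token prompt so the input length is exact by construction.
    assert input_seqlen > 0, 'input_seqlen should be > 0'
    input_ids = tokenizer.encode('hi')
    return input_ids * input_seqlen


tok = FakeTokenizer()
print(len(make_input_ids(tok, 512)))  # -> 512
```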
(The remaining 70 changed files are not shown here.)