Add more user-friendly CLI (#541)
* add

* import fire in main

* wrap to speed up fire cli

* update

* update docs

* update docs

* fix

* resolve comments

* resolve conflict and add test for CLI
RunningLeon authored Oct 25, 2023
1 parent 7283781 commit 169d516
Showing 33 changed files with 566 additions and 126 deletions.
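
In short, the commit replaces the `python -m` module entry points with `lmdeploy` subcommands; as reflected in the diffs below, the mapping is roughly:

- `python3 -m lmdeploy.serve.turbomind.deploy` -> `lmdeploy convert`
- `python3 -m lmdeploy.turbomind.chat` -> `lmdeploy chat turbomind`
- `python3 -m lmdeploy.pytorch.chat` -> `lmdeploy chat torch`
- `python3 -m lmdeploy.serve.gradio.app` -> `lmdeploy serve gradio`
- `python3 -m lmdeploy.serve.openai.api_server` -> `lmdeploy serve api_server`
- `python3 -m lmdeploy.serve.openai.api_client` -> `lmdeploy serve api_client`
- `python3 -m lmdeploy.serve.client` -> `lmdeploy serve triton_client`
- `python3 -m lmdeploy.lite.apis.calibrate` -> `lmdeploy lite calibrate`
- `python3 -m lmdeploy.lite.apis.kv_qparams` -> `lmdeploy lite kv_qparams`
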
20 changes: 10 additions & 10 deletions README.md
@@ -119,14 +119,14 @@ git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internl
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```

#### Inference by TurboMind

```shell
python -m lmdeploy.turbomind.chat ./workspace
lmdeploy chat turbomind ./workspace
```

> **Note**<br />
@@ -140,7 +140,7 @@ python -m lmdeploy.turbomind.chat ./workspace
#### Serving with gradio

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
lmdeploy serve gradio ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -150,23 +150,23 @@ python3 -m lmdeploy.serve.gradio.app ./workspace
Launch the inference server by:

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is what is printed by api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

or via the WebUI,

```shell
# restful_api_url is what is printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```
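
You can also query the server directly over HTTP. The snippet below is a minimal sketch, assuming the server exposes an OpenAI-style `/v1/chat/completions` route and runs on the example port 23333; the model name is a placeholder, so check the Swagger UI at `http://{server_ip}:{server_port}` for the exact routes and request schema:

```shell
# hypothetical request; adjust host, port, route and model name to what your server reports
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm-chat-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```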

Refer to [restful_api.md](docs/en/restful_api.md) for more details.
@@ -182,13 +182,13 @@ bash workspace/service_docker_up.sh
Then, you can communicate with the inference server by command line,

```shell
python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
lmdeploy serve triton_client {server_ip_address}:33337
```

or via the WebUI,

```shell
python3 -m lmdeploy.serve.gradio.app {server_ip_addresss}:33337
lmdeploy serve gradio {server_ip_address}:33337
```

For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, you can find the guide [here](docs/en/serving.md)
@@ -200,7 +200,7 @@ For detailed instructions on Inference pytorch models, see [here](docs/en/pytorc
#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperature 0.8 \
--top_p 0.95 \
20 changes: 10 additions & 10 deletions README_zh-CN.md
@@ -120,14 +120,14 @@ git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internl
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert the model to the format required by TurboMind. The default output path is ./workspace
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

```

#### Inference with TurboMind

```shell
python3 -m lmdeploy.turbomind.chat ./workspace
lmdeploy chat turbomind ./workspace
```

> **Note**<br />
@@ -140,7 +140,7 @@ python3 -m lmdeploy.turbomind.chat ./workspace
#### Launch the gradio server

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
lmdeploy serve gradio ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -150,23 +150,23 @@ python3 -m lmdeploy.serve.gradio.app ./workspace
Launch the inference server with the following command:

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
```

You can chat with the inference server from the command line:

```shell
# restful_api_url is what is printed by api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

You can also chat through the WebUI:

```shell
# restful_api_url is what is printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

See [restful_api.md](docs/zh_cn/restful_api.md) for more details.
@@ -182,13 +182,13 @@ bash workspace/service_docker_up.sh
You can chat with the inference server from the command line:

```shell
python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
lmdeploy serve triton_client {server_ip_address}:33337
```

You can also chat through the WebUI:

```shell
python3 -m lmdeploy.serve.gradio.app {server_ip_addresss}:33337
lmdeploy serve gradio {server_ip_address}:33337
```

For how to deploy other models, such as LLaMA, LLaMA-2, vicuna and so on, please refer to [here](docs/zh_cn/serving.md)
@@ -204,7 +204,7 @@ pip install deepspeed
#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL\
lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperature 0.8 \
--top_p 0.95 \
8 changes: 4 additions & 4 deletions docs/en/kv_int8.md
@@ -18,7 +18,7 @@ dequant: f = q * scale + zp
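
As a quick illustration of the `dequant` relation above (a worked example with assumed values scale = 0.05 and zp = 0.1, not taken from the docs), quantization is simply the inverse mapping:

$$
q = \operatorname{round}\left(\frac{f - zp}{\mathrm{scale}}\right), \qquad f \approx q \cdot \mathrm{scale} + zp
$$

For f = 1.337 this gives q = round(24.74) = 25, which dequantizes back to 25 × 0.05 + 0.1 = 1.35, a quantization error of about 0.013.
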
Convert the Hugging Face model format to the TurboMind inference format to create a workspace directory.

```bash
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
```

If you already have a workspace directory, skip this step.
@@ -29,15 +29,15 @@ Get the quantization parameters by these two steps:

```bash
# get minmax
python3 -m lmdeploy.lite.apis.calibrate \
lmdeploy lite calibrate \
--model $HF_MODEL \
--calib_dataset 'c4' \ # Support c4, ptb, wikitext2, pileval
--calib_samples 128 \ # Number of samples in the calibration set, if the memory is not enough, it can be adjusted appropriately
--calib_seqlen 2048 \ # Length of a single text, if the memory is not enough, you can adjust it appropriately
--work_dir $WORK_DIR \ # Directory for saving quantized statistical parameters and quantized weights in Pytorch format

# get quant parameters
python3 -m lmdeploy.lite.apis.kv_qparams \
lmdeploy lite kv_qparams \
--work_dir $WORK_DIR \ # Directory of the last output
--turbomind_dir workspace/triton_models/weights/ \ # Directory to save the quantization parameters
--kv_sym False \ # Symmetric or asymmetric quantization, default is False
@@ -64,7 +64,7 @@ Considering there are four combinations of kernels needed to be implemented, pre
Test the chat performance.

```bash
python3 -m lmdeploy.turbomind.chat ./workspace
lmdeploy chat turbomind ./workspace
```

## GPU Memory Test
6 changes: 3 additions & 3 deletions docs/en/pytorch.md
@@ -9,21 +9,21 @@ This submodule allows users to chat with a language model through the command line, and
**Example 1**: Chat with default setting

```shell
python -m lmdeploy.pytorch.chat $PATH_TO_HF_MODEL
lmdeploy chat torch $PATH_TO_HF_MODEL
```

**Example 2**: Disable sampling and chat history

```shell
python -m lmdeploy.pytorch.chat \
lmdeploy chat torch \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
--temperature 0 --max-history 0
```

**Example 3**: Accelerate with deepspeed inference

```shell
python -m lmdeploy.pytorch.chat \
lmdeploy chat torch \
$PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
--accel deepspeed
```
8 changes: 4 additions & 4 deletions docs/en/restful_api.md
@@ -3,7 +3,7 @@
### Launch Service

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace 0.0.0.0 server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1
```

Then, the user can open the Swagger UI at `http://{server_ip}:{server_port}` for detailed API usage.
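
Since the Swagger UI is generated from an OpenAPI schema, the machine-readable endpoint list can usually be fetched as well; this is an assumption about the server setup, so verify the path against your deployment:

```shell
# hypothetical check; the schema path may differ in your deployment
curl http://{server_ip}:{server_port}/openapi.json
```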
@@ -125,7 +125,7 @@ There is a client script for the RESTful API server.

```shell
# restful_api_url is what is printed by api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

### webui
@@ -135,8 +135,8 @@ You can also test the RESTful API through the WebUI.
```shell
# restful_api_url is what is printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

### FAQ
18 changes: 9 additions & 9 deletions docs/en/serving.md
@@ -8,7 +8,7 @@ You can download [llama-2 models from huggingface](https://huggingface.co/meta-l
<summary><b>7B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-7b-chat-hf
lmdeploy convert llama2 /path/to/llama-2-7b-chat-hf
bash workspace/service_docker_up.sh
```

@@ -18,7 +18,7 @@ bash workspace/service_docker_up.sh
<summary><b>13B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-13b-chat-hf --tp 2
lmdeploy convert llama2 /path/to/llama-2-13b-chat-hf --tp 2
bash workspace/service_docker_up.sh
```

@@ -28,7 +28,7 @@ bash workspace/service_docker_up.sh
<summary><b>70B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-70b-chat-hf --tp 8
lmdeploy convert llama2 /path/to/llama-2-70b-chat-hf --tp 8
bash workspace/service_docker_up.sh
```

@@ -42,7 +42,7 @@ Weights for the LLaMA models can be obtained by filling out [this form](htt
<summary><b>7B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-7b llama \
lmdeploy convert llama /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh
```
@@ -53,7 +53,7 @@ bash workspace/service_docker_up.sh
<summary><b>13B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-13b llama \
lmdeploy convert llama /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh
```
@@ -64,7 +64,7 @@ bash workspace/service_docker_up.sh
<summary><b>30B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-30b llama \
lmdeploy convert llama /path/to/llama-30b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh
```
@@ -75,7 +75,7 @@ bash workspace/service_docker_up.sh
<summary><b>65B</b></summary>

```shell
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-65b llama \
lmdeploy convert llama /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
@@ -94,7 +94,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1

python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-7b
lmdeploy convert vicuna /path/to/vicuna-7b
bash workspace/service_docker_up.sh
```

@@ -110,7 +110,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1

python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-13b
lmdeploy convert vicuna /path/to/vicuna-13b
bash workspace/service_docker_up.sh
```

18 changes: 9 additions & 9 deletions docs/en/supported_models/codellama.md
@@ -29,7 +29,7 @@ Based on the above table, download the model that meets your requirements. Execu
python3 -m pip install lmdeploy

# convert weight layout
python3 -m lmdeploy.serve.turbomind.deploy codellama /the/path/of/codellama/model
lmdeploy convert codellama /the/path/of/codellama/model
```

Then, you can communicate with codellama in the console by following the instructions in the next sections
@@ -42,13 +42,13 @@ Then, you can communicate with codellama in the console by following instructions in
### Completion

```shell
python3 -m lmdeploy.turbomind.chat ./workspace --cap completion
lmdeploy chat turbomind ./workspace --cap completion
```

### Infilling

```shell
python3 -m lmdeploy.turbomind.chat ./workspace --cap infilling
lmdeploy chat turbomind ./workspace --cap infilling
```

The input code is supposed to have a special placeholder `<FILL>`. For example,
@@ -64,15 +64,15 @@ And the generated code piece by `turbomind.chat` is the one to be filled in `<FI
### Chat

```
python3 -m lmdeploy.turbomind.chat ./workspace --cap chat --sys-instruct "Provide answers in Python"
lmdeploy chat turbomind ./workspace --cap chat --sys-instruct "Provide answers in Python"
```

The `--sys-instruct` instruction can be changed to other coding languages, as long as codellama supports them.
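
For instance, a hypothetical variation of the command above that asks for answers in another language (assuming codellama supports it):

```
lmdeploy chat turbomind ./workspace --cap chat --sys-instruct "Provide answers in JavaScript"
```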

### Python specialist

```
python3 -m lmdeploy.turbomind.chat ./workspace --cap python
lmdeploy chat turbomind ./workspace --cap python
```

The Python fine-tuned model is highly recommended when the 'python specialist' capability is required.
@@ -90,23 +90,23 @@ Launch the inference server by:
```shell
# --instance_num: number of instances to perform inference, which can be viewed as the max request concurrency
# --tp: the number of GPUs used in tensor parallelism
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
lmdeploy serve api_server ./workspace --server_name ${server_ip} --server_port ${server_port} --instance_num 32 --tp 1
```

Then, you can communicate with it by command line,

```shell
# restful_api_url is what is printed by api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
lmdeploy serve api_client restful_api_url
```

or through the WebUI after launching gradio,

```shell
# restful_api_url is what is printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```

For detailed information about the RESTful API, refer to [restful_api.md](../restful_api.md).