Support serving with gradio without communicating to TIS (#162)
* use local model for webui

* local model for app.py

* lint

* remove print

* add seed

* comments

* fixed session_id

* support turbomind batch inference

* update app.py

* lint and docstring

* move webui to serve/gradio

* update doc

* update doc

* update docstring and remove print conversation

* log

* Update docs/zh_cn/build.md

Co-authored-by: Chen Xin <[email protected]>

* Update docs/en/build.md

Co-authored-by: Chen Xin <[email protected]>

* use latest gradio

* fix

* replace partial with InterFace

* use host ip instead of cookie

---------

Co-authored-by: Chen Xin <[email protected]>
AllentDan and irexyc authored Aug 4, 2023
1 parent 7a2128b commit 18c386d
Showing 8 changed files with 430 additions and 182 deletions.
18 changes: 11 additions & 7 deletions README.md
@@ -50,11 +50,9 @@ And the request throughput of TurboMind is 30% higher than vLLM.

### Installation

Below are quick steps for installation:
Install lmdeploy with pip (python 3.8+) or [from source](./docs/en/build.md)

```shell
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy
```
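
A quick way to confirm the wheel installed correctly is to import it from Python. This is a minimal sketch; printing `lmdeploy.__version__` assumes the package exposes that attribute, and the bare import alone is already a sufficient check:

```shell
# bare import is the real check; __version__ is assumed to exist
python3 -c "import lmdeploy; print(lmdeploy.__version__)"
```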

@@ -92,7 +90,15 @@ python -m lmdeploy.turbomind.chat ./workspace
> **Note**<br />
> Tensor parallelism can be used to perform inference on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP.
#### Serving
#### Serving with gradio

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
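
The command above launches the web UI directly on the local TurboMind model in `./workspace`, which is the point of this change: no Triton Inference Server is required. For contrast, both invocations that appear in this README diff are gathered below (the `:33337` endpoint form is the one used in the Triton section further down):

```shell
# serve a local TurboMind workspace directly -- no Triton Inference Server involved
python3 -m lmdeploy.serve.gradio.app ./workspace

# or point the same web UI at a running Triton Inference Server endpoint
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
```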

#### Serving with Triton Inference Server

Launch the inference server with:

@@ -109,11 +115,9 @@ python3 -m lmdeploy.serve.client {server_ip_address}:33337
or webui,

```shell
python3 -m lmdeploy.app {server_ip_address}:33337
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, you can find the guide [here](docs/en/serving.md).

### Inference with PyTorch
18 changes: 12 additions & 6 deletions README_zh-CN.md
@@ -51,9 +51,9 @@ TurboMind's output token throughput exceeds 2000 token/s; overall, compared with DeepSpeed

### Installation

Install LMDeploy with pip (python 3.8+), or [install from source](./docs/zh_cn/build.md)

```shell
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy
```

@@ -90,7 +90,15 @@ python3 -m lmdeploy.turbomind.chat ./workspace
> **Note**<br />
> Tensor parallelism can be used to perform inference on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP.
#### Deploy the inference service
#### Launch the gradio server

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

#### Deploy the inference service with a container

Launch the inference service with the following command:

@@ -107,11 +115,9 @@ python3 -m lmdeploy.serve.client {server_ip_address}:33337
You can also chat through the WebUI:

```shell
python3 -m lmdeploy.app {server_ip_address}:33337
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

For the deployment of other models, such as LLaMA, LLaMA-2, vicuna and so on, please refer to the guide [here](docs/zh_cn/serving.md).

### Inference with PyTorch
26 changes: 26 additions & 0 deletions docs/en/build.md
@@ -0,0 +1,26 @@
## Build from source

- Make sure the local gcc version is no less than 9, which can be confirmed by `gcc --version`.
- Install the packages required for compiling and running:
```shell
pip install -r requirements.txt
```
- Install [nccl](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html) and set the environment variables:
```shell
export NCCL_ROOT_DIR=/path/to/nccl/build
export NCCL_LIBRARIES=/path/to/nccl/build/lib
```
- Install rapidjson
- Install openmpi; installing from source is recommended:
```shell
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
tar -xzf openmpi-*.tar.gz && cd openmpi-*
./configure --with-cuda
make -j$(nproc)
make install
```
- Build and install lmdeploy (see the follow-up note below):
```shell
mkdir build && cd build
sh ../generate.sh
```
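
Note that `generate.sh` only configures the build; presumably it wraps a cmake invocation that emits build files into `./build`. The compile and install step is assumed to follow the usual pattern and should be adjusted to whatever the script actually produces:

```shell
# assumption: generate.sh produces Makefiles in ./build; build and install them
make -j$(nproc) && make install
```
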
26 changes: 26 additions & 0 deletions docs/zh_cn/build.md
@@ -0,0 +1,26 @@
### Build from source

- Make sure the gcc version on the host machine is no less than 9, which can be confirmed with `gcc --version`.
- Install the packages required for compiling and running:
```shell
pip install -r requirements.txt
```
- Install [nccl](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html) and set the environment variables:
```shell
export NCCL_ROOT_DIR=/path/to/nccl/build
export NCCL_LIBRARIES=/path/to/nccl/build/lib
```
- Install rapidjson
- Install openmpi; installing from source is recommended:
```shell
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
tar -xzf openmpi-*.tar.gz && cd openmpi-*
./configure --with-cuda
make -j$(nproc)
make install
```
- Build and install lmdeploy:
```shell
mkdir build && cd build
sh ../generate.sh
```
169 changes: 0 additions & 169 deletions lmdeploy/app.py

This file was deleted.

1 change: 1 addition & 0 deletions lmdeploy/serve/gradio/__init__.py
@@ -0,0 +1 @@
# Copyright (c) OpenMMLab. All rights reserved.