Support serving with gradio without communicating to TIS (#162)
* use local model for webui

* local model for app.py

* lint

* remove print

* add seed

* comments

* fixed session_id

* support turbomind batch inference

* update app.py

* lint and docstring

* move webui to serve/gradio

* update doc

* update doc

* update docstring and remove print conversation

* log

* Update docs/zh_cn/build.md

Co-authored-by: Chen Xin <[email protected]>

* Update docs/en/build.md

Co-authored-by: Chen Xin <[email protected]>

* use latest gradio

* fix

* replace partial with InterFace

* use host ip instead of cookie

---------

Co-authored-by: Chen Xin <[email protected]>
AllentDan and irexyc authored Aug 4, 2023
1 parent 7a2128b commit 18c386d
Showing 8 changed files with 430 additions and 182 deletions.
18 changes: 11 additions & 7 deletions README.md
@@ -50,11 +50,9 @@ And the request throughput of TurboMind is 30% higher than vLLM.

### Installation

Below are quick steps for installation:
Install lmdeploy with pip (python 3.8+) or [from source](./docs/en/build.md)

```shell
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy
```
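
A quick way to confirm the wheel installed correctly is to import it from Python. This is a minimal sketch; printing `lmdeploy.__version__` assumes the package exposes that attribute, and the bare import alone is already a sufficient check:

```shell
# bare import is the real check; __version__ is assumed to exist
python3 -c "import lmdeploy; print(lmdeploy.__version__)"
```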

@@ -92,7 +90,15 @@ python -m lmdeploy.turbomind.chat ./workspace
> **Note**<br />
> Tensor parallelism can be used to perform inference on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP.
#### Serving
#### Serving with gradio

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
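
The command above launches the web UI directly on the local TurboMind model in `./workspace`, which is the point of this change: no Triton Inference Server is required. For contrast, both invocations that appear in this README diff are gathered below (the `:33337` endpoint form is the one used in the Triton section further down):

```shell
# serve a local TurboMind workspace directly -- no Triton Inference Server involved
python3 -m lmdeploy.serve.gradio.app ./workspace

# or point the same web UI at a running Triton Inference Server endpoint
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
```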

#### Serving with Triton Inference Server

Launch the inference server with:

@@ -109,11 +115,9 @@ python3 -m lmdeploy.serve.client {server_ip_address}:33337
or webui,

```shell
python3 -m lmdeploy.app {server_ip_address}:33337
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, you can find the guide [here](docs/en/serving.md).

### Inference with PyTorch
18 changes: 12 additions & 6 deletions README_zh-CN.md
@@ -51,9 +51,9 @@ TurboMind's output token throughput exceeds 2000 token/s; overall, compared with DeepSpeed

### Installation

Install LMDeploy with pip (python 3.8+), or [install from source](./docs/zh_cn/build.md)

```shell
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy
```

@@ -90,7 +90,15 @@ python3 -m lmdeploy.turbomind.chat ./workspace
> **Note**<br />
> Tensor parallelism can be used to perform inference on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP.
#### Deploy the inference service
#### Launch the gradio server

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

#### Deploy the inference service with a container

Launch the inference service with the following command:

@@ -107,11 +115,9 @@ python3 -m lmdeploy.serve.client {server_ip_address}:33337
You can also chat through the WebUI:

```shell
python3 -m lmdeploy.app {server_ip_address}:33337
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

For the deployment of other models, such as LLaMA, LLaMA-2, vicuna and so on, please refer to the guide [here](docs/zh_cn/serving.md).

### Inference with PyTorch
26 changes: 26 additions & 0 deletions docs/en/build.md
@@ -0,0 +1,26 @@
## Build from source

- Make sure the local gcc version is no less than 9, which can be confirmed by `gcc --version`.
- Install the packages required for compiling and running:
```shell
pip install -r requirements.txt
```
- Install [nccl](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html) and set the environment variables:
```shell
export NCCL_ROOT_DIR=/path/to/nccl/build
export NCCL_LIBRARIES=/path/to/nccl/build/lib
```
- Install rapidjson
- Install openmpi; installing from source is recommended:
```shell
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
tar -xzf openmpi-*.tar.gz && cd openmpi-*
./configure --with-cuda
make -j$(nproc)
make install
```
- Build and install lmdeploy (see the follow-up note below):
```shell
mkdir build && cd build
sh ../generate.sh
```
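
Note that `generate.sh` only configures the build; presumably it wraps a cmake invocation that emits build files into `./build`. The compile and install step is assumed to follow the usual pattern and should be adjusted to whatever the script actually produces:

```shell
# assumption: generate.sh produces Makefiles in ./build; build and install them
make -j$(nproc) && make install
```
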
26 changes: 26 additions & 0 deletions docs/zh_cn/build.md
@@ -0,0 +1,26 @@
### Build from source

- Make sure the gcc version on the host machine is no less than 9, which can be confirmed with `gcc --version`.
- Install the packages required for compiling and running:
```shell
pip install -r requirements.txt
```
- Install [nccl](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html) and set the environment variables:
```shell
export NCCL_ROOT_DIR=/path/to/nccl/build
export NCCL_LIBRARIES=/path/to/nccl/build/lib
```
- Install rapidjson
- Install openmpi; installing from source is recommended:
```shell
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
tar -xzf openmpi-*.tar.gz && cd openmpi-*
./configure --with-cuda
make -j$(nproc)
make install
```
- Build and install lmdeploy:
```shell
mkdir build && cd build
sh ../generate.sh
```
169 changes: 0 additions & 169 deletions lmdeploy/app.py

This file was deleted.

1 change: 1 addition & 0 deletions lmdeploy/serve/gradio/__init__.py
@@ -0,0 +1 @@
# Copyright (c) OpenMMLab. All rights reserved.