Add request distributor server #903

Merged
merged 13 commits on Jan 12, 2024
Changes from 7 commits
1 change: 1 addition & 0 deletions .gitignore
@@ -77,3 +77,4 @@ work_dir*/
*.pkl

!CMakeLists.txt
proxy_config.yml
1 change: 1 addition & 0 deletions docs/en/index.rst
@@ -46,6 +46,7 @@ Welcome to LMDeploy's tutorials!
:maxdepth: 1
:caption: serving

serving/proxy_server.md
serving/restful_api.md

.. _quantization:
39 changes: 39 additions & 0 deletions docs/en/serving/proxy_server.md
@@ -0,0 +1,39 @@
## Proxy
Collaborator: H1 title: "Request Distributor Server"


The proxy service runs multiple api_server services in parallel behind a single endpoint. Users only need to access the proxy URL to reach the different api_server services indirectly; the proxy distributes requests across them automatically, achieving load balancing.

### Startup

Start the proxy service:

```shell
python lmdeploy/serve/proxy/proxy.py --server_name {server_name} --server_port {server_port} --strategy "min_expected_latency"
```

Collaborator: Can it be written this way instead? `python3 -m lmdeploy.serve.proxy --server-name {server_name} --server-port`

Once startup succeeds, the script also prints the URL of the proxy service. Open this URL in your browser to access the Swagger UI.

### API

Through the Swagger UI, we can see multiple APIs. Those related to api_server node management include:

- /nodes/status
- /nodes/add
- /nodes/remove

They are used, respectively, to view all api_server service nodes, add a node, and remove a node.
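
As a quick illustration, the following snippet registers an api_server node with the proxy and then queries the node list using Python's `requests` library. This is a sketch only: the proxy address and the `url` payload field are assumptions for illustration; check the Swagger UI of your deployment for the exact request schema of these endpoints.

```python
# Hypothetical sketch: the payload field ("url") and the addresses below are
# assumptions; consult the proxy's Swagger UI for the exact endpoint schema.
import requests

PROXY = 'http://0.0.0.0:8000'  # the {server_name}:{server_port} given at startup

# Register a running api_server instance with the proxy.
resp = requests.post(f'{PROXY}/nodes/add', json={'url': 'http://0.0.0.0:23333'})
print(resp.status_code)

# View all api_server nodes currently known to the proxy.
print(requests.get(f'{PROXY}/nodes/status').json())
```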

APIs related to usage include:

- /v1/models
- /v1/chat/completions
- /v1/completions

The usage of these APIs is the same as that of api_server.
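
For example, a chat request can be sent through the proxy exactly as it would be sent to a single api_server. The sketch below uses the `requests` library; the proxy address and model name are placeholders, not values taken from this PR.

```python
# Minimal sketch: the proxy address and model name are placeholders; the request
# body follows the same OpenAI-style schema that api_server accepts.
import requests

PROXY = 'http://0.0.0.0:8000'

payload = {
    'model': 'internlm-chat-7b',
    'messages': [{'role': 'user', 'content': 'Hello!'}],
}
resp = requests.post(f'{PROXY}/v1/chat/completions', json=payload)
print(resp.json())
```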

### Dispatch Strategy

The current dispatch strategies of the proxy service are as follows:

- random: weighted random dispatch based on the request-processing capacity the user provides for each api_server node. The higher a node's declared throughput, the more likely it is to receive a request. Nodes without a declared throughput are assigned the average throughput of the other nodes.
- min_expected_latency: estimates, for each node, the time needed to complete a response from the requests currently queued on it and its throughput, then dispatches to the node with the shortest expected time. Nodes without a declared throughput are handled as above.
- min_observed_latency: dispatches to the node with the shortest average latency over a fixed number of its most recent requests.
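
To make the differences concrete, here is a small illustrative sketch of the three strategies. It is not the proxy's actual implementation, and the node fields (`queued`, `throughput`, `latencies`) are invented for the example.

```python
# Illustrative sketch of the three dispatch strategies; not the proxy's real code.
# Each node dict uses made-up fields: queued requests, declared throughput
# (requests/s), and latencies (seconds) of recently finished requests.
import random

def pick_random(nodes):
    # Weighted random choice: higher declared throughput -> higher probability.
    weights = [n['throughput'] for n in nodes]
    return random.choices(range(len(nodes)), weights=weights, k=1)[0]

def pick_min_expected_latency(nodes):
    # Expected time to drain each node's queue given its throughput.
    return min(range(len(nodes)),
               key=lambda i: nodes[i]['queued'] / nodes[i]['throughput'])

def pick_min_observed_latency(nodes):
    # Average latency over each node's recent requests.
    return min(range(len(nodes)),
               key=lambda i: sum(nodes[i]['latencies']) / len(nodes[i]['latencies']))

nodes = [
    {'queued': 4, 'throughput': 2.0, 'latencies': [1.2, 1.0, 1.1]},
    {'queued': 1, 'throughput': 1.0, 'latencies': [0.8, 0.9, 0.7]},
]
print(pick_random(nodes), pick_min_expected_latency(nodes), pick_min_observed_latency(nodes))
```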
4 changes: 4 additions & 0 deletions docs/en/serving/restful_api.md
@@ -160,3 +160,7 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port
4. The `/v1/chat/interactive` api disables multi-round conversation by default. The input argument `prompt` can be either a single string or an entire chat history.

5. If you need to adjust other default session parameters, such as the content of fields like `system`, you can directly pass in the initialization parameters of the [dialogue template](https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/model.py). For example, for the internlm-chat-7b model, you can set the `--meta_instruction` parameter when starting the `api_server`.

### multiple services
Collaborator: request distribution service


Please refer to our [proxy service](./proxy_server.md).
1 change: 1 addition & 0 deletions docs/zh_cn/index.rst
@@ -47,6 +47,7 @@
:maxdepth: 1
:caption: 服务

serving/proxy_server.md
serving/restful_api.md


39 changes: 39 additions & 0 deletions docs/zh_cn/serving/proxy_server.md
@@ -0,0 +1,39 @@
## Proxy

The proxy service connects multiple api_server services in parallel. Users only need to access the proxy URL to reach the different api_server services indirectly; the proxy automatically distributes requests internally, achieving load balancing.

### Startup

Start the proxy service:

```shell
python lmdeploy/serve/proxy/proxy.py --server_name {server_name} --server_port {server_port} --strategy "min_expected_latency"
```

Once startup succeeds, the script also prints the URL of the proxy service. Open this URL in a browser to access the Swagger UI.

### API

Through the Swagger UI, we can see multiple APIs. Those related to api_server node management are:

- /nodes/status
- /nodes/add
- /nodes/remove

They are used, respectively, to view all api_server service nodes, add a node, and remove a node.

The APIs related to usage are:

- /v1/models
- /v1/chat/completions
- /v1/completions

These APIs are used in the same way as those of api_server.

### Dispatch Strategy

The current dispatch strategies of the proxy service are as follows:

- random: weighted random dispatch based on the request-processing capacity the user provides for each api_server node. The higher a node's declared throughput, the more likely it is to receive a request. Nodes without a declared throughput are assigned the average throughput of the other nodes.
- min_expected_latency: estimates, for each node, the time needed to complete a response from the requests currently queued on it and its throughput, then dispatches to the node with the shortest expected time. Nodes without a declared throughput are handled as above.
- min_observed_latency: dispatches to the node with the shortest average latency over a fixed number of its most recent requests.
4 changes: 4 additions & 0 deletions docs/zh_cn/serving/restful_api.md
@@ -154,3 +154,7 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port
4. The `/v1/chat/interactive` api supports multi-round conversation, but it is disabled by default. The `messages` or `prompt` argument can be either a simple string representing a single user question or a full chat history.

5. To adjust other default session parameters, such as the content of fields like `system`, you can directly pass in the initialization parameters of the [dialogue template](https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/model.py). For example, for the internlm-chat-7b model, set the `--meta_instruction` parameter when starting the `api_server`.

### Multiple services in parallel
Collaborator: multi-machine parallel service


Please refer to our [proxy service](./proxy_server.md).
39 changes: 39 additions & 0 deletions lmdeploy/constants.py
@@ -0,0 +1,39 @@
# Copyright (c) OpenMMLab. All rights reserved.
Collaborator: Is it better to put constants.py to lmdeploy/serve/proxy?


import enum

# Length of the per-node deque that records recent request latencies.
LATENCY_DEEQUE_LEN = 15
# Timeout applied to API requests forwarded to a node.
API_TIMEOUT_LEN = 100


class Strategy(enum.Enum):
    """Dispatch strategies supported by the request distributor."""
    RANDOM = enum.auto()
    MIN_EXPECTED_LATENCY = enum.auto()
    MIN_OBSERVED_LATENCY = enum.auto()

    @classmethod
    def from_str(cls, name):
        """Map a strategy name from the command line to its enum member."""
        if name == 'random':
            return cls.RANDOM
        elif name == 'min_expected_latency':
            return cls.MIN_EXPECTED_LATENCY
        elif name == 'min_observed_latency':
            return cls.MIN_OBSERVED_LATENCY
        else:
            raise ValueError(f'Invalid strategy: {name}. Supported: random, '
                             f'min_expected_latency, min_observed_latency.')


class ErrorCodes(enum.Enum):
    """Error codes returned by the proxy service."""
    MODEL_NOT_FOUND = 10400
    SERVICE_UNAVAILABLE = 10401
    API_TIMEOUT = 10402


# Human-readable messages for each error code.
err_msg = {
    ErrorCodes.MODEL_NOT_FOUND:
    'The requested model name does not exist in the model list.',
    ErrorCodes.SERVICE_UNAVAILABLE:
    'The service is unavailable now. May retry later.',
    ErrorCodes.API_TIMEOUT: 'Failed to get response after a period of time'
}
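
For reference, a brief usage sketch of the helpers above, assuming the module stays importable as `lmdeploy.constants` as added in this diff (a reviewer suggests moving it under `lmdeploy/serve/proxy`):

```python
# Usage sketch; the import path assumes the module stays at lmdeploy/constants.py
# as in this diff.
from lmdeploy.constants import ErrorCodes, Strategy, err_msg

strategy = Strategy.from_str('min_expected_latency')
assert strategy is Strategy.MIN_EXPECTED_LATENCY

# Look up the human-readable message for an error code.
print(err_msg[ErrorCodes.SERVICE_UNAVAILABLE])
```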