Add request distributor server #903

Merged
merged 13 commits on Jan 12, 2024
Changes from 7 commits
1 change: 1 addition & 0 deletions .gitignore
@@ -77,3 +77,4 @@ work_dir*/
*.pkl

!CMakeLists.txt
proxy_config.yml
1 change: 1 addition & 0 deletions docs/en/index.rst
@@ -46,6 +46,7 @@ Welcome to LMDeploy's tutorials!
:maxdepth: 1
:caption: serving

serving/proxy_server.md
serving/restful_api.md

.. _quantization:
39 changes: 39 additions & 0 deletions docs/en/serving/proxy_server.md
@@ -0,0 +1,39 @@
## Proxy
Collaborator: H1 title: "Request Distributor Server"


The proxy service runs multiple api_server services in parallel behind a single endpoint. Users only need to access the proxy URL to reach the different api_server services indirectly; the proxy distributes requests across them automatically, achieving load balancing.

### Startup

Start the proxy service:

```shell
python lmdeploy/serve/proxy/proxy.py --server_name {server_name} --server_port {server_port} --strategy "min_expected_latency"
```

Collaborator: Can it be written this way instead? `python3 -m lmdeploy.serve.proxy --server-name {server_name} --server-port`

Once startup succeeds, the script also prints the URL of the proxy service. Open this URL in your browser to access the Swagger UI.

### API

Through the Swagger UI, we can see multiple APIs. Those related to api_server node management include:

- /nodes/status
- /nodes/add
- /nodes/remove

They are used, respectively, to view all api_server service nodes, add a node, and remove a node.
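
As a quick illustration, the following snippet registers an api_server node with the proxy and then queries the node list using Python's `requests` library. This is a sketch only: the proxy address and the `url` payload field are assumptions for illustration; check the Swagger UI of your deployment for the exact request schema of these endpoints.

```python
# Hypothetical sketch: the payload field ("url") and the addresses below are
# assumptions; consult the proxy's Swagger UI for the exact endpoint schema.
import requests

PROXY = 'http://0.0.0.0:8000'  # the {server_name}:{server_port} given at startup

# Register a running api_server instance with the proxy.
resp = requests.post(f'{PROXY}/nodes/add', json={'url': 'http://0.0.0.0:23333'})
print(resp.status_code)

# View all api_server nodes currently known to the proxy.
print(requests.get(f'{PROXY}/nodes/status').json())
```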

APIs related to usage include:

- /v1/models
- /v1/chat/completions
- /v1/completions

The usage of these APIs is the same as that of api_server.
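
For example, a chat request can be sent through the proxy exactly as it would be sent to a single api_server. The sketch below uses the `requests` library; the proxy address and model name are placeholders, not values taken from this PR.

```python
# Minimal sketch: the proxy address and model name are placeholders; the request
# body follows the same OpenAI-style schema that api_server accepts.
import requests

PROXY = 'http://0.0.0.0:8000'

payload = {
    'model': 'internlm-chat-7b',
    'messages': [{'role': 'user', 'content': 'Hello!'}],
}
resp = requests.post(f'{PROXY}/v1/chat/completions', json=payload)
print(resp.json())
```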

### Dispatch Strategy

The current dispatch strategies of the proxy service are as follows:

- random: weighted random dispatch based on the request-processing capacity the user provides for each api_server node. The higher a node's declared throughput, the more likely it is to receive a request. Nodes without a declared throughput are assigned the average throughput of the other nodes.
- min_expected_latency: estimates, for each node, the time needed to complete a response from the requests currently queued on it and its throughput, then dispatches to the node with the shortest expected time. Nodes without a declared throughput are handled as above.
- min_observed_latency: dispatches to the node with the shortest average latency over a fixed number of its most recent requests.
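
To make the differences concrete, here is a small illustrative sketch of the three strategies. It is not the proxy's actual implementation, and the node fields (`queued`, `throughput`, `latencies`) are invented for the example.

```python
# Illustrative sketch of the three dispatch strategies; not the proxy's real code.
# Each node dict uses made-up fields: queued requests, declared throughput
# (requests/s), and latencies (seconds) of recently finished requests.
import random

def pick_random(nodes):
    # Weighted random choice: higher declared throughput -> higher probability.
    weights = [n['throughput'] for n in nodes]
    return random.choices(range(len(nodes)), weights=weights, k=1)[0]

def pick_min_expected_latency(nodes):
    # Expected time to drain each node's queue given its throughput.
    return min(range(len(nodes)),
               key=lambda i: nodes[i]['queued'] / nodes[i]['throughput'])

def pick_min_observed_latency(nodes):
    # Average latency over each node's recent requests.
    return min(range(len(nodes)),
               key=lambda i: sum(nodes[i]['latencies']) / len(nodes[i]['latencies']))

nodes = [
    {'queued': 4, 'throughput': 2.0, 'latencies': [1.2, 1.0, 1.1]},
    {'queued': 1, 'throughput': 1.0, 'latencies': [0.8, 0.9, 0.7]},
]
print(pick_random(nodes), pick_min_expected_latency(nodes), pick_min_observed_latency(nodes))
```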
4 changes: 4 additions & 0 deletions docs/en/serving/restful_api.md
@@ -160,3 +160,7 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port
4. The `/v1/chat/interactive` api disables multi-round conversation by default. The input argument `prompt` can be either a single string or an entire chat history.

5. If you need to adjust other default session parameters, such as the content of fields like `system`, you can directly pass in the initialization parameters of the [dialogue template](https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/model.py). For example, for the internlm-chat-7b model, you can set the `--meta_instruction` parameter when starting the `api_server`.

### multiple services
Collaborator: request distribution service


Please refer to our [proxy service](./proxy_server.md).
1 change: 1 addition & 0 deletions docs/zh_cn/index.rst
@@ -47,6 +47,7 @@
:maxdepth: 1
:caption: 服务

serving/proxy_server.md
serving/restful_api.md


39 changes: 39 additions & 0 deletions docs/zh_cn/serving/proxy_server.md
@@ -0,0 +1,39 @@
## Proxy

The proxy service connects multiple api_server services in parallel. Users only need to access the proxy URL to reach the different api_server services indirectly; the proxy automatically distributes requests internally, achieving load balancing.

### Startup

Start the proxy service:

```shell
python lmdeploy/serve/proxy/proxy.py --server_name {server_name} --server_port {server_port} --strategy "min_expected_latency"
```

Once startup succeeds, the script also prints the URL of the proxy service. Open this URL in a browser to access the Swagger UI.

### API

Through the Swagger UI, we can see multiple APIs. Those related to api_server node management are:

- /nodes/status
- /nodes/add
- /nodes/remove

They are used, respectively, to view all api_server service nodes, add a node, and remove a node.

The APIs related to usage are:

- /v1/models
- /v1/chat/completions
- /v1/completions

These APIs are used in the same way as those of api_server.

### Dispatch Strategy

The current dispatch strategies of the proxy service are as follows:

- random: weighted random dispatch based on the request-processing capacity the user provides for each api_server node. The higher a node's declared throughput, the more likely it is to receive a request. Nodes without a declared throughput are assigned the average throughput of the other nodes.
- min_expected_latency: estimates, for each node, the time needed to complete a response from the requests currently queued on it and its throughput, then dispatches to the node with the shortest expected time. Nodes without a declared throughput are handled as above.
- min_observed_latency: dispatches to the node with the shortest average latency over a fixed number of its most recent requests.
4 changes: 4 additions & 0 deletions docs/zh_cn/serving/restful_api.md
@@ -154,3 +154,7 @@ lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port
4. The `/v1/chat/interactive` api supports multi-round conversation, but it is disabled by default. The `messages` or `prompt` argument can be either a simple string representing a single user question or a full chat history.

5. To adjust other default session parameters, such as the content of fields like `system`, you can directly pass in the initialization parameters of the [dialogue template](https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/model.py). For example, for the internlm-chat-7b model, set the `--meta_instruction` parameter when starting the `api_server`.

### Multiple services in parallel
Collaborator: multi-machine parallel service


Please refer to our [proxy service](./proxy_server.md).
39 changes: 39 additions & 0 deletions lmdeploy/constants.py
@@ -0,0 +1,39 @@
# Copyright (c) OpenMMLab. All rights reserved.
Collaborator: Is it better to put constants.py to lmdeploy/serve/proxy?


import enum

# Length of the per-node deque that records recent request latencies.
LATENCY_DEEQUE_LEN = 15
# Timeout applied to API requests forwarded to a node.
API_TIMEOUT_LEN = 100


class Strategy(enum.Enum):
    """Dispatch strategies supported by the request distributor."""
    RANDOM = enum.auto()
    MIN_EXPECTED_LATENCY = enum.auto()
    MIN_OBSERVED_LATENCY = enum.auto()

    @classmethod
    def from_str(cls, name):
        """Map a strategy name from the command line to its enum member."""
        if name == 'random':
            return cls.RANDOM
        elif name == 'min_expected_latency':
            return cls.MIN_EXPECTED_LATENCY
        elif name == 'min_observed_latency':
            return cls.MIN_OBSERVED_LATENCY
        else:
            raise ValueError(f'Invalid strategy: {name}. Supported: random, '
                             f'min_expected_latency, min_observed_latency.')


class ErrorCodes(enum.Enum):
    """Error codes returned by the proxy service."""
    MODEL_NOT_FOUND = 10400
    SERVICE_UNAVAILABLE = 10401
    API_TIMEOUT = 10402


# Human-readable messages for each error code.
err_msg = {
    ErrorCodes.MODEL_NOT_FOUND:
    'The requested model name does not exist in the model list.',
    ErrorCodes.SERVICE_UNAVAILABLE:
    'The service is unavailable now. May retry later.',
    ErrorCodes.API_TIMEOUT: 'Failed to get response after a period of time'
}
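
For reference, a brief usage sketch of the helpers above, assuming the module stays importable as `lmdeploy.constants` as added in this diff (a reviewer suggests moving it under `lmdeploy/serve/proxy`):

```python
# Usage sketch; the import path assumes the module stays at lmdeploy/constants.py
# as in this diff.
from lmdeploy.constants import ErrorCodes, Strategy, err_msg

strategy = Strategy.from_str('min_expected_latency')
assert strategy is Strategy.MIN_EXPECTED_LATENCY

# Look up the human-readable message for an error code.
print(err_msg[ErrorCodes.SERVICE_UNAVAILABLE])
```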