Improve api_server and webui usage (#544)
* make IPv6 compatible, safe run for coroutine interrupting

* instance_id -> session_id and fix api_client.py

* update doc

* remove useless faq

* safe ip mapping

* update app.py

* WIP completion

* completion

* update doc

* disable interactive mode for /v1/chat/completions

* docstring

* docstring

* refactor gradio

* update gradio

* update

* update doc

* rename

* session_id default -1

* missed two files

* add an APIClient

* add chat func for APIClient

* refine

* add concurrent function

* sequence_start, sequence_end --> interactive_mode

* update doc

* comments

* doc

* better text completion

* remove /v1/embeddings

* comments

* deprecate generate and use /v1/interactive/completions

* /v1/interactive/completion -> /v1/chat/interactive

* embeddings

* rename

* remove wrong arg description

* docstring

* fix

* update cli

* update doc

* strict session_len limit condition

* pass model args to api_server
AllentDan authored Nov 1, 2023
1 parent 56942c4 commit 373bd01
Showing 21 changed files with 1,280 additions and 929 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -157,16 +157,16 @@ Then, you can communicate with it by command line,

```shell
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client restful_api_url
lmdeploy serve api_client api_server_url
```

or webui,

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
# gradio_ui_ip and gradio_ui_port here are for the gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
```

Refer to [restful_api.md](docs/en/restful_api.md) for more details.
8 changes: 4 additions & 4 deletions README_zh-CN.md
@@ -157,16 +157,16 @@ lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${serv

```shell
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client restful_api_url
lmdeploy serve api_client api_server_url
```

You can also chat through the WebUI:

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
# gradio_ui_ip and gradio_ui_port here are for the gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
```

For more details, please refer to [restful_api.md](docs/zh_cn/restful_api.md)
54 changes: 9 additions & 45 deletions benchmark/profile_restful_api.py
@@ -2,48 +2,15 @@
import multiprocessing as mp
import random
import time
from typing import Iterable, List

import fire
import numpy as np
import requests

from lmdeploy.serve.openai.api_client import get_streaming_response
from lmdeploy.tokenizer import Tokenizer
from lmdeploy.utils import get_logger


def get_streaming_response(prompt: str,
                           api_url: str,
                           session_id: int,
                           request_output_len: int,
                           stream: bool = True,
                           sequence_start: bool = True,
                           sequence_end: bool = False,
                           ignore_eos: bool = False) -> Iterable[List[str]]:
    headers = {'User-Agent': 'Test Client'}
    pload = {
        'prompt': prompt,
        'stream': stream,
        'session_id': session_id,
        'request_output_len': request_output_len,
        'sequence_start': sequence_start,
        'sequence_end': sequence_end,
        'ignore_eos': ignore_eos
    }
    response = requests.post(api_url,
                             headers=headers,
                             json=pload,
                             stream=stream)
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b'\n'):
        if chunk:
            data = json.loads(chunk.decode('utf-8'))
            output = data['text']
            tokens = data['tokens']
            yield output, tokens


def infer(server_addr: str, session_id: int, req_queue: mp.Queue,
          res_que: mp.Queue):
    stats = []
@@ -55,13 +22,12 @@ def infer(server_addr: str, session_id: int, req_queue: mp.Queue,
        timestamps = []
        tokens = []
        start = time.perf_counter()
        for res, token in get_streaming_response(
        for res, token, status in get_streaming_response(
                prompt,
                server_addr,
                session_id,
                request_output_len=output_seqlen,
                sequence_start=True,
                sequence_end=True):
                interactive_mode=False):
            timestamps.append(time.perf_counter())
            tokens.append(token)

@@ -80,13 +46,11 @@ def warmup(server_addr: str,

    def _infer(server_addr, session_id):
        for _ in range(warmup_round):
            for _, _ in get_streaming_response(
                    '',
                    server_addr,
                    session_id,
                    request_output_len=output_seqlen,
                    sequence_start=True,
                    sequence_end=True):
            for _ in get_streaming_response('',
                                            server_addr,
                                            session_id,
                                            request_output_len=output_seqlen,
                                            interactive_mode=False):
                continue

    _start = time.perf_counter()
@@ -150,7 +114,7 @@ def main(server_addr: str,
         concurrency: int = 1,
         session_len: int = 2048,
         samples: int = 1000):
    api_url = server_addr + '/generate'
    api_url = server_addr + '/v1/chat/interactive'
    warmup(api_url, concurrency, session_len - 1)
    req_queue, n_req = read_dataset(tokenizer_path, dataset_path, samples,
                                    session_len)
3 changes: 2 additions & 1 deletion benchmark/profile_throughput.py
@@ -24,7 +24,8 @@ def sample_requests(
    dataset = [data for data in dataset if len(data['conversations']) >= 2]
    # Only keep the first two turns of each conversation.
    dataset = [(data['conversations'][0]['value'],
                data['conversations'][1]['value']) for data in dataset]
                data['conversations'][1]['value'])
               for data in dataset][:num_requests * 2]  # speed up encoding

    # Tokenize the prompts and completions.
    prompts = [prompt for prompt, _ in dataset]
126 changes: 63 additions & 63 deletions docs/en/restful_api.md
@@ -7,52 +7,57 @@ lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${serv
```

Then, the user can open the swagger UI: `http://{server_ip}:{server_port}` for the detailed api usage.
We provide four restful api in total. Three of them are in OpenAI format. However, we recommend users try
our own api which provides more arguments for users to modify. The performance is comparatively better.
We provide four restful apis in total. Three of them are in the OpenAI format:

- /v1/chat/completions
- /v1/models
- /v1/completions

However, we recommend that users try our own api `/v1/chat/interactive`, which provides more arguments for users to modify. Its performance is comparatively better.

**Note**: if you want to launch multiple requests, please set a different `session_id` for each of them when calling the `/v1/chat/completions` and `/v1/chat/interactive` apis. Otherwise, random values will be assigned.
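
For illustration, here is a minimal sketch of issuing concurrent requests with distinct `session_id` values. It is not part of the commit: the endpoint and the `prompt`, `session_id`, and `interactive_mode` fields follow the cURL examples below, while the URL, the prompts, and the non-streaming `stream` field are assumptions.

```python
import threading

import requests

api_url = 'http://localhost:23333/v1/chat/interactive'  # placeholder URL


def ask(prompt: str, session_id: int):
    payload = {
        'prompt': prompt,
        'session_id': session_id,  # unique per concurrent request
        'interactive_mode': False,
        'stream': False,  # assumed: ask for a single JSON reply instead of a stream
    }
    resp = requests.post(api_url, json=payload)
    print(session_id, resp.json())


# Two concurrent requests, each with its own session_id.
threads = [
    threading.Thread(target=ask, args=('Hi, how are you?', 1)),
    threading.Thread(target=ask, args=('Say this is a test!', 2)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```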

### python

Here is an example for our own api `generate`.
We have integrated the client-side functionalities of these services into the `APIClient` class. Below are some examples demonstrating how to invoke the `api_server` service on the client side.

If you want to use the `/v1/chat/completions` endpoint, you can try the following code:

```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
messages = [{"role": "user", "content": "Say this is a test!"}]
for item in api_client.chat_completions_v1(model=model_name, messages=messages):
    print(item)
```

If you want to use the `/v1/completions` endpoint, you can try the following code:

```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
for item in api_client.completions_v1(model=model_name, prompt='hi'):
    print(item)
```

LMDeploy supports maintaining session histories on the server for the `/v1/chat/interactive` api. The feature is
disabled by default.

- In interactive mode, the chat history is kept on the server. For a multi-round conversation, set
  `interactive_mode = True` and pass the same `session_id` (it cannot be -1, which is the default value) to `/v1/chat/interactive` in every request.
- In normal mode, no chat history is kept on the server.

The mode is controlled by the `interactive_mode` boolean parameter. The following is an example of normal mode; to experience the interactive mode, simply pass in `interactive_mode=True` (see the multi-round sketch after this example).

```python
import json
import requests
from typing import Iterable, List


def get_streaming_response(prompt: str,
                           api_url: str,
                           session_id: int,
                           request_output_len: int,
                           stream: bool = True,
                           sequence_start: bool = True,
                           sequence_end: bool = True,
                           ignore_eos: bool = False) -> Iterable[List[str]]:
    headers = {'User-Agent': 'Test Client'}
    pload = {
        'prompt': prompt,
        'stream': stream,
        'session_id': session_id,
        'request_output_len': request_output_len,
        'sequence_start': sequence_start,
        'sequence_end': sequence_end,
        'ignore_eos': ignore_eos
    }
    response = requests.post(
        api_url, headers=headers, json=pload, stream=stream)
    for chunk in response.iter_lines(
            chunk_size=8192, decode_unicode=False, delimiter=b'\n'):
        if chunk:
            data = json.loads(chunk.decode('utf-8'))
            output = data['text']
            tokens = data['tokens']
            yield output, tokens


for output, tokens in get_streaming_response(
        "Hi, how are you?", "http://{server_ip}:{server_port}/generate", 0,
        512):
    print(output, end='')
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
for item in api_client.generate(prompt='hi'):
    print(item)
```
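
The following is a similar sketch of the interactive mode, again illustrative rather than part of the commit: it reuses one non-default `session_id` with `interactive_mode=True` so the server keeps the history between turns. The endpoint and the `prompt`, `session_id`, and `interactive_mode` fields follow the cURL example below; the URL, the prompts, and the `stream` field are assumptions.

```python
import requests

api_url = 'http://localhost:23333/v1/chat/interactive'  # placeholder URL
session_id = 1  # must not be -1, which is the default

for prompt in ['Hi, my name is Allen.', 'What is my name?']:
    payload = {
        'prompt': prompt,
        'session_id': session_id,  # same id on every turn
        'interactive_mode': True,  # keep the chat history on the server
        'stream': False,  # assumed: single JSON reply per turn
    }
    resp = requests.post(api_url, json=payload)
    print(resp.json())
```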

### Java/Golang/Rust
Expand Down Expand Up @@ -84,16 +89,15 @@ List Models:
curl http://{server_ip}:{server_port}/v1/models
```

Generate:
Interactive Chat:

```bash
curl http://{server_ip}:{server_port}/generate \
curl http://{server_ip}:{server_port}/v1/chat/interactive \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello! How are you?",
"session_id": 1,
"sequence_start": true,
"sequence_end": true
"interactive_mode": true
}'
```

@@ -104,19 +108,19 @@ curl http://{server_ip}:{server_port}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"messages": [{"role": "user", "content": "Hello! Ho are you?"}]
"messages": [{"role": "user", "content": "Hello! How are you?"}]
}'
```

Embeddings:
Text Completions:

```bash
curl http://{server_ip}:{server_port}/v1/embeddings \
-H "Content-Type: application/json" \
```shell
curl http://{server_ip}:{server_port}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "internlm-chat-7b",
"input": "Hello world!"
}'
"model": "llama",
"prompt": "two steps to build a house:"
}'
```

### CLI client
@@ -125,18 +129,18 @@ There is a client script for restful api server.

```shell
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client restful_api_url
lmdeploy serve api_client api_server_url
```

### webui

You can also test restful-api through webui.

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
# gradio_ui_ip and gradio_ui_port here are for the gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
```

### FAQ
@@ -146,10 +150,6 @@ lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port $

2. When OOM appears on the server side, please reduce `instance_num` when launching the service.

3. When a request with the same `session_id` to `generate` gets an empty return value and a negative `tokens`, please consider setting `sequence_start=false` for the second question and for all subsequent ones.

4. Requests were previously being handled sequentially rather than concurrently. To resolve this issue,

- kindly provide unique session_id values when calling the `generate` API or else your requests may be associated with client IP addresses
3. When a request with the same `session_id` to `/v1/chat/interactive` gets an empty return value and a negative `tokens`, please consider setting `interactive_mode=false` to restart the session.

5. Both the `generate` api and `v1/chat/completions` support engaging in multiple rounds of conversation, where the input `prompt` or `messages` consists of either single strings or entire chat histories. These inputs are interpreted using multi-turn dialogue modes. However, if you want to turn the mode off and manage the chat history in clients, please pass the parameter `sequence_end: true` when utilizing the `generate` function, or specify `renew_session: true` when making use of `v1/chat/completions`
4. The `/v1/chat/interactive` api disables multi-round conversation by default. The input argument `prompt` can be either a single string or an entire chat history, as shown in the sketch below.
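
A minimal sketch of the second case, assuming the chat history is given as OpenAI-style `role`/`content` messages (this message format is an assumption; the endpoint and the `prompt` field follow the cURL example above, and the URL and `stream` field are placeholders):

```python
import requests

api_url = 'http://localhost:23333/v1/chat/interactive'  # placeholder URL
history = [
    {'role': 'user', 'content': 'Hi, my name is Allen.'},
    {'role': 'assistant', 'content': 'Nice to meet you, Allen!'},
    {'role': 'user', 'content': 'What is my name?'},
]
# The whole history is passed as the `prompt`; the assumed message format
# above may need adjusting to match what the server actually expects.
resp = requests.post(api_url, json={'prompt': history, 'stream': False})
print(resp.json())
```
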
8 changes: 4 additions & 4 deletions docs/en/supported_models/codellama.md
@@ -97,16 +97,16 @@ Then, you can communicate with it by command line,

```shell
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client restful_api_url
lmdeploy serve api_client api_server_url
```

or through webui after launching gradio,

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# api_server_url is what is printed by api_server.py, e.g. http://localhost:23333
# gradio_ui_ip and gradio_ui_port here are for the gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
```

Regarding the detailed information of RESTful API, you can refer to [restful_api.md](../restful_api.md).