Restful API

Launch Service

lmdeploy serve api_server ./workspace --server_name 0.0.0.0 --server_port ${server_port} --instance_num 32 --tp 1

Then open the Swagger UI at http://{server_ip}:{server_port} for detailed API usage. We provide four RESTful APIs in total, three of which follow the OpenAI format. However, we recommend trying our own API, which exposes more arguments for users to modify and performs comparatively better.

python

Here is an example of our own API, generate.

import json
import requests
from typing import Iterable, Tuple


def get_streaming_response(prompt: str,
                           api_url: str,
                           session_id: int,
                           request_output_len: int,
                           stream: bool = True,
                           sequence_start: bool = True,
                           sequence_end: bool = True,
                           ignore_eos: bool = False
                           ) -> Iterable[Tuple[str, int]]:
    """Yield (text, tokens) pairs from the generate endpoint."""
    headers = {'User-Agent': 'Test Client'}
    pload = {
        'prompt': prompt,
        'stream': stream,
        'session_id': session_id,
        'request_output_len': request_output_len,
        'sequence_start': sequence_start,
        'sequence_end': sequence_end,
        'ignore_eos': ignore_eos
    }
    response = requests.post(
        api_url, headers=headers, json=pload, stream=stream)
    for chunk in response.iter_lines(
            chunk_size=8192, decode_unicode=False, delimiter=b'\n'):
        if chunk:
            data = json.loads(chunk.decode('utf-8'))
            output = data['text']
            tokens = data['tokens']
            yield output, tokens


for output, tokens in get_streaming_response(
        "Hi, how are you?", "http://{server_ip}:{server_port}/generate", 0,
        512):
    print(output, end='')

Java/Golang/Rust

You can use openapi-generator-cli to convert http://{server_ip}:{server_port}/openapi.json into a Java, Rust, or Golang client. Here is an example:

$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust

$ ls rust/*
rust/Cargo.toml  rust/git_push.sh  rust/README.md

rust/docs:
ChatCompletionRequest.md  EmbeddingsRequest.md  HttpValidationError.md  LocationInner.md  Prompt.md
DefaultApi.md             GenerateRequest.md    Input.md                Messages.md       ValidationError.md

rust/src:
apis  lib.rs  models
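
The generator command above reads the spec from /local/openapi.json, so the file has to exist in the current directory before docker runs. A minimal sketch of fetching it in Python follows; the helper names `openapi_spec_url` and `download_openapi_spec` are hypothetical, and the host and port in the commented call are only example values.

```python
import urllib.request


def openapi_spec_url(server_ip: str, server_port: int) -> str:
    """Return the URL where api_server exposes its OpenAPI spec."""
    return f"http://{server_ip}:{server_port}/openapi.json"


def download_openapi_spec(server_ip: str, server_port: int,
                          dest: str = "openapi.json") -> str:
    """Save the spec to `dest` so it is visible under /local in the container."""
    url = openapi_spec_url(server_ip, server_port)
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with open(dest, "wb") as f:
        f.write(data)
    return dest


# Example call (needs a running server; host and port are placeholders):
# download_openapi_spec("localhost", 23333)
```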

cURL

cURL can be used to inspect the output of the API.

List Models:

curl http://{server_ip}:{server_port}/v1/models

Generate:

curl http://{server_ip}:{server_port}/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello! How are you?",
    "session_id": 1,
    "sequence_start": true,
    "sequence_end": true
  }'

Chat Completions:

curl http://{server_ip}:{server_port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm-chat-7b",
    "messages": [{"role": "user", "content": "Hello! How are you?"}]
  }'
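
The same chat request can be sent from Python. The sketch below mirrors the cURL call and assumes the response follows the OpenAI chat completion shape (`choices[0].message.content`), which this document does not spell out. The helper names `build_chat_request`, `extract_reply`, and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(model: str, user_message: str) -> dict:
    """Build the same JSON body as the cURL example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }


def extract_reply(response: dict) -> str:
    """Read the assistant text from an OpenAI-style response (assumed shape)."""
    return response["choices"][0]["message"]["content"]


def chat(api_base: str, model: str, user_message: str) -> str:
    """POST a single-turn request to v1/chat/completions and return the reply."""
    body = json.dumps(build_chat_request(model, user_message)).encode("utf-8")
    req = urllib.request.Request(
        f"{api_base}/v1/chat/completions",
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.loads(resp.read().decode("utf-8")))


# Example call (needs a running server; the URL is a placeholder):
# print(chat("http://localhost:23333", "internlm-chat-7b", "Hello! How are you?"))
```

Keeping the payload builder separate from the network call makes it easy to check the request body against the Swagger UI before sending anything.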

Embeddings:

curl http://{server_ip}:{server_port}/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm-chat-7b",
    "input": "Hello world!"
  }'

CLI client

There is a client script for the RESTful API server.

# restful_api_url is what is printed by api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client restful_api_url

webui

You can also test the RESTful API through the web UI.

# restful_api_url is what is printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True

FAQ

  1. If a response contains "finish_reason":"length", the session has become too long to continue. Add "renew_session": true to the next request.

  2. If the server runs out of memory (OOM), reduce instance_num when launching the service.

  3. If a generate request that reuses a session_id returns an empty result and a negative tokens value, set sequence_start=false for the second request and for every later request in that session.

  4. Requests were previously being handled sequentially rather than concurrently. To resolve this issue,

    • provide a unique session_id for each call to the generate API; otherwise your requests may be associated with the client IP address.
  5. Both the generate API and v1/chat/completions support multi-round conversation, where the input prompt or messages is either a single string or an entire chat history. These inputs are interpreted in multi-turn dialogue mode. However, if you want to turn this mode off and manage the chat history in the client, pass sequence_end: true when using the generate API, or renew_session: true when using v1/chat/completions.
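
Items 3 and 5 can be illustrated with a small sketch that prepares the generate payloads for one multi-turn session. It assumes that sequence_start should be true only on the first request and that sequence_end: false keeps the session open until the final request. The second assumption is inferred from item 5 rather than stated outright. `build_turn_payloads` is a hypothetical helper, not part of lmdeploy.

```python
from typing import Dict, List


def build_turn_payloads(prompts: List[str], session_id: int,
                        request_output_len: int = 512) -> List[Dict]:
    """Build one generate payload per turn of a single session."""
    payloads = []
    last = len(prompts) - 1
    for i, prompt in enumerate(prompts):
        payloads.append({
            'prompt': prompt,
            'session_id': session_id,
            'request_output_len': request_output_len,
            # Only the first request opens the session (see item 3).
            'sequence_start': i == 0,
            # Assumption: only the last request closes the session.
            'sequence_end': i == last,
        })
    return payloads


turns = build_turn_payloads(["Hi", "Tell me more", "Thanks"], session_id=1)
for payload in turns:
    print(payload['sequence_start'], payload['sequence_end'])
```

Each payload can be posted to the generate endpoint in order. To manage the chat history in the client instead, send every request with both flags set to true and include the full history in the prompt.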