Add Open AI endpoints by mimicking those of vLLM (#25)
* Adding /v1/ to existing functional routes

* Added missing arguments to prepare for the new endpoints

* Added openai endpoints

* Updated the documentation with /v1/

* WIP : add completions request parameters

* Added request pydantic model

* Added response examples for the new routes and fixed the tests

* Added requests examples in the swagger

* Added swagger for models endpoint

* Updated the docs to add the new endpoints

* Changed a comment

* Fix to make the bandit action pass

* Forgot some references

* Simplifying an import
gsolard authored and mfournioux committed Jun 27, 2024
1 parent 4c401b2 commit 62f9a74
Showing 23 changed files with 769 additions and 125 deletions.
10 changes: 6 additions & 4 deletions .env.example
@@ -1,11 +1,10 @@
### All variables present in this .env are case insensitive. The "-" character present in the cli args is replace by an underscore "_"
# The specified values in the example are the default values
### All variables present in this .env are case insensitive. The "-" character present in the cli args is replaced by an underscore "_"
# The specified values in the following examples are the default values

### Log settings ###
### Happy_vLLM log settings ###

# LOG_LEVEL="INFO"


### Application settings ###

# APP_NAME="happy_vllm"
@@ -23,6 +22,9 @@
# SSL_CA_CERTS=None
# DEFAULT_SSL_CERT_REQS=0
# ROOT_PATH=None
# LORA_MODULES=None
# CHAT_TEMPLATE=None
# RESPONSE_ROLE="assistant"


### Model settings ###
3 changes: 2 additions & 1 deletion .gitignore
@@ -10,4 +10,5 @@ site

*.pyc
*.egg-info
.env
.env
bandit_outputs.txt
6 changes: 3 additions & 3 deletions README.md
@@ -36,19 +36,19 @@ pip install -e .
Just use the entrypoint `happy-vllm` (see [arguments](https://oss-pole-emploi.github.io/happy_vllm/arguments/) for a list of all possible arguments)

```bash
happy_vllm --model path_to_model --host 127.0.0.1 --port 5000
happy_vllm --model path_to_model --host 127.0.0.1 --port 5000 --model-name my_model
```

It will launch the API and you can directly query it for example with

```bash
curl 127.0.0.1:5000/info
curl 127.0.0.1:5000/v1/info
```

To get various information on the application or

```bash
curl 127.0.0.1:5000/generate -d '{"prompt": "Hey,"}'
curl 127.0.0.1:5000/v1/completions -d '{"prompt": "Hey,", "model": "my_model"}'
```

if you want to generate your first LLM response using happy_vLLM. See [endpoints](https://oss-pole-emploi.github.io/happy_vllm/endpoints/endpoints) for more details on all the endpoints provided by happy_vLLM.
5 changes: 4 additions & 1 deletion docs/arguments.md
@@ -20,7 +20,7 @@ Here is a list of arguments useful for the application (they all have default va

- `host` : The name of the host (default value is `127.0.0.1`)
- `port` : The port number (default value is `5000`)
- `model-name` : The name of the model which will be given by the `\info` endpoint. It is solely informative and won't have any other purpose (default value is `?`)
- `model-name` : The name of the model which will be given by the `/v1/info` endpoint or the `/v1/models` endpoint. Knowing the name of the model is important in order to use the `/v1/completions` and `/v1/chat/completions` endpoints (default value is `?`)
- `app-name`: The name of the application (default value is `happy_vllm`)
- `api-endpoint-prefix`: The prefix added to all the API endpoints (default value is no prefix)
- `explicit-errors`: If `False`, the message displayed when an `500 error` is encountered will be `Internal Server Error`. If `True`, the message displayed will be more explicit and give information on the underlying error. The `True` setting is not recommended in a production setting (default value is `False`).
@@ -34,6 +34,9 @@ Here is a list of arguments useful for the application (they all have default va
- `ssl-ca-certs`: Uvicorn setting, the CA certificates file (default value is `None`)
- `ssl-cert-reqs`: Uvicorn setting, Whether client certificate is required (see stdlib ssl module's) (default value is `0`)
- `root_path`: The FastAPI root path (default value is `None`)
- `lora-modules`: LoRA module configurations, in the format `name=path`
- `chat-template`: The file path to the chat template, or the template in single-line form for the specified model (see [the documentation of vLLM](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#chat-template) for more details). Useful in the `/v1/chat/completions` endpoint
- `response-role`: The role name to return if `request.add_generation_prompt=true`. Useful in the `/v1/chat/completions` endpoint
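
As an illustration, these new arguments can be combined at launch time. The sketch below is only an example: the model path, LoRA adapter path and template file are placeholders.

```bash
# Sketch only: path_to_model, path_to_lora and chat_template.jinja are placeholders
happy-vllm --model path_to_model --model-name my_model \
           --lora-modules my_lora=path_to_lora \
           --chat-template ./chat_template.jinja \
           --response-role assistant
```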

### Model arguments

10 changes: 5 additions & 5 deletions docs/endpoints/data_manipulation.md
@@ -1,8 +1,8 @@
# Data manipulation endpoints

In this section we will give more details on the endpoints `/metadata_text` and `/split_text`.
In this section we will give more details on the endpoints [`/v1/metadata_text`](#v1metadata_text-post) and [`/v1/split_text`](#v1split_text-post).

## metadata_text (POST)
## /v1/metadata_text (POST)

Returns the number of tokens of a text and indicates the part that would be truncated if too long. Note that this endpoint uses the special version of the tokenizer provided by happy_vLLM (more details [here](tokenizer.md#vanilla-tokenizer-vs-happy_vllm-tokenizer)). The format of the input is as follows:

@@ -15,8 +15,8 @@ Returns the number of tokens of a text and indicates the part that would be trun
```

- `text`: The text we want to analyze
- `truncation_side`: The side of the truncation. This keyword is optional and the default value is the one of the tokenizer which can be obtained for example via the [`/info` endpoint](technical.md#info-get)
- `max_length`: The maximal length of the string before the truncation acts. This keyword is optional and the default value is the `max_model_len` of the model which can be obtained for example via the [`/info` endpoint](technical.md#info-get)
- `truncation_side`: The side of the truncation. This keyword is optional and the default value is the one of the tokenizer which can be obtained for example via the [`/v1/info` endpoint](technical.md#v1info-get)
- `max_length`: The maximal length of the string before the truncation acts. This keyword is optional and the default value is the `max_model_len` of the model which can be obtained for example via the [`/v1/info` endpoint](technical.md#v1info-get)
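
As an illustration, a request to this endpoint could look like the following sketch (the values are illustrative and the host and port are the defaults):

```bash
curl 127.0.0.1:5000/v1/metadata_text -d '{"text": "Hey, how are you ?", "truncation_side": "left", "max_length": 2}'
```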

The format of the output is as follows:

@@ -30,7 +30,7 @@ The format of the output is as follows:
- `tokens_nb`: The number of tokens in the given text
- `truncated_text`: The part of the text which would be truncated

## split_text (POST)
## /v1/split_text (POST)

Splits a text into chunks. You can specify a minimal number of tokens present in each chunk. Each chunk will be delimited by separators you can specify.

26 changes: 17 additions & 9 deletions docs/endpoints/endpoints.md
@@ -4,44 +4,52 @@ happy_vLLM provides several endpoints which cover most of the use cases. Feel fr

## Technical endpoints

### info (GET)
### /v1/info (GET)

Provides information on the API and the model (more details [here](technical.md))

### metrics (GET)
### /metrics (GET)

The technical metrics obtained for prometheus (more details [here](technical.md))

### liveness (GET)
### /liveness (GET)

The liveness endpoint (more details [here](technical.md))

### readiness (GET)
### /readiness (GET)

The readiness endpoint (more details [here](technical.md))

### /v1/models (GET)

The Open AI compatible endpoint used, for example, to get the name of the model. Mimics the vLLM implementation (more details [here](technical.md))

## Generating endpoints

### generate and generate_stream (POST)
### /v1/completions and /v1/chat/completions (POST)

These two endpoints mimic those of vLLM. They follow the Open AI contract, and you can find more details in [the vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)

### DEPRECATED /v1/generate and /v1/generate_stream (POST)

The core of the reactor. These two routes take a prompt and complete it (more details [here](generate.md))

## Tokenizer endpoints

### tokenizer (POST)
### /v1/tokenizer (POST)

Used to tokenize a text (more details [here](tokenizer.md))

### decode (POST)
### /v1/decode (POST)

Used to decode a list of token ids (more details [here](tokenizer.md))

## Data manipulation endpoints

### metadata_text (POST)
### /v1/metadata_text (POST)

Used to know which part of a prompt will be truncated (more details [here](data_manipulation.md))

### split_text (POST)
### /v1/split_text (POST)

Splits a text on some separators, for example to prepare for some RAG (more details [here](data_manipulation.md))
30 changes: 19 additions & 11 deletions docs/endpoints/generate.md
@@ -1,10 +1,18 @@
# Generating endpoints

There are two endpoints used to generate content. They have the same contract, the only difference is in the response.
There are four endpoints used to generate content. The first two, `/v1/completions` and `/v1/chat/completions`, are direct copies of the endpoints provided by vLLM and follow the Open AI contract. The last two, `/v1/generate` and `/v1/generate_stream`, are deprecated and should not be used anymore. We keep the relevant documentation until we delete them.

The `/generate` endpoint will give the whole response in one go whereas the `/generate_stream` endpoint will provide a streaming response.
## Open AI compatible endpoints

## Keywords
For these two endpoints (`/v1/completions` and `/v1/chat/completions`) we refer you to the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html). Some examples of how to use them are available in the swagger (whose address is `127.0.0.1:5000/docs` by default).
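
As an illustration, and assuming the server was launched with `--model-name my_model`, a minimal chat completion request could look like the sketch below:

```bash
curl 127.0.0.1:5000/v1/chat/completions -d '{"model": "my_model", "messages": [{"role": "user", "content": "Hey, how are you?"}]}'
```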

## Deprecated generating endpoints

The two deprecated endpoints have the same contract; the only difference is in the response.

The `/v1/generate` endpoint will give the whole response in one go whereas the `/v1/generate_stream` endpoint will provide a streaming response.

### Keywords

Here are the keywords you can send to the endpoint. The `prompt` keyword is the only one which is mandatory; all others are optional.

@@ -20,7 +28,7 @@ Note that the keyword `logits_processors` is not allowed to be used since happy_

If you would like to add a specific logits processor, feel free to open a PR or an issue.

## Output
### Output

The output is of the following form :

@@ -44,11 +52,11 @@ The `responses` field are the responses to the prompt. The `finish_reasons` fiel
- `abort` means that the request has been aborted
- `None` means that the response is not finished (this happens when you use the streaming endpoint and the generation for this request is ongoing).

## Json format
### Json format

In order to force the LLM to answer in a json format, we implemented [LM-format-enforcer](https://github.com/noamgat/lm-format-enforcer). To be more user-friendly, we implemented two ways to force this response.

### Simple Json
#### Simple Json

You can specify the fields you want and the type of the corresponding values (to choose from the following list `["string", "integer", "boolean", "number"]`) by passing a json in the `json_format` field. You can also specify if the value should be an array by passing the type of the items in an array. For example, by passing the following json in `json_format`:

@@ -75,7 +83,7 @@ the LLM should answer something similar to:

To use this mode, the keyword `json_format_is_json_schema` should be set to `false` (which is the default value)

### Json schema
#### Json schema

In order to permit more complicated json outputs (in particular nested json), you can also use a json schema ([more details here](https://json-schema.org/)). For example, the simple json above could also have been expressed as the following json schema:

@@ -115,9 +123,9 @@ In order to permit more complicated json outputs (in particular nested json), yo

To use this mode, the keyword `json_format_is_json_schema` should be set to `true` (the default value is `false`)

## Examples
### Examples

### Nominal example
#### Nominal example

You can use the following input

@@ -146,7 +154,7 @@ You will receive something similar to this :

Here we can see that the LLM has not completed its response since the `max_tokens` of 50 has been reached, which can be seen via the `finish_reasons` of the response being `length`.

### Use of response_pool
#### Use of response_pool

You can use the following input

@@ -200,7 +208,7 @@ The response should be this :

The LLM generated just a few tokens, providing the response faster. We don't need to parse the answer since we know it is necessarily an item of `["mathematician", "astronaut", "show writer"]`. Moreover, we are not forced to put the choices in the prompt itself, even if it might help get the correct answer.

### Use of json_format
#### Use of json_format

You can use the following input

16 changes: 10 additions & 6 deletions docs/endpoints/technical.md
@@ -1,8 +1,8 @@
# Technical endpoints

Here we present the various technical endpoints : `/info`, `/metrics`, `/liveness` and `/readiness`.
Here we present the various technical endpoints: [`/v1/info`](#v1info-get), [`/metrics`](#metrics-get), [`/liveness`](#liveness-get), [`/readiness`](#readiness-get) and [`/v1/models`](#v1models-get).

## info (GET)
## /v1/info (GET)

This endpoint gives various information on the application and the model. The format of the output is as follows:

@@ -17,11 +17,11 @@ This endpoints gives various information on the application and the model. The f
}
```

## metrics (GET)
## /metrics (GET)

We remind you that this endpoint is the only one not prefixed by the `api_endpoint_prefix`. This endpoint is generated by a prometheus client. For more details on which metrics are available, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/metrics.html)

## liveness (GET)
## /liveness (GET)

Checks if the API is live. The format of the output is as follows:

@@ -31,7 +31,7 @@ Checks if the API is live. The format of the output is as follows:
}
```

## readiness (GET)
## /readiness (GET)

Checks if the API is ready. The format of the output is as follows:

@@ -41,4 +41,8 @@ Checks if the API is ready. The format of the output is as follows:
}
```

If the API is not ready, the value is "ko"
If the API is not ready, the value is "ko"

## /v1/models (GET)

The Open AI compatible endpoint used, for example, to get the name of the model. Mimics the vLLM implementation. Getting the name of the model is important since it is needed to use the [Open AI compatible generating endpoints](generate.md).
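
For example, with the default host and port:

```bash
curl 127.0.0.1:5000/v1/models
```
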
10 changes: 5 additions & 5 deletions docs/endpoints/tokenizer.md
@@ -2,9 +2,9 @@

## Tokenizer endpoints

The tokenizer endpoints allow to use the tokenizer underlying the model. These endpoints are `/tokenizer` and `/decode` and you can find more details on each below.
The tokenizer endpoints allow you to use the tokenizer underlying the model. These endpoints are [`/v1/tokenizer`](#v1tokenizer-post) and [`/v1/decode`](#v1decode-post), and you can find more details on each below.

### tokenizer (POST)
### /v1/tokenizer (POST)

Tokenizes the given text. The format of the input is as follows :

@@ -50,7 +50,7 @@ The format of the output is as follows :
- `tokens_nb`: The number of tokens in the input
- `tokens_str`: The string representation of each token (given only if `with_tokens_str` was set to `true` in the request)

### decode (POST)
### /v1/decode (POST)

Decodes the given token ids. The format of the input is as follows :

@@ -96,7 +96,7 @@ The format of the output is as follows:

## Vanilla tokenizer vs happy_vLLM tokenizer

Using the routes `tokenizer` and `decode`, you can decide if you want to use the usual version of the tokenizers (with the keyword `vanilla` set to `true`). But in some cases, the tokenizer introduces special characters instead of whitespaces, adds a whitespace in front of the string etc. While it is usually the correct way to use the tokenizer (since the models have been trained with these), in some cases, you might want just to get rid of all these additions. We provide a simple way to do so just by setting the keyword `vanilla` to `false` in the routes `tokenizer` and `decode`.
Using the endpoints `/v1/tokenizer` and `/v1/decode`, you can decide if you want to use the usual version of the tokenizers (with the keyword `vanilla` set to `true`). But in some cases, the tokenizer introduces special characters instead of whitespaces, adds a whitespace in front of the string, etc. While this is usually the correct way to use the tokenizer (since the models have been trained with these), in some cases you might just want to get rid of all these additions. We provide a simple way to do so: set the keyword `vanilla` to `false` in the endpoints `/v1/tokenizer` and `/v1/decode`.
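
As an illustration, a request using the happy_vLLM version of the tokenizer might look like the sketch below (the `text` field name is an assumption here; the exact input format is described above):

```bash
# Sketch: "text" is assumed to be the field carrying the string to tokenize
curl 127.0.0.1:5000/v1/tokenizer -d '{"text": "Hey, how are you ? Fine thanks.", "vanilla": false, "with_tokens_str": true}'
```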

For example, if you want to encode and decode the string `Hey, how are you ? Fine thanks.` with the Llama tokenizer, it will create the following tokens (in string form):

@@ -110,4 +110,4 @@ For the happy_vLLM tokenizer:

Note that the "Hey" is not treated the same way, that the whitespaces are directly translated in real whitespaces and there is no initial whitespace.

Note that our modified version of the tokenizer is the one used in the `/metadata_text` endpoint (see [this section](data_manipulation.md#metadata_text-post) for more details). For all other endpoints, the usual tokenizer is used (in particular for the `/generate` and `/generate_stream` routes).
Note that our modified version of the tokenizer is the one used in the `/v1/metadata_text` endpoint (see [this section](data_manipulation.md#v1metadata_text-post) for more details). For all other endpoints, the usual tokenizer is used (in particular for the `/v1/generate` and `/v1/generate_stream` endpoints).
6 changes: 3 additions & 3 deletions docs/index.md
@@ -23,19 +23,19 @@ pip install -e .
Just use the entrypoint `happy-vllm` (see [arguments](arguments.md) for a list of all possible arguments)

```bash
happy-vllm --model path_to_model --host 127.0.0.1 --port 5000
happy-vllm --model path_to_model --host 127.0.0.1 --port 5000 --model-name my_model
```

It will launch the API and you can directly query it for example with

```bash
curl 127.0.0.1:5000/info
curl 127.0.0.1:5000/v1/info
```

To get various information on the application or

```bash
curl 127.0.0.1:5000/generate -d '{"prompt": "Hey,"}'
curl 127.0.0.1:5000/v1/completions -d '{"prompt": "Hey,", "model": "my_model"}'
```

if you want to generate your first LLM response using happy_vLLM. See [endpoints](endpoints/endpoints.md) for more details on all the endpoints provided by happy_vLLM.
9 changes: 0 additions & 9 deletions docs/pros.md
@@ -13,15 +13,6 @@ happy_vLLM add new endpoints useful for the users wich don't need to set up thei

If you would like to see an endpoint added, don't hesitate to open an issue or a PR.

## Already included logits processors, easy to use

happy_vLLM include some logits processors which provide new functionalities to the generation which can simply be accessed via keywords passed to the generation request. Namely:

- The possibility to force the LLM to answer in a set of possible answer. Useful, for example, when you want to use the LLM as a classifier, you can force it to answer only in constrained way to be sure to always have a valid output without any parsing of the response.
- The possibility to force the LLM to answer using a json making the parsing of the answer a piece of cake. The specification of this json are made as simple as possible in order to permit beginner users to use this functionality.

More details on how to use these [here](endpoints/generate.md)

## Swagger

A well documented swagger (the UI being reachable at the `/docs` endpoint) so that users not accustomed to using APIs can quickly get the hang of it and be as autonomous as possible when querying the LLM.