Add Open AI endpoints by mimicking those of vLLM (#25)
* Adding /v1/ to existing functional routes

* Added missing arguments to prepare for the new endpoints

* Added openai endpoints

* Updated the documentation with /v1/

* WIP : add completions request parameters

* Added request pydantic model

* Added response examples for the new routes and fixed the tests

* Added requests examples in the swagger

* Added swagger for models endpoint

* Updated the docs to add the new endpoints

* Changed a comment

* Fix to make the bandit action pass

* Forgot some references

* Simplifying an import
gsolard authored and mfournioux committed Jun 27, 2024
1 parent 4c401b2 commit 62f9a74
Showing 23 changed files with 769 additions and 125 deletions.
10 changes: 6 additions & 4 deletions .env.example
@@ -1,11 +1,10 @@
### All variables present in this .env are case insensitive. The "-" character present in the cli args is replace by an underscore "_"
# The specified values in the example are the default values
### All variables present in this .env are case insensitive. The "-" character present in the cli args is replaced by an underscore "_"
# The specified values in the following examples are the default values

### Log settings ###
### Happy_vLLM log settings ###

# LOG_LEVEL="INFO"


### Application settings ###

# APP_NAME="happy_vllm"
@@ -23,6 +22,9 @@
# SSL_CA_CERTS=None
# DEFAULT_SSL_CERT_REQS=0
# ROOT_PATH=None
# LORA_MODULES=None
# CHAT_TEMPLATE=None
# RESPONSE_ROLE="assistant"


### Model settings ###
3 changes: 2 additions & 1 deletion .gitignore
@@ -10,4 +10,5 @@ site

*.pyc
*.egg-info
.env
.env
bandit_outputs.txt
6 changes: 3 additions & 3 deletions README.md
@@ -36,19 +36,19 @@ pip install -e .
Just use the entrypoint `happy-vllm` (see [arguments](https://oss-pole-emploi.github.io/happy_vllm/arguments/) for a list of all possible arguments)

```bash
happy_vllm --model path_to_model --host 127.0.0.1 --port 5000
happy_vllm --model path_to_model --host 127.0.0.1 --port 5000 --model-name my_model
```

It will launch the API and you can directly query it for example with

```bash
curl 127.0.0.1:5000/info
curl 127.0.0.1:5000/v1/info
```

To get various information on the application or

```bash
curl 127.0.0.1:5000/generate -d '{"prompt": "Hey,"}'
curl 127.0.0.1:5000/v1/completions -d '{"prompt": "Hey,", "model": "my_model"}'
```

if you want to generate your first LLM response using happy_vLLM. See [endpoints](https://oss-pole-emploi.github.io/happy_vllm/endpoints/endpoints) for more details on all the endpoints provided by happy_vLLM.
5 changes: 4 additions & 1 deletion docs/arguments.md
@@ -20,7 +20,7 @@ Here is a list of arguments useful for the application (they all have default va

- `host` : The name of the host (default value is `127.0.0.1`)
- `port` : The port number (default value is `5000`)
- `model-name` : The name of the model which will be given by the `\info` endpoint. It is solely informative and won't have any other purpose (default value is `?`)
- `model-name` : The name of the model which will be given by the `/v1/info` endpoint or the `/v1/models` endpoint. Knowing the name of the model is important in order to use the `/v1/completions` and `/v1/chat/completions` endpoints (default value is `?`)
- `app-name`: The name of the application (default value is `happy_vllm`)
- `api-endpoint-prefix`: The prefix added to all the API endpoints (default value is no prefix)
- `explicit-errors`: If `False`, the message displayed when an `500 error` is encountered will be `Internal Server Error`. If `True`, the message displayed will be more explicit and give information on the underlying error. The `True` setting is not recommended in a production setting (default value is `False`).
@@ -34,6 +34,9 @@ Here is a list of arguments useful for the application (they all have default va
- `ssl-ca-certs`: Uvicorn setting, the CA certificates file (default value is `None`)
- `ssl-cert-reqs`: Uvicorn setting, Whether client certificate is required (see stdlib ssl module's) (default value is `0`)
- `root_path`: The FastAPI root path (default value is `None`)
- `lora-modules`: LoRA module configurations, in the format `name=path`
- `chat-template`: The file path to the chat template, or the template in single-line form for the specified model (see [the documentation of vLLM](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#chat-template) for more details). Useful in the `/v1/chat/completions` endpoint
- `response-role`: The role name to return if `request.add_generation_prompt=true`. Useful in the `/v1/chat/completions` endpoint
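
As an illustration, these new arguments can be combined at launch time. The sketch below is only an example: the model path, LoRA adapter path and template file are placeholders.

```bash
# Sketch only: path_to_model, path_to_lora and chat_template.jinja are placeholders
happy-vllm --model path_to_model --model-name my_model \
           --lora-modules my_lora=path_to_lora \
           --chat-template ./chat_template.jinja \
           --response-role assistant
```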

### Model arguments

10 changes: 5 additions & 5 deletions docs/endpoints/data_manipulation.md
@@ -1,8 +1,8 @@
# Data manipulation endpoints

In this section we will give more details on the endpoints `/metadata_text` and `/split_text`.
In this section we will give more details on the endpoints [`/v1/metadata_text`](#v1metadata_text-post) and [`/v1/split_text`](#v1split_text-post).

## metadata_text (POST)
## /v1/metadata_text (POST)

Returns the number of tokens of a text and indicates the part that would be truncated if too long. Note that this endpoint uses the special version of the tokenizer provided by happy_vLLM (more details [here](tokenizer.md#vanilla-tokenizer-vs-happy_vllm-tokenizer)). The format of the input is as follows:

@@ -15,8 +15,8 @@ Returns the number of tokens of a text and indicates the part that would be trun
```

- `text`: The text we want to analyze
- `truncation_side`: The side of the truncation. This keyword is optional and the default value is the one of the tokenizer which can be obtained for example via the [`/info` endpoint](technical.md#info-get)
- `max_length`: The maximal length of the string before the truncation acts. This keyword is optional and the default value is the `max_model_len` of the model which can be obtained for example via the [`/info` endpoint](technical.md#info-get)
- `truncation_side`: The side of the truncation. This keyword is optional and the default value is the one of the tokenizer which can be obtained for example via the [`/v1/info` endpoint](technical.md#v1info-get)
- `max_length`: The maximal length of the string before the truncation acts. This keyword is optional and the default value is the `max_model_len` of the model which can be obtained for example via the [`/v1/info` endpoint](technical.md#v1info-get)
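
As an illustration, a request to this endpoint could look like the following sketch (the values are illustrative and the host and port are the defaults):

```bash
curl 127.0.0.1:5000/v1/metadata_text -d '{"text": "Hey, how are you ?", "truncation_side": "left", "max_length": 2}'
```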

The format of the output is as follows:

@@ -30,7 +30,7 @@ The format of the output is as follows:
- `tokens_nb`: The number of tokens in the given text
- `truncated_text`: The part of the text which would be truncated

## split_text (POST)
## /v1/split_text (POST)

Splits a text into chunks. You can specify a minimal number of tokens present in each chunk. Each chunk will be delimited by separators you can specify.

26 changes: 17 additions & 9 deletions docs/endpoints/endpoints.md
@@ -4,44 +4,52 @@ happy_vLLM provides several endpoints which cover most of the use cases. Feel fr

## Technical endpoints

### info (GET)
### /v1/info (GET)

Provides information on the API and the model (more details [here](technical.md))

### metrics (GET)
### /metrics (GET)

The technical metrics obtained for prometheus (more details [here](technical.md))

### liveness (GET)
### /liveness (GET)

The liveness endpoint (more details [here](technical.md))

### readiness (GET)
### /readiness (GET)

The readiness endpoint (more details [here](technical.md))

### /v1/models (GET)

The Open AI compatible endpoint used, for example, to get the name of the model. Mimics the vLLM implementation (more details [here](technical.md))

## Generating endpoints

### generate and generate_stream (POST)
### /v1/completions and /v1/chat/completions (POST)

These two endpoints mimic those of vLLM. They follow the Open AI contract, and you can find more details in [the vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)

### DEPRECATED /v1/generate and /v1/generate_stream (POST)

The core of the reactor. These two routes take a prompt and complete it (more details [here](generate.md))

## Tokenizer endpoints

### tokenizer (POST)
### /v1/tokenizer (POST)

Used to tokenize a text (more details [here](tokenizer.md))

### decode (POST)
### /v1/decode (POST)

Used to decode a list of token ids (more details [here](tokenizer.md))

## Data manipulation endpoints

### metadata_text (POST)
### /v1/metadata_text (POST)

Used to know which part of a prompt will be truncated (more details [here](data_manipulation.md))

### split_text (POST)
### /v1/split_text (POST)

Splits a text on some separators, for example to prepare for some RAG (more details [here](data_manipulation.md))
30 changes: 19 additions & 11 deletions docs/endpoints/generate.md
@@ -1,10 +1,18 @@
# Generating endpoints

There are two endpoints used to generate content. They have the same contract, the only difference is in the response.
There are four endpoints used to generate content. The first two, `/v1/completions` and `/v1/chat/completions`, are direct copies of the endpoints provided by vLLM and follow the Open AI contract. The last two, `/v1/generate` and `/v1/generate_stream`, are deprecated and should not be used anymore. We keep the relevant documentation until we delete them.

The `/generate` endpoint will give the whole response in one go whereas the `/generate_stream` endpoint will provide a streaming response.
## Open AI compatible endpoints

## Keywords
For these two endpoints (`/v1/completions` and `/v1/chat/completions`) we refer you to the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html). Some examples of how to use them are available in the swagger (whose address is `127.0.0.1:5000/docs` by default).
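
As an illustration, and assuming the server was launched with `--model-name my_model`, a minimal chat completion request could look like the sketch below:

```bash
curl 127.0.0.1:5000/v1/chat/completions -d '{"model": "my_model", "messages": [{"role": "user", "content": "Hey, how are you?"}]}'
```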

## Deprecated generating endpoints

The two deprecated endpoints have the same contract; the only difference is in the response.

The `/v1/generate` endpoint will give the whole response in one go whereas the `/v1/generate_stream` endpoint will provide a streaming response.

### Keywords

Here are the keywords you can send to the endpoint. The `prompt` keyword is the only one which is mandatory; all others are optional.

@@ -20,7 +28,7 @@ Note that the keyword `logits_processors` is not allowed to be used since happy_

If you would like to add a specific logits processor, feel free to open a PR or an issue.

## Output
### Output

The output is of the following form :

@@ -44,11 +52,11 @@ The `responses` field are the responses to the prompt. The `finish_reasons` fiel
- `abort` means that the request has been aborted
- `None` means that the response is not finished (this happens when you use the streaming endpoint and the generation for this request is ongoing).

## Json format
### Json format

In order to force the LLM to answer in a json format, we implemented [LM-format-enforcer](https://github.com/noamgat/lm-format-enforcer). To be more user-friendly, we implemented two ways to force this response.

### Simple Json
#### Simple Json

You can specify the fields you want and the type of the corresponding values (to choose from the following list `["string", "integer", "boolean", "number"]`) by passing a json in the `json_format` field. You can also specify if the value should be an array by passing the type of the items in an array. For example, by passing the following json in `json_format`:

@@ -75,7 +83,7 @@ the LLM should answer something similar to:

To use this mode, the keyword `json_format_is_json_schema` should be set to `false` (which is the default value)

### Json schema
#### Json schema

In order to permit more complicated json outputs (in particular nested json), you can also use a json schema ([more details here](https://json-schema.org/)). For example, the simple json above could also have been expressed as the following json schema:

@@ -115,9 +123,9 @@ In order to permit more complicated json outputs (in particular nested json), yo

To use this mode, the keyword `json_format_is_json_schema` should be set to `true` (the default value is `false`)

## Examples
### Examples

### Nominal example
#### Nominal example

You can use the following input

@@ -146,7 +154,7 @@ You will receive something similar to this :

Here we can see that the LLM has not completed its response since the `max_tokens` of 50 has been reached, which can be seen via the `finish_reasons` of the response being `length`.

### Use of response_pool
#### Use of response_pool

You can use the following input

@@ -200,7 +208,7 @@ The response should be this :

The LLM generated just a few tokens, providing the response faster. We don't need to parse the answer since we know it is necessarily an item of `["mathematician", "astronaut", "show writer"]`. Moreover, we are not forced to put the choices in the prompt itself, even if it might help get the correct answer.

### Use of json_format
#### Use of json_format

You can use the following input

16 changes: 10 additions & 6 deletions docs/endpoints/technical.md
@@ -1,8 +1,8 @@
# Technical endpoints

Here we present the various technical endpoints : `/info`, `/metrics`, `/liveness` and `/readiness`.
Here we present the various technical endpoints: [`/v1/info`](#v1info-get), [`/metrics`](#metrics-get), [`/liveness`](#liveness-get), [`/readiness`](#readiness-get) and [`/v1/models`](#v1models-get).

## info (GET)
## /v1/info (GET)

This endpoint gives various information on the application and the model. The format of the output is as follows:

@@ -17,11 +17,11 @@ This endpoints gives various information on the application and the model. The f
}
```

## metrics (GET)
## /metrics (GET)

We remind you that this endpoint is the only one not prefixed by the `api_endpoint_prefix`. This endpoint is generated by a prometheus client. For more details on which metrics are available, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/metrics.html)

## liveness (GET)
## /liveness (GET)

Checks if the API is live. The format of the output is as follows:

@@ -31,7 +31,7 @@ Checks if the API is live. The format of the output is as follows:
}
```

## readiness (GET)
## /readiness (GET)

Checks if the API is ready. The format of the output is as follows:

@@ -41,4 +41,8 @@ Checks if the API is ready. The format of the output is as follows:
}
```

If the API is not ready, the value is "ko"
If the API is not ready, the value is "ko"

## /v1/models (GET)

The Open AI compatible endpoint used, for example, to get the name of the model. Mimics the vLLM implementation. Getting the name of the model is important since it is needed to use the [Open AI compatible generating endpoints](generate.md).
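
For example, with the default host and port:

```bash
curl 127.0.0.1:5000/v1/models
```
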
10 changes: 5 additions & 5 deletions docs/endpoints/tokenizer.md
@@ -2,9 +2,9 @@

## Tokenizer endpoints

The tokenizer endpoints allow to use the tokenizer underlying the model. These endpoints are `/tokenizer` and `/decode` and you can find more details on each below.
The tokenizer endpoints allow you to use the tokenizer underlying the model. These endpoints are [`/v1/tokenizer`](#v1tokenizer-post) and [`/v1/decode`](#v1decode-post), and you can find more details on each below.

### tokenizer (POST)
### /v1/tokenizer (POST)

Tokenizes the given text. The format of the input is as follows :

@@ -50,7 +50,7 @@ The format of the output is as follows :
- `tokens_nb`: The number of tokens in the input
- `tokens_str`: The string representation of each token (given only if `with_tokens_str` was set to `true` in the request)

### decode (POST)
### /v1/decode (POST)

Decodes the given token ids. The format of the input is as follows :

@@ -96,7 +96,7 @@ The format of the output is as follows:

## Vanilla tokenizer vs happy_vLLM tokenizer

Using the routes `tokenizer` and `decode`, you can decide if you want to use the usual version of the tokenizers (with the keyword `vanilla` set to `true`). But in some cases, the tokenizer introduces special characters instead of whitespaces, adds a whitespace in front of the string etc. While it is usually the correct way to use the tokenizer (since the models have been trained with these), in some cases, you might want just to get rid of all these additions. We provide a simple way to do so just by setting the keyword `vanilla` to `false` in the routes `tokenizer` and `decode`.
Using the endpoints `/v1/tokenizer` and `/v1/decode`, you can decide if you want to use the usual version of the tokenizers (with the keyword `vanilla` set to `true`). But in some cases, the tokenizer introduces special characters instead of whitespaces, adds a whitespace in front of the string, etc. While this is usually the correct way to use the tokenizer (since the models have been trained with these), in some cases you might just want to get rid of all these additions. We provide a simple way to do so: set the keyword `vanilla` to `false` in the endpoints `/v1/tokenizer` and `/v1/decode`.
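
As an illustration, a request using the happy_vLLM version of the tokenizer might look like the sketch below (the `text` field name is an assumption here; the exact input format is described above):

```bash
# Sketch: "text" is assumed to be the field carrying the string to tokenize
curl 127.0.0.1:5000/v1/tokenizer -d '{"text": "Hey, how are you ? Fine thanks.", "vanilla": false, "with_tokens_str": true}'
```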

For example, if you want to encode and decode the string `Hey, how are you ? Fine thanks.` with the Llama tokenizer, it will create the following tokens (in string form):

@@ -110,4 +110,4 @@ For the happy_vLLM tokenizer:

Note that the "Hey" is not treated the same way, that the whitespaces are directly translated in real whitespaces and there is no initial whitespace.

Note that our modified version of the tokenizer is the one used in the `/metadata_text` endpoint (see [this section](data_manipulation.md#metadata_text-post) for more details). For all other endpoints, the usual tokenizer is used (in particular for the `/generate` and `/generate_stream` routes).
Note that our modified version of the tokenizer is the one used in the `/v1/metadata_text` endpoint (see [this section](data_manipulation.md#v1metadata_text-post) for more details). For all other endpoints, the usual tokenizer is used (in particular for the `/v1/generate` and `/v1/generate_stream` endpoints).
6 changes: 3 additions & 3 deletions docs/index.md
@@ -23,19 +23,19 @@ pip install -e .
Just use the entrypoint `happy-vllm` (see [arguments](arguments.md) for a list of all possible arguments)

```bash
happy-vllm --model path_to_model --host 127.0.0.1 --port 5000
happy-vllm --model path_to_model --host 127.0.0.1 --port 5000 --model-name my_model
```

It will launch the API and you can directly query it for example with

```bash
curl 127.0.0.1:5000/info
curl 127.0.0.1:5000/v1/info
```

To get various information on the application or

```bash
curl 127.0.0.1:5000/generate -d '{"prompt": "Hey,"}'
curl 127.0.0.1:5000/v1/completions -d '{"prompt": "Hey,", "model": "my_model"}'
```

if you want to generate your first LLM response using happy_vLLM. See [endpoints](endpoints/endpoints.md) for more details on all the endpoints provided by happy_vLLM.
9 changes: 0 additions & 9 deletions docs/pros.md
@@ -13,15 +13,6 @@ happy_vLLM add new endpoints useful for the users wich don't need to set up thei

If you would like to see an endpoint added, don't hesitate to open an issue or a PR.

## Already included logits processors, easy to use

happy_vLLM include some logits processors which provide new functionalities to the generation which can simply be accessed via keywords passed to the generation request. Namely:

- The possibility to force the LLM to answer in a set of possible answer. Useful, for example, when you want to use the LLM as a classifier, you can force it to answer only in constrained way to be sure to always have a valid output without any parsing of the response.
- The possibility to force the LLM to answer using a json making the parsing of the answer a piece of cake. The specification of this json are made as simple as possible in order to permit beginner users to use this functionality.

More details on how to use these [here](endpoints/generate.md)

## Swagger

A well documented swagger (the UI being reachable at the `/docs` endpoint) so that users not accustomed to using APIs can quickly get the hang of it and be as autonomous as possible when querying the LLM.