[Frontend] Chat-based Embeddings API (vllm-project#9759)
Signed-off-by: Tyler Michael Smith <[email protected]>
DarkLight1337 authored and tlrmchlsmth committed Nov 23, 2024
1 parent 5d62a6d commit 8675bc1
Showing 21 changed files with 853 additions and 415 deletions.
2 changes: 2 additions & 0 deletions docs/requirements-docs.txt
@@ -13,5 +13,7 @@ torch
py-cpuinfo
transformers
mistral_common >= 1.3.4
aiohttp
starlette
openai # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
partial-json-parser # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -96,7 +96,6 @@ def setup(app):

# Mock out external dependencies here, otherwise the autodoc pages may be blank.
autodoc_mock_imports = [
"aiohttp",
"compressed_tensors",
"cpuinfo",
"cv2",
@@ -143,6 +142,7 @@ def add_line(self, line: str, source: str, *lineno: int) -> None:
"python": ("https://docs.python.org/3", None),
"typing_extensions":
("https://typing-extensions.readthedocs.io/en/latest", None),
"aiohttp": ("https://docs.aiohttp.org/en/stable", None),
"pillow": ("https://pillow.readthedocs.io/en/stable", None),
"numpy": ("https://numpy.org/doc/stable", None),
"torch": ("https://pytorch.org/docs/stable", None),
5 changes: 5 additions & 0 deletions docs/source/dev/pooling_params.rst
@@ -0,0 +1,5 @@
Pooling Parameters
==================

.. autoclass:: vllm.PoolingParams
    :members:
8 changes: 4 additions & 4 deletions docs/source/getting_started/quickstart.rst
@@ -138,10 +138,10 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
A more detailed client example can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`__.

OpenAI Chat API with vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenAI Chat Completions API with vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

vLLM is designed to also support the OpenAI Chat API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.

You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to interact with the model:

@@ -157,7 +157,7 @@ You can use the `create chat completion <https://platform.openai.com/docs/api-re
$ ]
$ }'
Alternatively, you can use the `openai` python package:
Alternatively, you can use the ``openai`` Python package:

.. code-block:: python
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -134,6 +134,7 @@ Documentation
:caption: Developer Documentation

dev/sampling_params
dev/pooling_params
dev/offline_inference/offline_index
dev/engine/engine_index
dev/kernel/paged_attention
54 changes: 51 additions & 3 deletions docs/source/models/vlm.rst
@@ -185,7 +185,7 @@ Below is an example on how to launch the same ``microsoft/Phi-3.5-vision-instruc
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
.. important::
Since OpenAI Vision API is based on `Chat Completions <https://platform.openai.com/docs/api-reference/chat>`_ API,
Since OpenAI Vision API is based on `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_,
a chat template is **required** to launch the API server.

Although Phi-3.5-Vision comes with a chat template, for other models you may have to provide one if the model's tokenizer does not come with it.
@@ -243,6 +243,10 @@ To consume the server, you can use the OpenAI client like in the example below:
A full code example can be found in `examples/openai_api_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_api_client_for_multimodal.py>`_.

.. tip::
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.

.. note::

By default, the timeout for fetching images over HTTP is ``5`` seconds. You can override this by setting the environment variable:
@@ -251,5 +255,49 @@ A full code example can be found in `examples/openai_api_client_for_multimodal.p
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
.. note::
There is no need to format the prompt in the API request since it will be handled by the server.
Chat Embeddings API
^^^^^^^^^^^^^^^^^^^

vLLM's Chat Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
where a list of ``messages`` can be passed instead of a batched ``input``. This enables multi-modal inputs to be passed to embedding models.

.. tip::
The schema of ``messages`` is exactly the same as in the Chat Completions API.

In this example, we will serve the ``TIGER-Lab/VLM2Vec-Full`` model.

.. code-block:: bash

    vllm serve TIGER-Lab/VLM2Vec-Full --task embedding \
      --trust-remote-code --max-model-len 4096

.. important::

Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embedding``
to run this model in embedding mode instead of text generation mode.

Since the OpenAI client does not define this schema, we post a request to the server using the lower-level ``requests`` library:

.. code-block:: python

    import requests

    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

    response = requests.post(
        "http://localhost:8000/v1/embeddings",
        json={
            "model": "TIGER-Lab/VLM2Vec-Full",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": "Represent the given image."},
                ],
            }],
            "encoding_format": "float",
        },
    )
    response.raise_for_status()
    response_json = response.json()
    print("Embedding output:", response_json["data"][0]["embedding"])
55 changes: 44 additions & 11 deletions docs/source/serving/openai_compatible_server.md
@@ -26,13 +26,26 @@ print(completion.choices[0].message)
```

## API Reference
Please see the [OpenAI API Reference](https://platform.openai.com/docs/api-reference) for more information on the API. We support all parameters except:
- Chat: `tools`, and `tool_choice`.
- Completions: `suffix`.

vLLM also provides experimental support for OpenAI Vision API compatible inference. See more details in [Using VLMs](../models/vlm.rst).
We currently support the following OpenAI APIs:

- [Completions API](https://platform.openai.com/docs/api-reference/completions)
  - *Note: `suffix` parameter is not supported.*
- [Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
  - [Vision](https://platform.openai.com/docs/guides/vision)-related parameters are supported; see [Using VLMs](../models/vlm.rst).
    - *Note: `image_url.detail` parameter is not supported.*
  - We also support `audio_url` content type for audio files.
    - Refer to [vllm.entrypoints.chat_utils](https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/chat_utils.py) for the exact schema.
    - *TODO: Support `input_audio` content type as defined [here](https://github.com/openai/openai-python/blob/v1.52.2/src/openai/types/chat/chat_completion_content_part_input_audio_param.py).*
  - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
  - Instead of `input`, you can pass in a list of `messages` (same schema as the Chat Completions API),
    which will be treated as a single prompt to the model according to its chat template.
  - This enables multi-modal inputs to be passed to embedding models; see [Using VLMs](../models/vlm.rst).
    - *Note: You should run `vllm serve` with `--task embedding` to ensure that the model is being run in embedding mode.*
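
As a rough illustration of the Embeddings API item above, the sketch below builds a request body that carries `messages` in place of the usual OpenAI input field. The helper name `build_embedding_request`, the model name, and the `localhost:8000` address in the comment are assumptions made for this example; only the `model`, `messages`, and `encoding_format` fields come from the documentation in this commit.

```python
import json
from typing import Any


def build_embedding_request(model: str, messages: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a JSON body for POST /v1/embeddings that uses `messages`.

    Hypothetical helper for illustration; it is not part of vLLM.
    """
    return {
        "model": model,
        "messages": messages,
        "encoding_format": "float",
    }


# A two-message conversation. Per the docs above, the server is expected to
# render the whole list into a single prompt using the model's chat template.
body = build_embedding_request(
    "my-embedding-model",  # placeholder model name
    [
        {"role": "system", "content": "You are a retrieval assistant."},
        {"role": "user", "content": "Represent this sentence for search."},
    ],
)

print(json.dumps(body, indent=2))

# Sending the request needs a running server (address assumed), for example:
#   requests.post("http://localhost:8000/v1/embeddings", json=body)
```

Building the body separately from the HTTP call keeps the payload easy to inspect before it is sent.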

## Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API.
To use them, you can pass them as extra parameters in the OpenAI client,
or merge them directly into the JSON payload if you are calling the HTTP API yourself.
@@ -49,7 +62,26 @@ completion = client.chat.completions.create(
)
```
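
For the raw HTTP route, merging extra parameters into the JSON payload could look like the sketch below. The helper `with_extra_params` and the model name are hypothetical, and `top_k` is used only as an illustrative vLLM-specific parameter; check `vllm/entrypoints/openai/protocol.py` for the parameters your version actually accepts.

```python
from typing import Any


def with_extra_params(payload: dict[str, Any], **extra: Any) -> dict[str, Any]:
    """Return a copy of `payload` with vLLM-specific parameters merged in.

    Hypothetical helper for illustration; it refuses to silently overwrite
    a field that is already present in the standard payload.
    """
    overlap = payload.keys() & extra.keys()
    if overlap:
        raise ValueError(f"Parameters already set in payload: {sorted(overlap)}")
    return {**payload, **extra}


base_payload = {
    "model": "my-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Extra parameters sit at the top level of the JSON body,
# next to the standard OpenAI fields.
payload = with_extra_params(base_payload, top_k=20)
print(payload)
```

With the `openai` Python client, the same dictionary of extras would instead be passed through the `extra_body` keyword argument of `create`.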

### Extra Parameters for Chat API
### Extra Parameters for Completions API

The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params
```

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-extra-params
:end-before: end-completion-extra-params
```

### Extra Parameters for Chat Completions API

The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
@@ -66,21 +98,22 @@ The following extra parameters are supported:
:end-before: end-chat-completion-extra-params
```

### Extra Parameters for Completions API
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
### Extra Parameters for Embeddings API

The following [pooling parameters (click through to see documentation)](../dev/pooling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params
:start-after: begin-embedding-pooling-params
:end-before: end-embedding-pooling-params
```

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-extra-params
:end-before: end-completion-extra-params
:start-after: begin-embedding-extra-params
:end-before: end-embedding-extra-params
```

## Chat Template
13 changes: 4 additions & 9 deletions tests/entrypoints/openai/test_basic.py
@@ -1,7 +1,6 @@
from http import HTTPStatus
from typing import List

import openai
import pytest
import pytest_asyncio
import requests
@@ -83,10 +82,8 @@ async def client(server):
indirect=True,
)
@pytest.mark.asyncio
async def test_show_version(client: openai.AsyncOpenAI):
    base_url = str(client.base_url)[:-3].strip("/")

    response = requests.get(base_url + "/version")
async def test_show_version(server: RemoteOpenAIServer):
    response = requests.get(server.url_for("version"))
    response.raise_for_status()

    assert response.json() == {"version": VLLM_VERSION}
@@ -102,9 +99,7 @@ async def test_show_version(client: openai.AsyncOpenAI):
indirect=True,
)
@pytest.mark.asyncio
async def test_check_health(client: openai.AsyncOpenAI):
    base_url = str(client.base_url)[:-3].strip("/")

    response = requests.get(base_url + "/health")
async def test_check_health(server: RemoteOpenAIServer):
    response = requests.get(server.url_for("health"))

    assert response.status_code == HTTPStatus.OK
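
The two test changes above replace string slicing on the client's base URL with a call to `server.url_for(...)`. The sketch below contrasts the two approaches. `FakeServer` is a hypothetical stand-in written for this example, not vLLM's `RemoteOpenAIServer`, and the slicing variant assumes the client's base URL ends in `v1/`.

```python
class FakeServer:
    """Hypothetical stand-in for the test fixture (not vLLM's actual class)."""

    def __init__(self, host: str, port: int) -> None:
        self.host = host
        self.port = port

    @property
    def url_root(self) -> str:
        return f"http://{self.host}:{self.port}"

    def url_for(self, *parts: str) -> str:
        # Join path segments onto the server root.
        return "/".join([self.url_root, *parts])


def root_from_client_base_url(base_url: str) -> str:
    # Approach from the removed lines: drop the trailing "v1/" (three
    # characters), then strip any slash left at either end.
    return base_url[:-3].strip("/")


server = FakeServer("localhost", 8000)
old_style = root_from_client_base_url("http://localhost:8000/v1/") + "/version"
new_style = server.url_for("version")
print(old_style, new_style)
```

Both produce the same URL here, but the slicing version depends on the exact suffix of the base URL, whereas a helper on the server object keeps the URL construction in one place.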