diff --git a/docs/source/usage/compatibility_matrix.rst b/docs/source/usage/compatibility_matrix.rst
index a93632ff36fb8..79ca27fb694eb 100644
--- a/docs/source/usage/compatibility_matrix.rst
+++ b/docs/source/usage/compatibility_matrix.rst
@@ -39,7 +39,7 @@ Feature x Feature
      - :abbr:`prmpt adptr (Prompt Adapter)`
      - :ref:`SD `
      - CUDA graph
-     - :abbr:`emd (Embedding Models)`
+     - :abbr:`pooling (Pooling Models)`
      - :abbr:`enc-dec (Encoder-Decoder Models)`
      - :abbr:`logP (Logprobs)`
      - :abbr:`prmpt logP (Prompt Logprobs)`
@@ -151,7 +151,7 @@ Feature x Feature
      -
      -
      -
-   * - :abbr:`emd (Embedding Models)`
+   * - :abbr:`pooling (Pooling Models)`
      - ✗
      - ✗
      - ✗
@@ -386,7 +386,7 @@ Feature x Hardware
      - ✅
      - ✗
      - ✅
-   * - :abbr:`emd (Embedding Models)`
+   * - :abbr:`pooling (Pooling Models)`
      - ✅
      - ✅
      - ✅
diff --git a/docs/source/usage/pooling_models.rst b/docs/source/usage/pooling_models.rst
index a2554d1b0eada..5f19dfcaa3751 100644
--- a/docs/source/usage/pooling_models.rst
+++ b/docs/source/usage/pooling_models.rst
@@ -3,7 +3,7 @@
 Using Pooling Models
 ====================
 
-vLLM provides second-class support for pooling models, including embedding, reranking and reward models.
+vLLM also supports pooling models, including embedding, reranking and reward models.
 
 In vLLM, pooling models implement the :class:`~vllm.model_executor.models.VllmModelForPooling` interface.
 These models use a :class:`~vllm.model_executor.layers.Pooler` to aggregate the final hidden states of the input
@@ -13,6 +13,10 @@ Technically, any :ref:`generative model ` in vLLM can be conv
 by aggregating and returning the hidden states directly, skipping the generation step.
 Nevertheless, you should use those that are specifically trained as pooling models.
 
+We currently support pooling models primarily as a matter of convenience.
+As shown in the :ref:`Compatibility Matrix `, most vLLM features are not applicable to
+pooling models because those features only work on the generation or decode stage, so pooling models may see smaller performance gains than generative models.
+
 Offline Inference
 -----------------