[Frontend] Separate pooling APIs in offline inference #11129
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
@@ -120,7 +121,7 @@ class LLM:
     serving, use the :class:`~vllm.AsyncLLMEngine` class instead.
     """

-    DEPRECATE_LEGACY: ClassVar[bool] = False
+    DEPRECATE_LEGACY: ClassVar[bool] = True
I forgot to turn this on after adding the `deprecated` decorator a while back.
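To illustrate the pattern being discussed, here is a minimal sketch of a deprecation decorator gated by a `DEPRECATE_LEGACY` class variable. This is not the actual vLLM implementation; the decorator name, the toy `LLM` class, and the warning message are assumptions made for the example.

```python
import warnings
from functools import wraps
from typing import Callable, ClassVar


def deprecate_legacy(fn: Callable) -> Callable:
    """Warn about legacy usage, but only when the owning class opts in."""

    @wraps(fn)
    def inner(self, *args, **kwargs):
        # The ClassVar acts as a kill switch: flipping it to True
        # activates the deprecation warnings across the whole class.
        if getattr(type(self), "DEPRECATE_LEGACY", False):
            warnings.warn(
                f"{fn.__name__} was called with a legacy calling convention",
                DeprecationWarning,
                stacklevel=2,
            )
        return fn(self, *args, **kwargs)

    return inner


class LLM:
    """Toy stand-in for the real class; only the gating logic matters."""

    DEPRECATE_LEGACY: ClassVar[bool] = True

    @deprecate_legacy
    def generate(self, prompts):
        return list(prompts)
```

With the flag set to `False` (as it was before this change), the decorator is a no-op, which is why the warnings never fired.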
""" | ||
embedding: List[float] | ||
data: torch.Tensor |
We now return a tensor directly. It is up to the task-specific output classes (e.g. `EmbeddingOutput`) to parse the tensor into a format that is convenient for the user.
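A rough sketch of that division of responsibility, using plain Python stand-ins rather than the real vLLM classes (a flat sequence of floats stands in for the 1-D `torch.Tensor` so the example has no torch dependency; the `from_base` constructor name is an assumption):

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PoolingOutput:
    # In vLLM this field is a torch.Tensor; a flat sequence of floats
    # stands in for a 1-D tensor in this sketch.
    data: Sequence[float]


@dataclass
class EmbeddingOutput:
    embedding: List[float]

    @staticmethod
    def from_base(pooling_output: PoolingOutput) -> "EmbeddingOutput":
        # Mirrors calling .tolist() on a 1-D tensor: the raw pooler data
        # becomes a plain list of floats that is convenient for the user.
        return EmbeddingOutput(embedding=[float(x) for x in pooling_output.data])
```

The raw class stays format-agnostic, and each task-specific class decides how to interpret the tensor.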
Would it make sense to return a list of tensors here, in case the output of the model is multiple tensors? Or do we assume that in such a case the model stacks any output tensors into one before returning them?
This type annotation can be freely changed without affecting the internals (originally I wanted to be even more abstract and annotate this as type `object`). For now, all models return a single tensor, so that's what I put here.
if model_output and isinstance(model_output[0], SamplerOutput) and (
        model_output[0].spec_decode_worker_metrics is not None):
Since spec decode isn't applicable to pooling models, I have removed `spec_decode_worker_metrics` from `PoolerOutput`. The type annotation stating that `model_output` is a list of `SamplerOutput`s is actually incorrect here (it can be a list of `PoolerOutput`s), but I'm not bothered to fix it since we will probably rework this in V1 anyway.
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    guided_options_request: Optional[Union[LLMGuidedOptions,
                                           GuidedDecodingRequest]] = None,
Not sure why these arguments were missing from the overloads in the first place...
Overall LGTM. Thanks for cleaning up the `Pooler` layer! Just a comment about the `score` API example.
@maxdebayser I'm thinking of updating the score API so that a single scalar per pair is returned instead of a list. WDYT?
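For illustration, the proposed change amounts to collapsing each per-pair single-element score list to a plain scalar. This is a hypothetical helper sketching the idea, not code from the PR; the function name and input shape are assumptions.

```python
from typing import List, Sequence


def collapse_pair_scores(raw_scores: Sequence[Sequence[float]]) -> List[float]:
    """Collapse one single-element score list per (query, document) pair
    into a plain scalar per pair."""
    out: List[float] = []
    for per_pair in raw_scores:
        # A cross-encoder emits exactly one score per pair; unpacking
        # raises if that assumption is ever violated.
        (score,) = per_pair
        out.append(float(score))
    return out
```

Callers would then read `scores[i]` directly instead of `scores[i][0]`.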
I like the stronger typing that this PR introduces. It looks good to me.
This PR builds on #10820, splitting up the `LLM.encode()` API according to the model's task. This avoids confusion about the data format when accessing the pooler output. The pooling-related APIs are:

- `LLM.encode()`: Returns the raw tensor, instead of a nested list of floats
- `LLM.embed()`: Converts the 1-D embedding tensor into a list of floats
- `LLM.classify()`: Converts the 1-D probability tensor into a list of floats
- `LLM.score()`: Returns scalar floats, instead of single-element lists of floats

Furthermore, this PR adds a layer of abstraction to the `Pooler` layer inside the model so that it is easy to override the type of output data. #11065 may be implemented based on this.

cc @maxdebayser @flaviabeo
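The relationship between the APIs can be sketched with plain-Python stand-ins (these are not the real vLLM classes; a flat sequence of floats stands in for the raw `torch.Tensor`, and the helper names are assumptions made for the example):

```python
from typing import List, Sequence


def embed_from_raw(raw: Sequence[float]) -> List[float]:
    # LLM.embed(): 1-D embedding tensor -> list of floats
    return [float(x) for x in raw]


def classify_from_raw(raw: Sequence[float]) -> List[float]:
    # LLM.classify(): 1-D probability tensor -> list of floats
    return [float(x) for x in raw]


def score_from_raw(raw: Sequence[float]) -> float:
    # LLM.score(): single-element tensor -> scalar float,
    # instead of a single-element list
    (value,) = raw
    return float(value)
```

Each task-specific API is thus a thin, well-typed view over the same raw pooler output that `LLM.encode()` exposes directly.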