
[Frontend] Separate pooling APIs in offline inference #11129

Merged
merged 19 commits into from
Dec 13, 2024

Conversation

DarkLight1337
Member

@DarkLight1337 DarkLight1337 commented Dec 12, 2024

This PR builds on #10820, splitting up the LLM.encode() API according to the model's task. This avoids confusion about the data format when accessing the pooler output. The pooling-related APIs are:

  • [UPDATED] LLM.encode(): Returns the raw tensor instead of a nested list of floats
  • [NEW] LLM.embed(): Converts the 1-D embedding tensor into a list of floats
  • [NEW] LLM.classify(): Converts the 1-D probability tensor into a list of floats
  • [UPDATED] LLM.score(): Returns a scalar float instead of a single-element list of floats
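
The per-task conversions above can be sketched as follows (a minimal sketch, not this PR's actual code: plain Python lists stand in for torch tensors, and the function names are hypothetical):

```python
# Hypothetical sketch of the output conversion each pooling API performs.
# Plain lists stand in for torch tensors so the sketch is self-contained.

def encode_output(raw):
    """LLM.encode(): return the raw tensor (here: the list) unchanged."""
    return raw

def embed_output(raw):
    """LLM.embed(): convert a 1-D embedding tensor into a list of floats."""
    return [float(x) for x in raw]

def classify_output(raw):
    """LLM.classify(): convert a 1-D probability tensor into a list of floats."""
    return [float(x) for x in raw]

def score_output(raw):
    """LLM.score(): a single-element tensor becomes a scalar float."""
    assert len(raw) == 1
    return float(raw[0])
```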

Furthermore, this PR adds a layer of abstraction to the Pooler layer inside the model so that it is easy to override the type of output data. #11065 may be implemented based on this.
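
A minimal sketch of what such an override point might look like (the class and method names below are hypothetical illustrations, not this PR's actual Pooler code; lists stand in for tensors):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoolingOutput:
    # The raw pooled output; mocked here as a list of floats instead of a tensor.
    data: List[float]


class Pooler:
    """Base pooler: returns the raw pooled data unchanged."""

    def build_output(self, pooled: List[float]) -> PoolingOutput:
        return PoolingOutput(data=pooled)


class NormalizingPooler(Pooler):
    """Subclass overriding the output format, e.g. L2-normalizing embeddings."""

    def build_output(self, pooled: List[float]) -> PoolingOutput:
        norm = sum(x * x for x in pooled) ** 0.5 or 1.0
        return PoolingOutput(data=[x / norm for x in pooled])
```

A model could then swap in a subclass to change the output data format without touching the surrounding engine code.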

cc @maxdebayser @flaviabeo


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added documentation Improvements or additions to documentation frontend labels Dec 12, 2024
Signed-off-by: DarkLight1337 <[email protected]>
@@ -120,7 +121,7 @@ class LLM:
serving, use the :class:`~vllm.AsyncLLMEngine` class instead.
"""

-    DEPRECATE_LEGACY: ClassVar[bool] = False
+    DEPRECATE_LEGACY: ClassVar[bool] = True
Member Author


I forgot to turn this on after adding the deprecated decorator a while back.

@DarkLight1337 DarkLight1337 changed the title [Frontend] Separate pooling APIs [Frontend] Separate pooling APIs in offline inference Dec 12, 2024
"""
-    embedding: List[float]
+    data: torch.Tensor
Member Author


We now return a tensor directly. It is up to the task-specific output classes (e.g. EmbeddingOutput) to parse the tensor into a format that is convenient to the user.
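
A hedged sketch of this split (mocked with lists instead of tensors; the field and method names below are assumptions for illustration, not the PR's actual definitions):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoolingOutput:
    # Raw pooler output; a list of floats stands in for torch.Tensor here.
    data: List[float]


@dataclass
class EmbeddingOutput:
    embedding: List[float]

    @staticmethod
    def from_base(pooling_output: PoolingOutput) -> "EmbeddingOutput":
        # The task-specific class parses the raw tensor into a convenient format.
        return EmbeddingOutput(
            embedding=[float(x) for x in pooling_output.data])
```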


Would it make sense to return a list of tensors here, in case the output of the model is multiple tensors?
Or do we assume that in such a case the model stacks any output tensors into one before returning them?

Member Author

@DarkLight1337 DarkLight1337 Dec 13, 2024


This type annotation can be freely changed without affecting the internals (originally I wanted to be even more abstract and annotate this as type object). For now, all models return a single tensor, so that's what I put here.

Signed-off-by: DarkLight1337 <[email protected]>
Comment on lines +1786 to +1787
if model_output and isinstance(model_output[0], SamplerOutput) and (
model_output[0].spec_decode_worker_metrics is not None):
Member Author

@DarkLight1337 DarkLight1337 Dec 12, 2024


Since spec decode isn't applicable to pooling models, I have removed spec_decode_worker_metrics from PoolerOutput. The type annotation saying that model_output is a list of SamplerOutputs is actually incorrect here (it can also be a list of PoolerOutput), but I haven't bothered to fix it since we will probably rework this in V1 anyway.

Comment on lines +266 to +268
prompt_adapter_request: Optional[PromptAdapterRequest] = None,
guided_options_request: Optional[Union[LLMGuidedOptions,
GuidedDecodingRequest]] = None,
Member Author


Not sure why these arguments were missing from the overloads in the first place...

Collaborator

@Isotr0py Isotr0py left a comment


Overall LGTM. Thanks for cleaning up the Pooler layer!

Just a comment about score API example.

docs/source/models/pooling_models.rst
@mergify mergify bot added the ci/build label Dec 12, 2024
@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 12, 2024
Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337
Member Author

DarkLight1337 commented Dec 12, 2024

@maxdebayser I'm thinking of updating the score API so that a single scalar per pair is returned instead of a list. WDYT?

Signed-off-by: DarkLight1337 <[email protected]>
Contributor

@maxdebayser maxdebayser left a comment


I like the stronger typing that this PR introduces. It looks good to me.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) December 13, 2024 04:11
Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 DarkLight1337 merged commit eeec9e3 into vllm-project:main Dec 13, 2024
76 checks passed
@DarkLight1337 DarkLight1337 deleted the separate-pooling-apis branch December 13, 2024 10:40
Labels: ci/build, documentation, frontend, ready
4 participants