[Frontend] Separate pooling APIs in offline inference #11129
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
@@ -120,7 +121,7 @@ class LLM:
     serving, use the :class:`~vllm.AsyncLLMEngine` class instead.
     """

-    DEPRECATE_LEGACY: ClassVar[bool] = False
+    DEPRECATE_LEGACY: ClassVar[bool] = True
I forgot to turn this on after adding the `deprecated` decorator a while back.
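To illustrate the pattern being discussed, here is a minimal sketch of a deprecation decorator gated by a `DEPRECATE_LEGACY` class variable. This is not the actual vLLM implementation; the decorator name, the toy `LLM` class, and the warning message are assumptions made for the example.

```python
import warnings
from functools import wraps
from typing import Callable, ClassVar


def deprecate_legacy(fn: Callable) -> Callable:
    """Warn about legacy usage, but only when the owning class opts in."""

    @wraps(fn)
    def inner(self, *args, **kwargs):
        # The ClassVar acts as a kill switch: flipping it to True
        # activates the deprecation warnings across the whole class.
        if getattr(type(self), "DEPRECATE_LEGACY", False):
            warnings.warn(
                f"{fn.__name__} was called with a legacy calling convention",
                DeprecationWarning,
                stacklevel=2,
            )
        return fn(self, *args, **kwargs)

    return inner


class LLM:
    """Toy stand-in for the real class; only the gating logic matters."""

    DEPRECATE_LEGACY: ClassVar[bool] = True

    @deprecate_legacy
    def generate(self, prompts):
        return list(prompts)
```

With the flag set to `False` (as it was before this change), the decorator is a no-op, which is why the warnings never fired.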
""" | ||
embedding: List[float] | ||
data: torch.Tensor |
We now return a tensor directly. It is up to the task-specific output classes (e.g. `EmbeddingOutput`) to parse the tensor into a format that is convenient for the user.
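A rough sketch of that division of responsibility, using plain Python stand-ins rather than the real vLLM classes (a flat sequence of floats stands in for the 1-D `torch.Tensor` so the example has no torch dependency; the `from_base` constructor name is an assumption):

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PoolingOutput:
    # In vLLM this field is a torch.Tensor; a flat sequence of floats
    # stands in for a 1-D tensor in this sketch.
    data: Sequence[float]


@dataclass
class EmbeddingOutput:
    embedding: List[float]

    @staticmethod
    def from_base(pooling_output: PoolingOutput) -> "EmbeddingOutput":
        # Mirrors calling .tolist() on a 1-D tensor: the raw pooler data
        # becomes a plain list of floats that is convenient for the user.
        return EmbeddingOutput(embedding=[float(x) for x in pooling_output.data])
```

The raw class stays format-agnostic, and each task-specific class decides how to interpret the tensor.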
Would it make sense to return a list of tensors here, in case the output of the model is multiple tensors? Or do we assume that in such a case the model stacks any output tensors into one before returning them?
This type annotation can be freely changed without affecting the internals (originally I wanted to be even more abstract and annotate this as type `object`). For now, all models return a single tensor, so that's what I put here.
if model_output and isinstance(model_output[0], SamplerOutput) and (
        model_output[0].spec_decode_worker_metrics is not None):
Since spec decode isn't applicable to pooling models, I have removed `spec_decode_worker_metrics` from `PoolerOutput`. The type annotation stating that `model_output` is a list of `SamplerOutput`s is actually incorrect here (it can be a list of `PoolerOutput`s), but I'm not bothered to fix it since we will probably rework this in V1 anyway.
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    guided_options_request: Optional[Union[LLMGuidedOptions,
                                           GuidedDecodingRequest]] = None,
Not sure why these arguments were missing from the overloads in the first place...
Overall LGTM. Thanks for cleaning up the `Pooler` layer! Just a comment about the `score` API example.
@maxdebayser I'm thinking of updating the score API so that a single scalar per pair is returned instead of a list. WDYT?
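For illustration, the proposed change amounts to collapsing each per-pair single-element score list to a plain scalar. This is a hypothetical helper sketching the idea, not code from the PR; the function name and input shape are assumptions.

```python
from typing import List, Sequence


def collapse_pair_scores(raw_scores: Sequence[Sequence[float]]) -> List[float]:
    """Collapse one single-element score list per (query, document) pair
    into a plain scalar per pair."""
    out: List[float] = []
    for per_pair in raw_scores:
        # A cross-encoder emits exactly one score per pair; unpacking
        # raises if that assumption is ever violated.
        (score,) = per_pair
        out.append(float(score))
    return out
```

Callers would then read `scores[i]` directly instead of `scores[i][0]`.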
I like the stronger typing that this PR introduces. It looks good to me.
This PR builds on #10820, splitting up the `LLM.encode()` API according to the model's task. This avoids confusion about the data format when accessing the pooler output. The pooling-related APIs are:

- `LLM.encode()`: Returns the raw tensor, instead of a nested list of floats
- `LLM.embed()`: Converts the 1-D embedding tensor into a list of floats
- `LLM.classify()`: Converts the 1-D probability tensor into a list of floats
- `LLM.score()`: Returns scalar floats, instead of single-element lists of floats

Furthermore, this PR adds a layer of abstraction to the `Pooler` layer inside the model so that it is easy to override the type of output data. #11065 may be implemented based on this.

cc @maxdebayser @flaviabeo
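The relationship between the APIs can be sketched with plain-Python stand-ins (these are not the real vLLM classes; a flat sequence of floats stands in for the raw `torch.Tensor`, and the helper names are assumptions made for the example):

```python
from typing import List, Sequence


def embed_from_raw(raw: Sequence[float]) -> List[float]:
    # LLM.embed(): 1-D embedding tensor -> list of floats
    return [float(x) for x in raw]


def classify_from_raw(raw: Sequence[float]) -> List[float]:
    # LLM.classify(): 1-D probability tensor -> list of floats
    return [float(x) for x in raw]


def score_from_raw(raw: Sequence[float]) -> float:
    # LLM.score(): single-element tensor -> scalar float,
    # instead of a single-element list
    (value,) = raw
    return float(value)
```

Each task-specific API is thus a thin, well-typed view over the same raw pooler output that `LLM.encode()` exposes directly.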