[Core] generate from input embeds #6869
Conversation
@WoosukKwon @ywang96 @robertgshaw2-neuralmagic there are some failed tests, but other than that, this is ready for review.
This is due to a recent change in transformers that deprecates the default chat template, and it should have been fixed by #7238. Can you merge your branch with main again?
The branch was force-pushed from 3f79009 to dfd9301.
@DarkLight1337 since #10374 has landed, is this PR going to be updated so that this work can be finished, or is there a new PR you can reference for tracking?
Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag.
Thanks for the reference, @DarkLight1337. Is it also possible to add this to the roadmap for additional visibility?
Hey @CandiedCode! Thanks for following up on this. IMO, supporting embeddings as input is not technically difficult in itself, but we do want to be careful with the design so that it works with all the other features we want to natively support in vLLM, especially now that we're going through the re-architecture. I have briefly discussed it with @WoosukKwon in #11032 (comment). In particular, some issues we still need to design for and address:
I am using the code for inference, and it raises an exception as below:
How can I solve this?
Please revert to the above commit. The current vLLM is not compatible with input embeds yet.
Is there a version of this that currently works for input_embed + prompt for a text-only model?
No, can you elaborate on the use case of mixing these two? vLLM already does prefix caching, so this shouldn't be necessary.
I'm using an embedding that comes externally as context that I need to prepend to the text. So I have a context vector the size of the hidden dimension of the LLM, and a prompt text. I have a patched vLLM where I pass down an input_embeds array to the model (I'm using llama.py), and then I add the prefix in the forward pass, roughly as in the sketch below.
So I was wondering if there is a way of doing this without such a patch.
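A minimal, hypothetical sketch of the idea being described, assuming a plain PyTorch embedding layer and made-up names rather than vLLM's actual llama.py internals:

```python
import torch

def prepend_context(embed_tokens: torch.nn.Embedding,
                    input_ids: torch.Tensor,
                    context_vector: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: prepend one external context embedding
    (shape [hidden_size]) to the prompt's token embeddings before the
    decoder layers run. Not vLLM's actual llama.py code."""
    inputs_embeds = embed_tokens(input_ids)                        # [seq_len, hidden_size]
    prefix = context_vector.to(inputs_embeds.dtype).unsqueeze(0)   # [1, hidden_size]
    return torch.cat([prefix, inputs_embeds], dim=0)               # [seq_len + 1, hidden_size]
```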
I think you can use prompt adapter request (#4645) for the prepended embeddings. It's going to be deprecated in V1 though. I suggest that you simply pass the context alongside the prompt in each request. There should be little overhead for doing so when prefix caching is enabled.
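If the shared context can be expressed as text rather than a raw embedding, the suggestion above might look roughly like this sketch (model name and prompts are placeholders; enable_prefix_caching lets vLLM reuse the KV cache for the shared prefix):

```python
from vllm import LLM, SamplingParams

# Hypothetical sketch: prepend the shared context as text to every request
# and rely on automatic prefix caching to reuse its KV cache across requests.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)

shared_context = "<shared context rendered as text>\n\n"
prompts = [shared_context + q for q in ["First question?", "Second question?"]]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```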
Hi @ywang96 and @DarkLight1337 Last time you mentioned that the following still needed to be designed and addressed:
I wanted to check whether there has been any progress made on these efforts. Thanks for all that you do! ~CandiedCode
To avoid complicating the development of V1, we haven't really touched this yet. We'll consider this once it is more stable.
Any progress on this?
Co-authored-by: Nan2018 <[email protected]> Signed-off-by: Andrew Sansom <[email protected]>
Closing in favor of #15428
Hi @ywang96 and @DarkLight1337 |
@Nan2018 and I are writing up an RFC for V1 support. We're hoping to open it this week. We'll have a draft PR in the next few weeks.
Adds support for passing prompt_embeds to LLM.generate (see the usage sketch below).
This enables use cases where only the embedding layer is fine-tuned, so the same model backend can serve multiple custom-tuned embedding layers.
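A rough sketch of the intended usage, with the exact request format assumed rather than taken from this PR; the prompt embeddings are computed here with Hugging Face transformers using a (possibly custom-tuned) embedding layer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder model name

# Compute prompt embeddings with a (possibly fine-tuned) embedding layer.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
hf_model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
token_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_embeds = hf_model.get_input_embeddings()(token_ids).squeeze(0)

# Pass the embeddings instead of text. The {"prompt_embeds": ...} request
# shape is an assumption for illustration, not a confirmed signature.
llm = LLM(model=MODEL)
outputs = llm.generate({"prompt_embeds": prompt_embeds},
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```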
FIX #416
FIX #8323
Inspired by #1265, which is very outdated.