
Conversation

@Nan2018 (Contributor) commented Jul 27, 2024

Adds support for passing prompt_embeds to LLM.generate as

llm.generate({"prompt_embeds": input_embeds}, sampling_params)

or

llm.generate(
    [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)

This enables use cases where only the embedding layer is fine-tuned, and lets the same model backend support multiple custom-tuned embedding layers.

FIX #416
FIX #8323

Inspired by #1265, which is now very outdated.
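
For reference, a minimal end-to-end sketch of the intended usage. The model name, the fine-tuned embedding weights, and the tensor shapes are illustrative assumptions rather than part of this PR; the only new API it relies on is the prompt_embeds input proposed above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from vllm import LLM, SamplingParams

# Illustrative model choice; any decoder-only HF model works the same way.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical fine-tuned embedding weights, swapped in place of the stock embedding layer:
# hf_model.get_input_embeddings().weight.data.copy_(torch.load("finetuned_embed_tokens.pt"))

prompt = "Hello, my name is"
token_ids = tokenizer(prompt, return_tensors="pt").input_ids
# One embedding row per prompt token: shape (seq_len, hidden_size).
with torch.no_grad():
    input_embeds = hf_model.get_input_embeddings()(token_ids).squeeze(0)

llm = LLM(model=model_name)
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate({"prompt_embeds": input_embeds}, sampling_params)
print(outputs[0].outputs[0].text)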

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fastcheck build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@Nan2018 (Contributor, Author) commented Aug 8, 2024

@WoosukKwon @ywang96 @robertgshaw2-neuralmagic

The failed tests with "ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating" seem unrelated to my changes, and I can't reproduce them locally.

Other than that, this is ready for review.

@ywang96 (Member) commented Aug 8, 2024

@WoosukKwon @ywang96 @robertgshaw2-neuralmagic

The failed tests with "ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating" seem unrelated to my changes, and I can't reproduce them locally.

Other than that, this is ready for review.

This is due to a recent change in transformers that deprecates the default chat template; it should have been fixed by #7238. Can you merge your branch with main again?

@Nan2018 Nan2018 force-pushed the feature-input-embeds branch from 3f79009 to dfd9301 Compare September 4, 2024 22:27
@mergify mergify bot added the needs-rebase label Dec 11, 2024
@CandiedCode

@DarkLight1337 Since #10374 has landed, is this PR going to be updated so that this work can be finished, or is there a new PR you can reference for tracking?

@DarkLight1337 (Member) commented Dec 19, 2024

@DarkLight1337 Since #10374 has landed, is this PR going to be updated so that this work can be finished, or is there a new PR you can reference for tracking?

Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag.

@CandiedCode commented Dec 19, 2024

Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag.

Thanks for the reference, @DarkLight1337. Would it also be possible to add this to the roadmap for additional visibility?

@ywang96 (Member) commented Dec 19, 2024

Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag.

Thanks for the reference, @DarkLight1337. Would it also be possible to add this to the roadmap for additional visibility?

Hey @CandiedCode! Thanks for following up on this.

IMO, supporting embeddings as input is not technically difficult in itself, but we do want to be careful with the design so that it works with all the other features we want to natively support in vLLM, especially now that we're going through the re-architecture. I briefly discussed this with @WoosukKwon in #11032 (comment).

In particular, some issues we still need to design for and address:

  • What happens if a batch contains both token IDs and embeddings as input?
  • Prefix caching (currently we use token IDs as the hash key; sketched below)
  • Speculative decoding (we assume draft models output token IDs to be accepted by the main model)
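
To make the prefix-caching point concrete, here is a purely illustrative sketch, not vLLM's actual implementation: with token-ID inputs, a cache block can be keyed by its token IDs, but embedding-only inputs have no token IDs, so one hypothetical (and much more fragile) option is to hash the raw embedding bytes instead.

import hashlib

import torch


def hash_block_from_token_ids(token_ids: list, parent_hash: int = 0) -> int:
    # Illustrative only: key a prompt block by its token IDs plus the parent block's hash.
    return hash((parent_hash, tuple(token_ids)))


def hash_block_from_embeds(block_embeds: torch.Tensor, parent_hash: int = 0) -> int:
    # Hypothetical alternative for embedding inputs: there are no token IDs to key on,
    # so hash the raw embedding bytes. This only matches bit-identical embeddings and is
    # sensitive to dtype, device transfers, and tiny numerical differences.
    digest = hashlib.sha256(block_embeds.cpu().contiguous().numpy().tobytes()).hexdigest()
    return hash((parent_hash, digest))


# The same block of token IDs always hashes identically, but only bit-identical
# embeddings produce a cache hit under content hashing.
ids = list(range(16))
print(hash_block_from_token_ids(ids) == hash_block_from_token_ids(ids))        # True
emb = torch.randn(16, 896)
print(hash_block_from_embeds(emb) == hash_block_from_embeds(emb.clone()))      # True
print(hash_block_from_embeds(emb) == hash_block_from_embeds(emb + 1e-3))       # False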

@HaiFengZeng

Using the code for inference, with prompt_embeds.size() = torch.Size([85, 896]):

for tk in llm.generate({'prompt_embeds': prompt_input}, sample_params):
    print(tk)

It raises the exception below:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/AIOT-vePFS/speech/workspace/zenghaifeng/sllm/CosyVoice2-0.5B/vllm-test.py", line 17, in <module>
[rank0]:     for tk in llm.generate({'prompt_embeds':prompt_input},sample_params):
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 1063, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 396, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 931, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1415, in step
[rank0]:     ) = self.scheduler[virtual_engine].schedule()
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1219, in schedule
[rank0]:     scheduler_outputs: SchedulerOutputs = self._schedule()
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1178, in _schedule
[rank0]:     return self._schedule_default()
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1013, in _schedule_default
[rank0]:     prefills = self._schedule_prefills(budget,
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/core/scheduler.py", line 885, in _schedule_prefills
[rank0]:     num_new_tokens = self._get_num_new_tokens(seq_group,
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1605, in _get_num_new_tokens
[rank0]:     assert num_new_tokens > 0
[rank0]: AssertionError

How can I solve this?

@DarkLight1337 (Member)

I think the latest working commit is 49fe3f7

Please revert to the above commit. The current vLLM is not compatible with input embeds yet.

@serser commented Jan 2, 2025

It seems the length of a prompt_embeds sequence is always 0 due to get_len, which leaves the scheduler with nothing to schedule, hence the deadlock here. Am I right?

@DarkLight1337 (Member)

It seems the length of a prompt_embeds sequence is always 0 due to get_len, which leaves the scheduler with nothing to schedule, hence the deadlock here. Am I right?

Yeah, that is the problem.
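
To make the failure concrete, here is a simplified, hypothetical sketch (not vLLM's actual Sequence code) of why a length accessor that only counts token IDs reports 0 for an embeddings-only prompt and starves the scheduler, plus the obvious fix of counting embedding rows instead:

from dataclasses import dataclass, field
from typing import Optional

import torch


@dataclass
class FakeSequenceData:
    # Hypothetical stand-in for vLLM's sequence data; for illustration only.
    prompt_token_ids: list = field(default_factory=list)
    prompt_embeds: Optional[torch.Tensor] = None  # shape: (seq_len, hidden_size)
    output_token_ids: list = field(default_factory=list)

    def get_len_buggy(self) -> int:
        # Counts only token IDs: an embeddings-only prompt reports length 0, so the
        # scheduler has no prefill tokens to schedule and "assert num_new_tokens > 0" fires.
        return len(self.prompt_token_ids) + len(self.output_token_ids)

    def get_len_fixed(self) -> int:
        # Fall back to the number of embedding rows when there are no prompt token IDs.
        if self.prompt_token_ids:
            prompt_len = len(self.prompt_token_ids)
        elif self.prompt_embeds is not None:
            prompt_len = self.prompt_embeds.shape[0]
        else:
            prompt_len = 0
        return prompt_len + len(self.output_token_ids)


seq = FakeSequenceData(prompt_embeds=torch.zeros(85, 896))  # same shape as reported above
print(seq.get_len_buggy())  # 0 -> nothing to schedule
print(seq.get_len_fixed())  # 85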

@solume commented Feb 5, 2025

Is there a version of this that currently works for input_embeds + prompt for a text-only model?

@DarkLight1337 (Member)

Is there a version of this that currently works for input_embeds + prompt for a text-only model?

No. Can you elaborate on the use case for mixing these two? vLLM already does prefix caching, so this shouldn't be necessary.

@solume commented Feb 5, 2025

I'm using an embedding that comes from an external source as context, which I need to prepend to the text. So I have a context vector with the size of the LLM's hidden dimension, and a prompt text. I have a patched vLLM where I pass an inputs_embeds tensor down to the model (I'm using llama.py), and then I add the prefix in the forward pass like this:


    def forward(
        ...
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        ...
        if inputs_embeds is not None and input_ids is not None:
            # Prepend the external context embeddings to the embedded prompt tokens.
            text_embeds = self.get_input_embeddings(input_ids)
            inputs_embeds = torch.cat([inputs_embeds, text_embeds], dim=1)
            input_ids = None
        elif inputs_embeds is None:
            # No external context: embed the prompt tokens as usual.
            inputs_embeds = self.get_input_embeddings(input_ids)
            input_ids = None

        model_output = self.model(
            input_ids,
            positions,
            kv_caches,
            attn_metadata,
            intermediate_tensors,
            inputs_embeds,
        )

So I was wondering if there is a way of doing this without such a patch.

@DarkLight1337 (Member) commented Feb 6, 2025

I think you can use a prompt adapter request (#4645) for the prepended embeddings. It's going to be deprecated in V1, though.

I suggest that you simply pass the context alongside the prompt in each request. There should be little overhead for doing so when prefix caching is enabled.
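
For the second suggestion, a minimal sketch of what that could look like with stock vLLM, assuming the external context can be rendered as text; the model name and prompts are placeholders. With enable_prefix_caching=True, the shared context prefix is computed once and its KV cache is reused across requests.

from vllm import LLM, SamplingParams

# Placeholder model; enable_prefix_caching lets requests sharing a prefix reuse its KV cache.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# The external context, expressed as text and prepended to every request.
shared_context = "Background: <the external context rendered as text>\n\n"
questions = ["Summarize the context.", "List the key entities mentioned."]
prompts = [shared_context + "Question: " + q for q in questions]

outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)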

@CandiedCode commented Feb 10, 2025

Hi @ywang96 and @DarkLight1337

Last time, you mentioned that the following still needed to be designed and addressed:

  • What happens if a batch contains both token IDs and embeddings as input?
  • Prefix caching (currently we use token IDs as the hash key)
  • Speculative decoding (we assume draft models output token IDs to be accepted by the main model)

I wanted to check whether there has been any progress on these efforts.

Thanks for all that you guys do!

~CandiedCode

@DarkLight1337 (Member)

Hi @ywang96 and @DarkLight1337

Last time, you mentioned that the following still needed to be designed and addressed:

  • What happens if a batch contains both token IDs and embeddings as input?
  • Prefix caching (currently we use token IDs as the hash key)
  • Speculative decoding (we assume draft models output token IDs to be accepted by the main model)

I wanted to check whether there has been any progress on these efforts.

Thanks for all that you guys do!

~CandiedCode

To avoid complicating the development of V1, we haven't really touched this yet. We'll consider this once it is more stable.

@amuvarma13

Any progress on this?

@mergify mergify bot added and then removed the tpu (Related to Google TPUs) label Mar 27, 2025
qthequartermasterman added a commit to qthequartermasterman/vllm that referenced this pull request Apr 2, 2025
Co-authored-by: Nan2018 <[email protected]>
Signed-off-by: Andrew Sansom <[email protected]>
@mergify mergify bot added the tpu Related to Google TPUs label Apr 9, 2025
@mergify mergify bot removed the tpu Related to Google TPUs label Apr 11, 2025
@DarkLight1337 (Member)

Closing in favor of #15428

@tongjin0521 commented Jul 22, 2025

Hi @ywang96 and @DarkLight1337,
This is currently only supported in the V0 engine, which is going to be deprecated.
Also, V0 doesn't support FlexAttention, for example. Is there any plan for generating from input embeds in the V1 engine?
Or is there a quick fix to achieve this?

@qthequartermasterman (Contributor)

Hi @ywang96 and @DarkLight1337,

This is currently only supported in the V0 engine, which is going to be deprecated.

Also, V0 doesn't support FlexAttention, for example. Is there any plan for generating from input embeds in the V1 engine?

Or is there a quick fix to achieve this?

@Nan2018 and I are writing up an RFC for v1 support. We're hoping to open it this week. We'll have a draft PR in the next few weeks.


Successfully merging this pull request may close these issues:

  • Do vLLM support input_embeds as input while using LLama?
  • [Feature Request] Support input embedding in LLM.generate()