[RFC]: Merge input processor and input mapper for multi-modal models #10114

DarkLight1337 · 2024-11-07T09:57:55Z

Motivation

Background

To provide more control over the model inputs, we currently define two methods for multi-modal models in vLLM:

The input processor is called inside LLMEngine to extend the prompt with placeholder tokens which are reserved for vLLM features such as KV cache and chunked prefill.
The input mapper is called inside ModelRunner to transform multi-modal inputs (e.g. PIL images) into tensor inputs, usually via the modality-specific processor (e.g. AutoImageProcessor) from HuggingFace.

Issues with the current design

The input processor accepts the output of HF AutoTokenizer, a list of token IDs, instead of the text prompt. Since HF AutoProcessor doesn’t accept token IDs, we have to write custom code to edit the list of token IDs based on the multi-modal inputs. For some models (such as Phi-3-vision), this means re-implementing code from their HF AutoProcessor, complicating the process of porting the model to vLLM.
The input mapper, being inside ModelRunner, lies on the critical path of vLLM’s model execution. Even when the input mapper is fast, the tail TTFT and TPOT suffers because of this. As the input mapper takes up more time, our overall throughput decreases proportionally which can be avoided if we move it outside of the critical path. Nevertheless, we can do little if the AutoProcessor inside input mapper is very slow, like in #9238. Hope that huggingface/transformers#33810 can help with that!
This abstraction results in redundant processing for models (such as Qwen2-VL and Molmo) with HF AutoProcessor that already performs most of the work for calculating the number of placeholder tokens.

Proposed Change

Unified multi-modal processor

We plan to merge our input processor and input mapper into a unified multi-modal processor (BaseMultiModalProcessor) that wraps HF AutoProcessor, and call it inside the LLMEngine (and thus benefit from #8779), taking the role of the existing tokenizer. After this change, each input type will be processed as follows:

Text-only prompt: Pass to vLLM tokenizer (wraps HF AutoTokenizer) [Unchanged]
List of token IDs: Skip vLLM tokenizer [Unchanged]
Text prompt with multi-modal input: Pass to vLLM multi-modal processor [NEW]
List of token IDs with multi-modal input: ~~[Deprecated]~~ Pass to vLLM multi-modal processor [NEW]

Automatic prompt replacement

BaseMultiModalProcessor._get_prompt_replacements specifies HF's logic of replacing input placeholder tokens (e.g. <image> for a single image) with feature placeholder tokens (e.g. <image><image>...<image>, the number of which equals to the feature size). Given this specification, we can automatically detect whether HF has replaced the input placeholder tokens by checking whether the feature placeholder tokens exist in the prompt.

BaseMultiModalProcessor._apply_prompt_replacements provides model-agnostic code for automatically replacing input placeholder tokens with feature placeholder tokens. This is only called if we find that HF hasn't done so yet.

This enables the multi-modal processor to accept text/token prompts and process them separately from the multi-modal data. The detailed logic is shown in BaseMultiModalProcessor._apply_hf_processor_main.

Processor caching

#11396 caches each item in the multi-modal output of HF processor and links them back to items in the input data.

When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.

Note that the text/token prompt must be processed separately from the multi-modal data because HF processors expect the input placeholders in the text to correspond to each multi-modal data item, but we only want to process the items that are missing. We can handle this elegantly using automatic prompt replacement (see above).

Deprecate token IDs with multi-modal input

To be compatible with OpenAI’s (legacy) Completions API, we currently support passing token IDs directly to both LLM class and OpenAI-compatible server. However, Completions API doesn’t support multi-modal inputs, so we will deprecate passing token IDs alongside multi-modal inputs to simplify model implementation (see Issue 1 above). Please tell us if you have a use case for this and don’t want to see it removed!

Feedback Period

Feel free to comment as the effort progresses!

Timeline

The majority of our code will be called inside the existing InputPreprocessor which is separated from the vLLM engine, making it easy to integrate with #8779.

CC List

@ywang96 @Isotr0py @WoosukKwon @robertgshaw2-neuralmagic

Any Other Things

Multi-modal plugins remain supported Migrating multi-modal plugins

You can define additional input modalities (ModalityDataItems) and parse them in subclasses of MultiModalDataParser on a per-model basis. Afterwards, override BaseMultiModalProcessor._get_data_parser to construct your newly-defined parser.

Some users currently use multi-modal plugins to directly pass custom model inputs (#6260). Those inputs can be excluded from HF processing by returning them in ModalityDataItems.get_passthrough_data instead of ModalityDataItems.get_processor_data.

No batched preprocessing for now

Currently, preprocessing is performed per prompt in vLLM. While we can call HF tokenizer and modality-specific processor on batched inputs separately, calling the wrapping HF AutoProcessor with both list of texts and list of multi-modal data results in the processed multi-modal data (e.g. image) being assigned to every text in the list, rather than the more intuitive zip-like behavior (e.g. the ith image only assigned to the ith text). To support batched preprocessing, we would have to write custom code for each model to combine the outputs of HF tokenizer and modality-specific processors. Given that this can significantly complicate model implementation (see Issue 1 above), we will not consider batched preprocessing at this stage, even with this change.

The text was updated successfully, but these errors were encountered:

robertgshaw2-redhat · 2024-11-08T14:25:54Z

This is great. In the EngineCore/AsyncLLM refactor (#9826), we introduced the concept of a Processor. I think that this code should sit inside there.

You initiative here will fit very well with the EngineCore/AsyncLLM refactor --- since the Processor runs in process 0, while the EngineCore runs in process 1. This means that we can overlap the input processing with the model execution (which is not currently true since the input processing currently runs in ModelRunner, which is part of Engine core.

One other note. The Processor in the PR linked currently runs inside process 0. However, we made the APIs such that we can adjust the Processor to run N background processes if needed. So, if you can work within this class, we can have a nice separation of concerns, which will enable us to offload more things to background processes as need.

Very excited about this!

mlinmg · 2024-11-29T17:53:55Z

I would like to discuss an edge case where passing the input ids and the MultiModal args is rather useful.
My use case is that I have implemented a general TTS engine using vllm as the backbone for the decoder model, in tts you have essentially 2 dictionaries, one is mapped with a tokenizer(the text one) and one which is for "audio tokens" that aren't mapped to a dictionary. usually the decoder model generates tokens that are not mappable with the text tokenizer. since both dictionaries have different bos and eos tokens it is rather complex to uniform the preprocessing and it is much easier to just do it manually (https://github.com/astramind-ai/Auralis/blob/main/src/auralis/models/xttsv2/XTTSv2.py and https://github.com/astramind-ai/Auralis/blob/main/src/auralis/models/xttsv2/components/vllm_mm_gpt.py)

mlinmg · 2024-11-29T17:56:29Z

maybe a solution is explicitly including a supeclass in the model definition which will allow such behavior, otherwise deprecating it?

DarkLight1337 · 2024-11-30T04:43:20Z

Maybe we can make a special case and allow token IDs if all other inputs aren't processed by HF.

jnordberg · 2024-12-06T07:58:46Z

INFO 12-06 07:56:21 preprocess.py:215] Your model uses the legacy input pipeline instead of the new multi-modal processor. Please note that the legacy pipeline will be removed in a future release. For more details, see: https://github.com/vllm-project/vllm/issues/10114

This seems a bit premature since this new multi-modal processor isn't even usable yet

DarkLight1337 · 2024-12-06T08:29:41Z

INFO 12-06 07:56:21 preprocess.py:215] Your model uses the legacy input pipeline instead of the new multi-modal processor. Please note that the legacy pipeline will be removed in a future release. For more details, see: https://github.com/vllm-project/vllm/issues/10114

This seems a bit premature since this new multi-modal processor isn't even usable yet

The purpose of that is to direct users to this RFC thread, so we can get more thoughts.

jnordberg · 2024-12-06T08:34:04Z

My thought is that the warning is very annoying 😀

DarkLight1337 · 2024-12-06T08:37:19Z

Sorry for the spam, it has been fixed in #10530 so the message is now only logged once.

DarkLight1337 · 2025-01-09T14:43:29Z

I would like to discuss an edge case where passing the input ids and the MultiModal args is rather useful.
My use case is that I have implemented a general TTS engine using vllm as the backbone for the decoder model, in tts you have essentially 2 dictionaries, one is mapped with a tokenizer(the text one) and one which is for "audio tokens" that aren't mapped to a dictionary. usually the decoder model generates tokens that are not mappable with the text tokenizer. since both dictionaries have different bos and eos tokens it is rather complex to uniform the preprocessing and it is much easier to just do it manually (https://github.com/astramind-ai/Auralis/blob/main/src/auralis/models/xttsv2/XTTSv2.py and https://github.com/astramind-ai/Auralis/blob/main/src/auralis/models/xttsv2/components/vllm_mm_gpt.py)

Good news: since we need to apply our own prompt replacement logic anyway, I have opened #11900 to enable this.

akshay-loci · 2025-01-10T14:44:25Z

One question here - We use our own multimodal LLM that we deploy and call using VLLM's openai server. I'm forced to name our model the same as an existing supported model (like 'phi3_v') in it's config.json even though the model has its own template, placeholder tokens etc. The only reason for that seems to be because vllm tries to insert placeholder tokens if they don't exist in the prompt. Will this change also allow getting rid of that behaviour?

This is the code I'm referring to
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/chat_utils.py#L420

DarkLight1337 · 2025-01-10T14:49:56Z

That is a separate issue regarding online serving. ~~I think you can avoid triggering that logic by updating your chat template to support "openai" format (see here).~~ Never mind, seems that this is triggered in both formats. I think it should be possible to change the logic so that those placeholder tokens are only needed in "string" format.

xfalcox · 2025-01-16T18:29:31Z

I'm getting this warning for OpenGVLab/InternVL2_5-8B using latest docker image. Will this model be migrated to the new pipeline too?

DarkLight1337 · 2025-01-17T03:31:45Z

Yes, we are planning to migrate all models eventually.

DarkLight1337 added the RFC label Nov 7, 2024

This was referenced Nov 7, 2024

[1/N] Initial prototype for multi-modal processor #10044

Merged

[0/N] Rename MultiModalInputs to MultiModalKwargs #10040

Merged

[RFC]: Multi-modality Support on vLLM #4194

Open

xffxff mentioned this issue Nov 25, 2024

[Model]: Add support for Aria model #10514

Merged

This was referenced Nov 26, 2024

[Model] Implement merged input processor for LLaVA model #10676

Merged

[2/N] Proper handling of placeholders in merged multi-modal processor #10485

Merged

Sunxiaohu0406 mentioned this issue Nov 27, 2024

[Bug]: glm-4v-9b报错 wangshuai09/vllm#4

Open

1 task

heichang12138 mentioned this issue Dec 18, 2024

Qwen2-VL VLLM backend infer_dataset报Engine iteration timed out. This should never happen modelscope/ms-swift#2690

Closed

This was referenced Dec 18, 2024

[Model] Whisper model implementation #11280

Merged

[VLM] Merged multimodal processor for Qwen2-Audio #11303

Merged

DarkLight1337 mentioned this issue Jan 9, 2025

[VLM] Enable tokenized inputs for merged multi-modal processor #11900

Merged

LaoWangGB mentioned this issue Jan 20, 2025

[Doc] Explicitly state that InternVL 2.5 is supported #10978

Merged

ywang96 mentioned this issue Feb 13, 2025

[V1] Clarify input processing and multimodal feature caching logic #13211

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC]: Merge input processor and input mapper for multi-modal models #10114

[RFC]: Merge input processor and input mapper for multi-modal models #10114

DarkLight1337 commented Nov 7, 2024 •

edited

Loading

robertgshaw2-redhat commented Nov 8, 2024

mlinmg commented Nov 29, 2024

mlinmg commented Nov 29, 2024

DarkLight1337 commented Nov 30, 2024

jnordberg commented Dec 6, 2024

DarkLight1337 commented Dec 6, 2024

jnordberg commented Dec 6, 2024

DarkLight1337 commented Dec 6, 2024 •

edited

Loading

DarkLight1337 commented Jan 9, 2025 •

edited

Loading

akshay-loci commented Jan 10, 2025 •

edited

Loading

DarkLight1337 commented Jan 10, 2025 •

edited

Loading

xfalcox commented Jan 16, 2025

DarkLight1337 commented Jan 17, 2025

[RFC]: Merge input processor and input mapper for multi-modal models #10114

[RFC]: Merge input processor and input mapper for multi-modal models #10114

Comments

DarkLight1337 commented Nov 7, 2024 • edited Loading

Motivation

Background

Issues with the current design

Proposed Change

Unified multi-modal processor

Automatic prompt replacement

Processor caching

Deprecate token IDs with multi-modal input

Feedback Period

Timeline

CC List

Any Other Things

Multi-modal plugins remain supported Migrating multi-modal plugins

No batched preprocessing for now

robertgshaw2-redhat commented Nov 8, 2024

mlinmg commented Nov 29, 2024

mlinmg commented Nov 29, 2024

DarkLight1337 commented Nov 30, 2024

jnordberg commented Dec 6, 2024

DarkLight1337 commented Dec 6, 2024

jnordberg commented Dec 6, 2024

DarkLight1337 commented Dec 6, 2024 • edited Loading

DarkLight1337 commented Jan 9, 2025 • edited Loading

akshay-loci commented Jan 10, 2025 • edited Loading

DarkLight1337 commented Jan 10, 2025 • edited Loading

xfalcox commented Jan 16, 2025

DarkLight1337 commented Jan 17, 2025

DarkLight1337 commented Nov 7, 2024 •

edited

Loading

DarkLight1337 commented Dec 6, 2024 •

edited

Loading

DarkLight1337 commented Jan 9, 2025 •

edited

Loading

akshay-loci commented Jan 10, 2025 •

edited

Loading

DarkLight1337 commented Jan 10, 2025 •

edited

Loading