[Model] Refactor Ultravox to use merged input processor #11198

Isotr0py · 2024-12-14T07:58:59Z

Refactor Ultravox to use merged input processor
The Ultravox placeholder will be changed to <|audio|> to stay aligned with HF.
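To make the placeholder change concrete, here is a minimal sketch of how a prompt might be assembled with the new token. The helper name `build_ultravox_prompt` is hypothetical and not part of vLLM. One placeholder per audio clip, each on its own line, is an assumption made for illustration rather than something this PR states.

```python
AUDIO_PLACEHOLDER = "<|audio|>"  # placeholder token named in this PR description


def build_ultravox_prompt(question: str, audio_count: int = 1) -> str:
    """Prefix the question with one audio placeholder per clip.

    Hypothetical helper: the one-placeholder-per-clip layout is an
    assumption for illustration, not confirmed by the PR text.
    """
    if audio_count < 0:
        raise ValueError("audio_count must be non-negative")
    return (AUDIO_PLACEHOLDER + "\n") * audio_count + question


if __name__ == "__main__":
    print(build_ultravox_prompt("What is being said in the recording?"))
```

Prompts written against the previous placeholder would need updating to the new token.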

Signed-off-by: Isotr0py <[email protected]>

DarkLight1337 · 2024-12-15T12:16:46Z

cc @petersalas regarding the change of placeholder token.

vllm/model_executor/models/ultravox.py

DarkLight1337 · 2024-12-15T12:54:05Z

WDYT of making sampling_rate part of mm_processor_kwargs to make the input format consistent with HF? Even so, we should maintain backwards compatibility for a while.

Isotr0py · 2024-12-15T13:36:26Z

WDYT of making sampling_rate part of mm_processor_kwargs to make the input format consistent with HF?

But the Whisper feature extractor uses a fixed sampling rate, so exposing the sampling rate as a dynamic option could cause unnecessary exceptions.

For example, if we specify sampling_rate=32000, the Ultravox processor raises an error because the sampling rate is incorrect.

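from transformers import AutoProcessor
import librosa

# Load the Ultravox HF processor (it wraps a WhisperFeatureExtractor).
processor = AutoProcessor.from_pretrained("fixie-ai/ultravox-v0_3", trust_remote_code=True)
# librosa.load resamples to 22050 Hz by default; sr holds that rate.
audio, sr = librosa.load("translate_to_chinese.wav")
# Deliberately pass a rate other than 16000 to reproduce the error.
processor(text="<|audio|>", audio=audio, sampling_rate=32000)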

ValueError: The model corresponding to this feature extractor: WhisperFeatureExtractor was trained using a sampling rate of 16000. Please make sure that the provided `raw_speech` input was sampled with 16000 and not 32000.

DarkLight1337 · 2024-12-15T14:04:22Z

I see, so the sampling_rate parameter actually refers to the input, not the HF processor. Let's keep this as is then.
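As a rough illustration of that conclusion, the sketch below treats the sampling rate as a property of the input audio and converts the clip to the 16000 Hz rate named in the error message before it would reach the feature extractor. The helper names `resample_linear` and `prepare_audio` are hypothetical. The linear interpolation is a stand-in to keep the sketch dependency-free; a real pipeline would more likely use a proper resampler such as `librosa.resample`.

```python
from typing import List, Sequence, Tuple

TARGET_SR = 16000  # rate the Whisper feature extractor expects, per the error above


def resample_linear(audio: Sequence[float], orig_sr: int,
                    target_sr: int) -> List[float]:
    """Naive linear-interpolation resampler (illustrative stand-in only)."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_sr == target_sr or len(audio) == 0:
        return list(audio)
    n_out = max(1, round(len(audio) * target_sr / orig_sr))
    step = orig_sr / target_sr  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), len(audio) - 1)
        hi = min(lo + 1, len(audio) - 1)
        frac = pos - lo
        # Blend the two neighbouring input samples.
        out.append(audio[lo] * (1.0 - frac) + audio[hi] * frac)
    return out


def prepare_audio(audio: Sequence[float], sr: int) -> Tuple[List[float], int]:
    """Return (audio, sampling_rate) with the audio converted to TARGET_SR."""
    return resample_linear(audio, sr, TARGET_SR), TARGET_SR


if __name__ == "__main__":
    one_second_at_32k = [0.0] * 32000
    resampled, rate = prepare_audio(one_second_at_32k, 32000)
    print(len(resampled), rate)
```

With this shape, callers describe the audio they actually have, and the conversion to the fixed rate happens in one place instead of being a user-tunable processor option.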

Co-authored-by: Cyrus Leung <[email protected]>

DarkLight1337 · 2024-12-16T02:13:47Z

There seems to be a problem with online inference for this model; please fix it.

DarkLight1337

The tests pass so LGTM!

…#11198) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

Isotr0py added 10 commits December 10, 2024 15:19

4c3e5d5 refactor ultravox process
d160560 fix processor inputs
91384bf fix ultravox processor
6d31c3d Merge branch 'vllm-project:main' into ultravox-refactor
782bd61 fix placeholder padding
5350918 Merge branch 'vllm-project:main' into ultravox-refactor
57c7ec9 add comments
c1a9cef update example
89416a8 code format
9693691 remove unused code

mergify bot added the frontend label Dec 14, 2024

This was referenced Dec 14, 2024

[RFC]: Merge input processor and input mapper for multi-modal models #10114

Open

[RFC]: Multi-modality Support on vLLM #4194

Open

Isotr0py and others added 10 commits December 15, 2024 11:59

8254384 Merge branch 'main' into ultravox-refactor
e0ef4bc Merge branch 'vllm-project:main' into ultravox-refactor
08a3422 clean up
d72fe45 refactor
0b8aa47 code format
d5b7cf7 fix prompt replacement
980c731 code format
5cb6362 fix audio_token truncation
0854a67 fix mm_data
146fc63 fix audio_token_len and online inference

Isotr0py marked this pull request as ready for review December 15, 2024 10:51

Isotr0py requested review from DarkLight1337 and ywang96 as code owners December 15, 2024 10:51

DarkLight1337 reviewed Dec 15, 2024

vllm/model_executor/models/ultravox.py