
Conversation

fyabc
Contributor

@fyabc fyabc commented Apr 9, 2025

This draft PR adds support for the Qwen2.5-Omni model (end-to-end full support).

This PR is a later version of #15130; it adds support for the talker, code2wav, and an OmniLLMEngine class that manages the end-to-end audio generation process.
See #15130 for more details about the Qwen2.5-Omni model architecture.

NOTE: Since this PR makes significant changes to vLLM, it is a draft and will not be merged in the short term.

Requirements

This PR requires huggingface/transformers#36752.

pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8

Note: you need to install transformers from source from that branch, as pinned in the command above.

Example Usage

python examples/offline_inference/qwen2_5_omni/end2end.py --model Qwen/Qwen2.5-Omni-7B --prompt audio-in-video-v2 --enforce-eager --do-wave --voice-type m02 --warmup-voice-type m02

This command will print the text output and generate .wav output files under the current folder.
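For reference, one way to inspect a generated file is with soundfile (a minimal sketch; the filename below is a placeholder, since the actual output names depend on the script):

```python
# Minimal sketch: inspect one of the generated .wav files.
# "output_0.wav" is a placeholder name, not the script's actual output name.
import soundfile as sf

waveform, sample_rate = sf.read("output_0.wav")
print(f"{len(waveform) / sample_rate:.2f} s of audio at {sample_rate} Hz")
```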


github-actions bot commented Apr 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation, frontend, multi-modality (#4194), and tpu labels Apr 9, 2025

mergify bot commented Apr 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fyabc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 9, 2025
@DarkLight1337
Member

DarkLight1337 commented Apr 9, 2025

I think we can further split this PR, with the first one (after Qwen2.5-Omni thinker only) adding prompt_embeds support to vLLM. For reference, here are some previous/ongoing efforts to add this feature:

@ywang96
Member

ywang96 commented Apr 9, 2025

Thanks for this contribution! As we discussed offline, we'll be carefully reviewing this PR/design and think about how to enable end-to-end support for models like this with vLLM!

@mergify mergify bot added the ci/build label Apr 10, 2025
fyabc and others added 3 commits April 10, 2025 22:37
Signed-off-by: Tao He <[email protected]>
(cherry picked from commit 005879f2b22e40b7d03be7063e80686862a72e2d)
Signed-off-by: fyabc <[email protected]>
@majunze2001

Is this fork still usable? After cloning and building, I got the following errors:

root@ubuntu:/workspace# python examples/offline_inference/qwen2_5_omni/end2end.py --model Qwen/Qwen2.5-Omni-7B --prompt audio-in-video-v2 --enforce-eager --do-wave --voice-type m02 --warmup-voice-type m02
INFO 06-01 00:40:02 [__init__.py:239] Automatically detected platform cuda.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
/workspace/examples/offline_inference/qwen2_5_omni/end2end.py:258: UserWarning: PySoundFile failed. Trying audioread instead.
  librosa.load(temp_video_file_path, sr=16000)[0])
/opt/venv/lib/python3.11/site-packages/librosa/core/audio.py:184: FutureWarning: librosa.core.audio.__audioread_load
        Deprecated as of librosa version 0.10.0.
        It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/librosa/core/audio.py", line 176, in load
    y, sr_native = __soundfile_load(path, offset, duration, dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/librosa/core/audio.py", line 209, in __soundfile_load
    context = sf.SoundFile(path)
              ^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/soundfile.py", line 690, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/soundfile.py", line 1265, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening '/tmp/tmp3_ttt320': Format not recognised.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/examples/offline_inference/qwen2_5_omni/end2end.py", line 677, in <module>
    main()
  File "/workspace/examples/offline_inference/qwen2_5_omni/end2end.py", line 651, in main
    prompt = make_omni_prompt()
             ^^^^^^^^^^^^^^^^^^
  File "/workspace/examples/offline_inference/qwen2_5_omni/end2end.py", line 480, in make_omni_prompt
    prompt = make_audio_in_video_v2_prompt()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/examples/offline_inference/qwen2_5_omni/end2end.py", line 400, in make_audio_in_video_v2_prompt
    prompt = make_inputs_qwen2_omni(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/examples/offline_inference/qwen2_5_omni/end2end.py", line 258, in make_inputs_qwen2_omni
    librosa.load(temp_video_file_path, sr=16000)[0])
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/librosa/core/audio.py", line 184, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/librosa/util/decorators.py", line 63, in __wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/librosa/core/audio.py", line 240, in __audioread_load
    reader = audioread.audio_open(path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/audioread/__init__.py", line 132, in audio_open
    raise NoBackendError()
audioread.exceptions.NoBackendError

@mergify mergify bot added the qwen Related to Qwen models label Jun 19, 2025
@liaoweiguo

watching ...

@BakerBunker

@majunze2001 librosa needs a filename suffix to determine the file format in some cases; add a suffix to your temp file and try again.
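For example, a minimal sketch of that workaround (the input path is a placeholder, and an audio backend such as ffmpeg may still be needed for non-wav containers):

```python
# Sketch: give the temporary video file an explicit suffix so
# librosa/soundfile can recognise the container format.
# "input_video.mp4" is a placeholder path; sr=16000 mirrors end2end.py.
import tempfile

import librosa

with open("input_video.mp4", "rb") as src:
    video_bytes = src.read()

with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
    tmp.write(video_bytes)
    temp_video_file_path = tmp.name

# Note: decoding audio from an .mp4 still requires an audioread backend
# (e.g. ffmpeg) to be installed on the system.
audio = librosa.load(temp_video_file_path, sr=16000)[0]
```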

@mergify mergify bot added the new-model Requests to new models label Jul 11, 2025
@SamitHuang
Contributor

> Thanks for this contribution! As we discussed offline, we'll be carefully reviewing this PR/design and think about how to enable end-to-end support for models like this with vLLM!

Looking forward to this feature!
