Online video support for VLMs #10020

litianjian · 2024-11-05T04:02:08Z

Online video support for VLMs

vLLM already supports a large number of MultiModal Machine Learning visual models, some of which support image and video input，such as Qwen2-VL, LLaVA-Onevision, etc. Referring to the implementation of image, this proposal adds support for video.

Refer to the visual interfaces of OpenAI (vision and video) and Google Gemini, the visual interface should ideally support inputs from Video URLs and base64.

FIX #9842

Examples

vllm serve llava-hf/llava-onevision-qwen2-7b-ov-hf --served-model-name hello --trust-remote-code

try:
    from decord import VideoReader, cpu
except ImportError:
    pass

import base64
from io import BytesIO
from PIL import Image
import numpy as np
import requests

from openai import OpenAI
import time

openai_api_key = "123"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

def encode_video(video_path, max_frames=80, is_fps_sampling=True):
    if video_path.startswith("http") or video_path.startswith("https"):
        response = requests.get(video_path)
        if response.status_code == 200:
            video_path = BytesIO(response.content)
        else:
            print('failed to load the video')

        vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
        total_frame_num = len(vr)
        if is_fps_sampling:
            # FPS Sampling
            avg_fps = round(vr.get_avg_fps())
            frame_idx = [i for i in range(
                0, total_frame_num, avg_fps)]
            if len(frame_idx) > max_frames:
                uniform_sampled_frames = np.linspace(
                    0, total_frame_num - 1, max_frames, dtype=int
                )
                frame_idx = uniform_sampled_frames.tolist()
            print(frame_idx)
        else:
            # uniform sampling
            if total_frame_num > max_frames:
                uniform_sampled_frames = np.linspace(
                    0, total_frame_num - 1, max_frames, dtype=int
                )
                frame_idx = uniform_sampled_frames.tolist()
            else:
                frame_idx = [i for i in range(0, total_frame_num)]
            print(frame_idx)

        frames = vr.get_batch(frame_idx).asnumpy()
        print("actual frames", len(frames))
        
        base64_frames = []
        for frame in frames:
            img = Image.fromarray(frame)
            output_buffer = BytesIO()
            img.save(output_buffer, format="PNG")

            byte_data = output_buffer.getvalue()
            base64_str = base64.b64encode(byte_data).decode("utf-8")
            base64_frames.append(base64_str)
        return base64_frames


video_url = "https://raw.githubusercontent.com/EvolvingLMMs-Lab/sglang/dev/onevision_local/assets/jobs.mp4"
frames = encode_video(video_url, max_frames=32, is_fps_sampling=False)
images = []
images.extend(frames)
video_base64 = ",".join(images)

chat_response = client.chat.completions.create(
    model="hello",
    temperature=0,
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the video token `<video>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "Please describe the video comprehensively as much as possible."},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)


chat_response = client.chat.completions.create(
    model="hello", 
    temperature=0,
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the video token `<video>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "Please describe the video comprehensively as much as possible."},
            {"type": "video_url", "video_url": {"url": f"data:video/png;base64,{video_base64}"}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)

chat_response = client.chat.completions.create(
    model="hello",
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the image token `<image>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "What’s in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{images[0]}"}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)

github-actions · 2024-11-05T04:02:21Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

vllm/multimodal/utils.py

vllm/entrypoints/chat_utils.py

DarkLight1337

The code looks good, but please add some tests to verify this.

mergify · 2024-11-06T05:14:40Z

This pull request has merge conflicts that must be resolved before it can be
merged. @litianjian please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

litianjian · 2024-11-06T16:14:45Z

The code looks good, but please add some tests to verify this.

OK , I have updated the tests.

DarkLight1337

Thanks for adding this!

DarkLight1337 · 2024-11-07T03:37:22Z

It looks like the tests failed though, PTAL.

litianjian · 2024-11-07T09:15:44Z

It looks like the tests failed though, PTAL.

The tests succeeded in my local machine.

DarkLight1337 · 2024-11-07T09:25:07Z

~~It looks like vllm[video] isn't being installed for the test environment.~~ I have updated the dependencies.

xiaoajie738 · 2024-11-07T14:00:16Z

Excuse me, I have a question to ask, if we want to use vllm to call the qwen2-lv model, when we want to pass the video as input, if I extract the frame and crop it on the client to get the desired video frame sequence, call the chat interface of vllm by passing the picture sequence or call the chat interface by directly passing the video and the sampling frequency and the desired size implemented in this pr, will there be any difference in the results of these two calling methods? I understand that video passing video will bring more network bandwidth pressure and latency, thanks for the answer!

DarkLight1337 · 2024-11-07T14:03:07Z

Excuse me, I have a question to ask, if we want to use vllm to call the qwen2-lv model, when we want to pass the video as input, if I extract the frame and crop it on the client to get the desired video frame sequence, call the chat interface of vllm by passing the picture sequence or call the chat interface by directly passing the video and the sampling frequency and the desired size implemented in this pr, will there be any difference in the results of these two calling methods? I understand that video passing video will bring more network bandwidth pressure and latency, thanks for the answer!

The HF processor will be called regardless of whether you have done preprocessing beforehand. I am not sure whether HF processor is intelligent enough to return early if the image has already been cropped though.

Signed-off-by: DarkLight1337 <[email protected]>

xiaoajie738 · 2024-11-07T15:25:09Z

Excuse me, I have a question to ask, if we want to use vllm to call the qwen2-lv model, when we want to pass the video as input, if I extract the frame and crop it on the client to get the desired video frame sequence, call the chat interface of vllm by passing the picture sequence or call the chat interface by directly passing the video and the sampling frequency and the desired size implemented in this pr, will there be any difference in the results of these two calling methods? I understand that video passing video will bring more network bandwidth pressure and latency, thanks for the answer!

The HF processor will be called regardless of whether you have done preprocessing beforehand. I am not sure whether HF processor is intelligent enough to return early if the image has already been cropped though.

"Apologies if I wasn’t clear enough. I wanted to ask whether I can convert the video into individual frames and then call the chat interface by passing them as multiple images using the image_url field, rather than using the video_url field introduced in this PR. Is this approach feasible?"

DarkLight1337 · 2024-11-07T15:31:13Z

Excuse me, I have a question to ask, if we want to use vllm to call the qwen2-lv model, when we want to pass the video as input, if I extract the frame and crop it on the client to get the desired video frame sequence, call the chat interface of vllm by passing the picture sequence or call the chat interface by directly passing the video and the sampling frequency and the desired size implemented in this pr, will there be any difference in the results of these two calling methods? I understand that video passing video will bring more network bandwidth pressure and latency, thanks for the answer!

The HF processor will be called regardless of whether you have done preprocessing beforehand. I am not sure whether HF processor is intelligent enough to return early if the image has already been cropped though.

"Apologies if I wasn’t clear enough. I wanted to ask whether I can convert the video into individual frames and then call the chat interface by passing them as multiple images using the image_url field, rather than using the video_url field introduced in this PR. Is this approach feasible?"

Yes, Qwen2-VL supports both multi-image and video input.

Signed-off-by: DarkLight1337 <[email protected]>

DarkLight1337 · 2024-11-07T16:35:59Z

Tests should pass now!

Signed-off-by: DarkLight1337 <[email protected]>

litianjian · 2024-11-08T02:02:15Z

Tests should pass now!

Thank you for your patience.

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: Isotr0py <[email protected]>

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: OmerD <[email protected]>

hujh818 · 2024-11-11T03:23:44Z

@litianjian Thank you very much for your work. However, currently when video_url is passed in, we cannot control the logic of video frame extraction and image resizing. As a result, we cannot finely control the output result of the video. I am wondering if it is possible to specify a video_process function at the same time when the url is passed in, similar to encode_video when processing base64. After vllm downloads the video, use this function to process the video. The reason for hoping to adopt this method instead of inputting in base64 is that transmitting base64 video occupies too much bandwidth and is prone to network congestion.

DarkLight1337 · 2024-11-11T03:26:48Z

For some models (e.g. Qwen2-VL), You can set --mm-processor-kwargs on startup to configure the HF processor class.

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: Loc Huynh <[email protected]>

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: Sumit Dubey <[email protected]>

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]>

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]>

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>

add video online support

06b8c37

mergify bot added the frontend label Nov 5, 2024

DarkLight1337 assigned DarkLight1337 and litianjian and unassigned litianjian Nov 5, 2024

DarkLight1337 reviewed Nov 5, 2024

View reviewed changes

vllm/multimodal/utils.py Show resolved Hide resolved

vllm/multimodal/utils.py Show resolved Hide resolved

DarkLight1337 reviewed Nov 5, 2024

View reviewed changes

vllm/entrypoints/chat_utils.py Outdated Show resolved Hide resolved

remove openai reference in video_url

2762d66

mergify bot added documentation Improvements or additions to documentation ci/build labels Nov 5, 2024

format

4458c3e

DarkLight1337 reviewed Nov 5, 2024

View reviewed changes

mergify bot added the needs-rebase label Nov 6, 2024

add video online test

efeb3ea

litianjian requested review from robertgshaw2-neuralmagic and simon-mo as code owners November 6, 2024 16:14

Merge branch 'main' into video_server_vllm

18e6d27

mergify bot removed the needs-rebase label Nov 6, 2024

format code

6487017

DarkLight1337 approved these changes Nov 7, 2024

View reviewed changes

Fix test dependencies

9dd2378

Signed-off-by: DarkLight1337 <[email protected]>

DarkLight1337 added 2 commits November 7, 2024 16:30

Avoid OOM in tests

392e541

Signed-off-by: DarkLight1337 <[email protected]>

Update timeouts

0a1f84c

Signed-off-by: DarkLight1337 <[email protected]>

DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 7, 2024

DarkLight1337 enabled auto-merge (squash) November 7, 2024 16:35

Fix dependencies

3cc6968

Signed-off-by: DarkLight1337 <[email protected]>

DarkLight1337 merged commit 28b2877 into vllm-project:main Nov 7, 2024
69 of 70 checks passed

DarkLight1337 mentioned this pull request Nov 8, 2024

[RFC]: Multi-modality Support Refactoring #4194

Open

50 tasks

sayakpaul mentioned this pull request Nov 9, 2024

[RFC]: Support for video input #7558

Closed

rickyyx pushed a commit to rickyyx/vllm that referenced this pull request Nov 13, 2024

Online video support for VLMs (vllm-project#10020)

0963933

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>

lk-chen mentioned this pull request Nov 19, 2024

[Misc]: pip-compile requirements-test.in fails with conflict from "decord" #10466

Closed

1 task

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024

Online video support for VLMs (vllm-project#10020)

09af43b

Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Online video support for VLMs #10020

Online video support for VLMs #10020

litianjian commented Nov 5, 2024 •

edited

Loading

github-actions bot commented Nov 5, 2024

DarkLight1337 left a comment •

edited

Loading

mergify bot commented Nov 6, 2024

litianjian commented Nov 6, 2024

DarkLight1337 left a comment

DarkLight1337 commented Nov 7, 2024

litianjian commented Nov 7, 2024

DarkLight1337 commented Nov 7, 2024 •

edited

Loading

xiaoajie738 commented Nov 7, 2024

DarkLight1337 commented Nov 7, 2024

xiaoajie738 commented Nov 7, 2024

DarkLight1337 commented Nov 7, 2024

DarkLight1337 commented Nov 7, 2024

litianjian commented Nov 8, 2024

hujh818 commented Nov 11, 2024

DarkLight1337 commented Nov 11, 2024

Online video support for VLMs #10020

Online video support for VLMs #10020

Conversation

litianjian commented Nov 5, 2024 • edited Loading