Skip to content

Commit

Permalink
[Doc] Update vlm.rst to include an example on videos (vllm-project#9155)
Browse files Browse the repository at this point in the history
Co-authored-by: Cyrus Leung <[email protected]>
  • Loading branch information
sayakpaul and DarkLight1337 authored Oct 8, 2024
1 parent f17ae0f commit 0d104d9
Showing 1 changed file with 27 additions and 0 deletions.
27 changes: 27 additions & 0 deletions docs/source/models/vlm.rst
Original file line number Diff line number Diff line change
Expand Up @@ -135,6 +135,33 @@ Instead of passing in a single image, you can pass in a list of images.
A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.

Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:

.. code-block:: python
# Specify the maximum number of frames per video to be 4. This can be changed.
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
# Create the request payload.
video_frames = ... # load your video making sure it only has the number of frames specified earlier.
message = {
"role": "user",
"content": [
{"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
],
}
for i in range(len(video_frames)):
base64_image = encode_image(video_frames[i]) # base64 encoding.
new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
message["content"].append(new_image)
# Perform inference and log output.
outputs = llm.chat([message])
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
Online Inference
----------------

Expand Down

0 comments on commit 0d104d9

Please sign in to comment.