
VLM support for image and video processing with SmolVLM support #206

Open · wants to merge 35 commits into main

Conversation

cyrilzakka

Hey all,

@pcuenca and I are submitting a PR to add support for image and video inference, along with built-in support for SmolVLM. Would love a second pair of eyes on this!

cyrilzakka and others added 30 commits February 12, 2025 10:21
Text inputs, with hardcoded values and considering a single image.
Image patching still not done.

You need to define HF_TOKEN in the environment to be able to download
the model.
I believe pre-processing matches transformers', but inference fails
because of some dimension mismatch.
The configuration fixes that make this work have been applied.
Generation (single image) works now 🔥
Also changed the input type to `image` to keep the sequence of frames
untouched :)
Additional smolvlm changes and adjustments
Images are always upscaled, so always tiled.
Fix single image pre-processing
@awni awni requested a review from davidkoski February 20, 2025 14:24
@awni
Member

awni commented Feb 20, 2025

Wow, awesome PR! Thanks! @davidkoski is out for a few more days, so apologies for the delay in reviewing and getting this merged, but we'll definitely get it landed as soon as possible.

@pcuenca
Contributor

pcuenca commented Feb 20, 2025

No rush! Happy to iterate when David is back!

@chenemii

Came from the Hugging Face blog, very cool! Tried it on a 13 Pro Max; it works for some videos but crashes a lot. Is there a device requirement?

@pcuenca
Contributor

pcuenca commented Feb 22, 2025

Hi @chenemii! We have tested on iPhone 14 to 16, and haven't had time to work on much optimization yet. It probably crashes on your iPhone because of peak RAM use while processing video. The problem is not the amount of RAM in the device, but the per-process limits enforced by iOS, which vary per model family.

We'll run more tests when we open the TestFlight beta; if you want, you can sign up here.

@pcuenca
Contributor

pcuenca commented Feb 22, 2025

@chenemii A couple of ideas though:

  • Limit the max total number of frames to something like 20 here. The default configuration uses 64.
  • Limit Metal memory with something like the following. It could result in slower execution:

    ```swift
    let maxMetalMemory = Int(round(0.82 * Double(os_proc_available_memory())))
    MLX.GPU.set(memoryLimit: maxMetalMemory, relaxed: false)
    ```
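Putting both suggestions together, here is a minimal sketch of a setup helper for memory-constrained phones. Assumptions: an iOS target where `os_proc_available_memory()` is available via `import os`; the 0.82 headroom factor and the frame cap of 20 are heuristics from this thread, not MLX defaults, and the helper name is illustrative rather than part of the mlx-swift API.

```swift
import Foundation
import MLX
import os  // provides os_proc_available_memory() on iOS 13+

/// Illustrative sketch (not part of mlx-swift): clamp Metal memory and
/// choose a conservative frame cap before running video inference.
func configureForConstrainedDevice(maxFrames: Int = 20) -> Int {
    // iOS enforces a per-process memory ceiling that varies by device
    // family; leave ~18% headroom below it to reduce the chance of the
    // process being killed during video pre-processing.
    let maxMetalMemory = Int(round(0.82 * Double(os_proc_available_memory())))
    MLX.GPU.set(memoryLimit: maxMetalMemory, relaxed: false)
    // Use this cap when sampling video frames instead of the default 64.
    return maxFrames
}
```

With `relaxed: false`, allocations beyond the limit block until memory is freed rather than growing past the cap, which trades speed for staying under the iOS per-process limit.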

@chenemii

chenemii commented Feb 22, 2025

@pcuenca Good point, signed up for testing. I can help validate on the 13.

@davidkoski
Collaborator

I am back -- I will look at this today or tomorrow. Very exciting!

```swift
var maxProcessingImageSize: CGFloat { CGFloat(config.size.longestEdge) } // 2048
var fixedImageSize: CGFloat { CGFloat(config.maxImageSize.longestEdge) } // 384 for big models, 512 for small models (200-500M)
var imageSequenceLength: Int { config.imageSequenceLength }
var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
```

Suggested change:

```diff
-var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
+var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ } // Limited to reduce memory consumption on phones
```

5 participants