VLM support for image and video processing with SmolVLM support #206
Video/image fixes
Text inputs, with hardcoded values and considering a single image. Image patching still not done. You need to define HF_TOKEN in the environment to be able to download the model.
I believe pre-processing matches transformers', but inference fails because of some dimension mismatch.
The configuration fixes that make this work have been applied.
Generation (single image) works now 🔥
Also changed the input type to `image` to keep the sequence of frames untouched :)
smolvlm processing
Some cleanup
Additional smolvlm changes and adjustments
Images are always upscaled, so always tiled.
Fix single image pre-processing
Multiply fps if duration < 10
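To make the "always upscaled, so always tiled" commit concrete, here is a minimal sketch of how a tile grid could be derived. It assumes the image is scaled so its longest edge is 2048 and then cut into square tiles of 384 or 512 pixels, the values quoted in the review comments on this PR. `TileGrid` and `tileGrid(width:height:longestEdge:tileSize:)` are hypothetical names for illustration only, and the actual processor in this PR may differ.

```swift
import Foundation

/// Hypothetical illustration, not the PR's implementation.
struct TileGrid: Equatable {
    let columns: Int
    let rows: Int
    let tileSize: Int
    var tileCount: Int { columns * rows }
    var paddedWidth: Int { columns * tileSize }
    var paddedHeight: Int { rows * tileSize }
}

/// Scales the image so its longest edge equals `longestEdge` (small images are
/// upscaled too), then counts how many `tileSize` squares cover the result.
/// The default values are assumptions taken from the review comments.
func tileGrid(width: Int, height: Int,
              longestEdge: Int = 2048, tileSize: Int = 512) -> TileGrid? {
    guard width > 0, height > 0, longestEdge > 0, tileSize > 0 else { return nil }

    // Keep the aspect ratio: the longer side becomes exactly `longestEdge`.
    let scaledWidth: Int
    let scaledHeight: Int
    if width >= height {
        scaledWidth = longestEdge
        scaledHeight = max(1, Int((Double(height * longestEdge) / Double(width)).rounded()))
    } else {
        scaledHeight = longestEdge
        scaledWidth = max(1, Int((Double(width * longestEdge) / Double(height)).rounded()))
    }

    // Integer ceiling division: round each side up to a whole number of tiles.
    let columns = (scaledWidth + tileSize - 1) / tileSize
    let rows = (scaledHeight + tileSize - 1) / tileSize
    return TileGrid(columns: columns, rows: rows, tileSize: tileSize)
}
```

Under these assumptions, a 640×480 image is scaled to 2048×1536 and covered by a 4×3 grid of 512-pixel tiles. Because every input is scaled up to the same longest edge, even a small image produces more than one tile.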
Wow awesome PR! Thanks! @davidkoski is out for a few more days so apologies for the delay reviewing and getting this merged but we'll definitely get it landed as soon as possible.
No rush! Happy to iterate when David is back!
Came from the Hugging Face blog, very cool! Tried on an iPhone 13 Pro Max: works for some videos but crashes a lot. Is there a device requirement?
Hi @chenemii! We have tested on iPhone 14 to 16 and haven't had time to work on much optimization yet. It probably crashes on your iPhone because of peak RAM use while processing video. The problem is not the amount of RAM in the device, but the per-process limits enforced by iOS, which vary per model family. We'll run more tests when we open the TestFlight beta; if you want, you can sign up here.
@chenemii A couple of ideas though:
let maxMetalMemory = Int(round(0.82 * Double(os_proc_available_memory())))
MLX.GPU.set(memoryLimit: maxMetalMemory, relaxed: false)
@pcuenca Good point, signed up for testing. I can help validate for the 13
I am back -- I will look at this today or tomorrow. Very exciting!
var maxProcessingImageSize: CGFloat { CGFloat(config.size.longestEdge) } // 2048
var fixedImageSize: CGFloat { CGFloat(config.maxImageSize.longestEdge) } // 384 for big models, 512 for small models (200-500M)
var imageSequenceLength: Int { config.imageSequenceLength }
var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ } // Limited to reduce memory consumption on phones
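As an illustration of what a frame cap like `maxVideoFrames` implies, here is a minimal sketch that picks at most a fixed number of evenly spaced frame indices from a clip. `sampleFrameIndices(totalFrames:maxFrames:)` is a hypothetical helper, not code from this PR, and the PR's own sampling logic may work differently.

```swift
import Foundation

/// Hypothetical helper (not from the PR): returns at most `maxFrames` frame
/// indices, evenly spaced across a clip that has `totalFrames` frames.
func sampleFrameIndices(totalFrames: Int, maxFrames: Int) -> [Int] {
    guard totalFrames > 0, maxFrames > 0 else { return [] }

    // Short clips: keep every frame.
    if totalFrames <= maxFrames { return Array(0..<totalFrames) }

    // A single requested frame would make the step below divide by zero.
    if maxFrames == 1 { return [0] }

    // Spread the picks so the first and last frames are both included.
    let step = Double(totalFrames - 1) / Double(maxFrames - 1)
    return (0..<maxFrames).map { Int((Double($0) * step).rounded()) }
}

// Example: cap a 300-frame clip at 20 frames, the value hardcoded above.
let indices = sampleFrameIndices(totalFrames: 300, maxFrames: 20)
```

Capping the count this way bounds the number of decoded frames held in memory regardless of clip length, which is the concern raised earlier in the thread about per-process memory limits on iOS.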
Hey all,
@pcuenca and I are submitting a PR to add support for image and video inference, along with built-in support for SmolVLM. Would love a second pair of eyes on this!