
VLM support for image and video processing with SmolVLM support #206

Open · wants to merge 35 commits into main

Conversation

cyrilzakka

Hey all,

@pcuenca and I are submitting a PR to add support for image and video inference, along with built-in support for SmolVLM. Would love a second pair of eyes on this!

cyrilzakka and others added 30 commits February 12, 2025 10:21
Text inputs, with hardcoded values and considering a single image.
Image patching still not done.

You need to define HF_TOKEN in the environment to be able to download
the model.
I believe pre-processing matches transformers', but inference fails
because of some dimension mismatch.
The configuration fixes that make this work have been applied.
Generation (single image) works now 🔥
Also changed the input type to `image` to keep the sequence of frames
untouched :)
Additional smolvlm changes and adjustments
Images are always upscaled, so always tiled.
Fix single image pre-processing
@awni awni requested a review from davidkoski February 20, 2025 14:24
@awni
Member

awni commented Feb 20, 2025

Wow, awesome PR! Thanks! @davidkoski is out for a few more days, so apologies for the delay in reviewing and getting this merged, but we'll definitely get it landed as soon as possible.

@pcuenca
Contributor

pcuenca commented Feb 20, 2025

No rush! Happy to iterate when David is back!

@chenemii

Came from the Hugging Face blog, very cool! Tried it on a 13 Pro Max; it works for some videos but crashes a lot. Is there a device requirement?

@pcuenca
Contributor

pcuenca commented Feb 22, 2025

Hi @chenemii! We have tested on iPhone 14 to 16, and haven't had time to work on much optimization yet. It probably crashes on your iPhone because of peak RAM use while processing video. The problem is not the amount of RAM in the device, but the per-process limits enforced by iOS, which vary per model family.

We'll run more tests when we open the TestFlight beta; if you want, you can sign up here.

@pcuenca
Contributor

pcuenca commented Feb 22, 2025

@chenemii A couple of ideas though:

  • Limit the max total number of frames to something like 20 here. The default configuration uses 64.
  • Limit Metal memory with something like the following. It could result in slower execution:

    ```swift
    let maxMetalMemory = Int(round(0.82 * Double(os_proc_available_memory())))
    MLX.GPU.set(memoryLimit: maxMetalMemory, relaxed: false)
    ```
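Putting both suggestions together, here is a minimal sketch of a setup helper for memory-constrained phones. Assumptions: an iOS target where `os_proc_available_memory()` is available via `import os`; the 0.82 headroom factor and the frame cap of 20 are heuristics from this thread, not MLX defaults, and the helper name is illustrative rather than part of the mlx-swift API.

```swift
import Foundation
import MLX
import os  // provides os_proc_available_memory() on iOS 13+

/// Illustrative sketch (not part of mlx-swift): clamp Metal memory and
/// choose a conservative frame cap before running video inference.
func configureForConstrainedDevice(maxFrames: Int = 20) -> Int {
    // iOS enforces a per-process memory ceiling that varies by device
    // family; leave ~18% headroom below it to reduce the chance of the
    // process being killed during video pre-processing.
    let maxMetalMemory = Int(round(0.82 * Double(os_proc_available_memory())))
    MLX.GPU.set(memoryLimit: maxMetalMemory, relaxed: false)
    // Use this cap when sampling video frames instead of the default 64.
    return maxFrames
}
```

With `relaxed: false`, allocations beyond the limit block until memory is freed rather than growing past the cap, which trades speed for staying under the iOS per-process limit.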

@chenemii

chenemii commented Feb 22, 2025

@pcuenca Good point, signed up for testing. I can help validate on the 13.

@davidkoski
Collaborator

I am back -- I will look at this today or tomorrow. Very exciting!

```swift
var maxProcessingImageSize: CGFloat { CGFloat(config.size.longestEdge) } // 2048
var fixedImageSize: CGFloat { CGFloat(config.maxImageSize.longestEdge) } // 384 for big models, 512 for small models (200-500M)
var imageSequenceLength: Int { config.imageSequenceLength }
var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
```

Suggested change:

```diff
-var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
+var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ } // Limited to reduce memory consumption on phones
```

5 participants