Skip to content

Conversation

@deadprogram
Copy link
Contributor

This PR adds mtmd_get_vision_image_size() and mtmd_get_vision_patch_size() functions as mentioned in #16703

Copy link
Collaborator

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases, I don't think exposing these data is a good idea.

The API for getting audio bit rate is required because there is no resampling logic inside mtmd. Providing input with wrong bit rate will make the model to malfunction.

On the other hand, image size and patch size are not public API as mtmd can work with any input image sizes. The image will be resized internally.

Unless you can show the code where you use this, I don't think we should add these API. Maybe what you're doing is already possible via mtmd_image_tokens_get_n_tokens

Comment on lines +115 to +117
// get vision image size in pixels, for example 1024
// return -1 if vision is not supported
MTMD_API int mtmd_get_vision_image_size(mtmd_context * ctx);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a good idea to add this, as many models now support dynamic resolution and this won't give the correct size

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, some models are fixed res, some are naflex.

The presumption is that if you want to do your own pre-processing, you will need to be familiar with what the model you are using expects.

@deadprogram
Copy link
Contributor Author

In most cases, I don't think exposing these data is a good idea.

The API for getting audio bit rate is required because there is no resampling logic inside mtmd. Providing input with wrong bit rate will make the model to malfunction.

On the other hand, image size and patch size are not public API as mtmd can work with any input image sizes. The image will be resized internally.

Unless you can show the code where you use this, I don't think we should add these API. Maybe what you're doing is already possible via mtmd_image_tokens_get_n_tokens

Actually my intention is this would be the first step towards being able to bypass the clip_image_preprocess function via a configuration param. If you already have an image processing pipeline, it would be great to be able to not double process the same image data twice.

Allowing for skipping this step allows for things like hardware acceleration of image pre-processing without having to actually do any such implementation inside of mtmd itself.

@ngxson
Copy link
Collaborator

ngxson commented Oct 22, 2025

I don't want to support such use case as it will make mtmd API as bloated as clip API. The preprocessing can vary vastly among model (i.e. some needs to align to specific ratios, some needs to pad with specific color). If the user have to make a presumption before using an API, then what's the point of being a public API?

Overall, I won't merge this change as it's only specific to your use case. You can instead maintain your own fork, which gives you more control over the pipeline.

@deadprogram
Copy link
Contributor Author

If the user have to make a presumption before using an API, then what's the point of being a public API?

Sorry if I was unclear here. What I meant to say was that if the user would choose to bypass the current clip_image_preprocess function, they would have to understand how to process the image themselves. Not for all users.

@deadprogram
Copy link
Contributor Author

In any case, since this change is not acceptable to the maintainers, I will rethink my approach. Thanks for the feedback. Now closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants