Handle vision language model chat templates #7

Open
DePasqualeOrg opened this issue Dec 11, 2024 · 2 comments

@DePasqualeOrg (Collaborator):

Now that the MLX libraries support vision models (thanks to @davidkoski's Herculean efforts), we should try to support multimodal chat templates. I'll list some models here for reference.

These models have no chat template in tokenizer_config.json:

  • Paligemma
  • Pixtral
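For models whose tokenizer_config.json ships without a chat_template entry, one possible fallback (a hypothetical sketch, not the library's actual API; the helper name is invented for illustration) is to check the config and substitute a caller-supplied template:

```python
import json


def resolve_chat_template(tokenizer_config: dict, fallback: str) -> str:
    """Return the chat template from tokenizer_config.json if present,
    otherwise fall back to a caller-supplied template string."""
    template = tokenizer_config.get("chat_template")
    return template if template else fallback


# Example: a config (like Paligemma's or Pixtral's) with no "chat_template" key.
config = json.loads('{"tokenizer_class": "SomeTokenizer"}')
template = resolve_chat_template(config, fallback="<hand-written default template>")
```

The same idea would apply on the Swift side: treat the template as optional when parsing the config and let the caller inject a known-good template for these models.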
@johnmai-dev (Owner):

Thank you for your suggestion. I will follow up on this issue next month, as I've been a bit busy recently.

@davidkoski:

For reference, here is the code from mlx-vlm that generates the messages:

and the matching code in the Swift VLM for Qwen2-VL:

The Qwen2-VL chat template expects structured content like this:

```python
[
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
        ],
    }
]
```

rather than a flat `[[String: String]]`.
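To make the shape concrete, here is a minimal sketch of how such a structured message could be assembled (the helper name is hypothetical; it only mirrors the message shape shown above, one text part followed by one image placeholder per attached image):

```python
def make_user_message(text: str, image_count: int) -> dict:
    """Build a Qwen2-VL-style structured chat message: a list of typed
    content parts instead of a flat role/content string pair."""
    content = [{"type": "text", "text": text}]
    # One {"type": "image"} placeholder per image; the template expands
    # these into the model's image tokens.
    content += [{"type": "image"} for _ in range(image_count)]
    return {"role": "user", "content": content}


messages = [make_user_message("What are these?", image_count=3)]
```

On the Swift side this suggests the message type cannot stay `[[String: String]]`; the content value needs to be an array of typed parts (e.g. an enum or nested dictionaries) so the template can iterate over text and image entries.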
