Handle vision language model chat templates #7

Open
DePasqualeOrg opened this issue Dec 11, 2024 · 2 comments

@DePasqualeOrg (Collaborator):

Now that the MLX libraries support vision models (thanks to @davidkoski's Herculean efforts), we should try to support multimodal chat templates. I'll list some models here for reference.

These models have no chat template in tokenizer_config.json:

  • Paligemma
  • Pixtral
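For models whose tokenizer_config.json ships without a chat_template entry, one possible fallback (a hypothetical sketch, not the library's actual API; the helper name is invented for illustration) is to check the config and substitute a caller-supplied template:

```python
import json


def resolve_chat_template(tokenizer_config: dict, fallback: str) -> str:
    """Return the chat template from tokenizer_config.json if present,
    otherwise fall back to a caller-supplied template string."""
    template = tokenizer_config.get("chat_template")
    return template if template else fallback


# Example: a config (like Paligemma's or Pixtral's) with no "chat_template" key.
config = json.loads('{"tokenizer_class": "SomeTokenizer"}')
template = resolve_chat_template(config, fallback="<hand-written default template>")
```

The same idea would apply on the Swift side: treat the template as optional when parsing the config and let the caller inject a known-good template for these models.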
@johnmai-dev (Owner):

Thank you for your suggestion. I will follow up on this issue next month, as I've been a bit busy recently.

@davidkoski:

For reference, here is the code from mlx-vlm that generates the messages:

and the matching code in the Swift VLM for Qwen2-VL:

The Qwen2-VL chat template expects structured content like this:

```python
[
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
        ],
    }
]
```

rather than a flat `[[String: String]]`.
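To make the shape concrete, here is a minimal sketch of how such a structured message could be assembled (the helper name is hypothetical; it only mirrors the message shape shown above, one text part followed by one image placeholder per attached image):

```python
def make_user_message(text: str, image_count: int) -> dict:
    """Build a Qwen2-VL-style structured chat message: a list of typed
    content parts instead of a flat role/content string pair."""
    content = [{"type": "text", "text": text}]
    # One {"type": "image"} placeholder per image; the template expands
    # these into the model's image tokens.
    content += [{"type": "image"} for _ in range(image_count)]
    return {"role": "user", "content": content}


messages = [make_user_message("What are these?", image_count=3)]
```

On the Swift side this suggests the message type cannot stay `[[String: String]]`; the content value needs to be an array of typed parts (e.g. an enum or nested dictionaries) so the template can iterate over text and image entries.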
