Can we use a custom chat template (or no template at all) for vision fine-tuning? #1331

Any-Winter-4079 · 2024-11-24T13:28:18Z

I've seen on the Pixtral-12B Colab notebook that:

To format the dataset, all vision finetuning tasks should be formatted as follows:

[
{ "role": "user",
  "content": [{"type": "text",  "text": Q}, {"type": "image", "image": image} ]
},
{ "role": "assistant",
  "content": [{"type": "text",  "text": A} ]
},
]

In my use case, for instance, I have a robot that I'm thinking of fine-tuning with the following as input:

'Vision'
'Hearing'
'Short-term thoughts'
'Long-term memories'

and the robot's next thought as output.

Since the robot thinks by itself most of the time (i.e. there is no user interaction; a user only occasionally interacts via 'Vision' and 'Hearing'), using user-assistant turns is not ideal.

I understand I could force Vision, Hearing, etc. into 'role': 'user', but that's not ideal: in reality the robot is mostly interacting with its own thoughts (and speaking to itself in the first person), so the user/assistant framing is just confusing.
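For reference, the "force everything into the user turn" workaround could look roughly like the sketch below. It is only an illustration: `build_robot_sample`, the section labels and the trailing "Output:" marker are hypothetical names chosen here, not something taken from the notebook or from unsloth. Only the role/content layout follows the format quoted above.

```python
def build_robot_sample(image, hearing, short_term, long_term, next_thought):
    """Pack the robot's inputs into one user turn and the next thought
    into the assistant turn (hypothetical helper, for illustration only)."""
    # Section labels and the "Output:" marker are assumptions of this sketch.
    prompt = (
        f"Hearing: {hearing}\n"
        f"Short-term thoughts: {short_term}\n"
        f"Long-term memories: {long_term}\n"
        "Output:"
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image", "image": image},  # 'Vision' input
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": next_thought}],
        },
    ]


# Usage with a placeholder instead of a real image object.
sample = build_robot_sample(
    image="<camera frame>",
    hearing="(silence)",
    short_term="I should check the door.",
    long_term="The door is usually locked at night.",
    next_thought="I will walk to the door and look at the lock.",
)
```

This works mechanically, but it still labels the robot's own thoughts as coming from a 'user', which is exactly the mismatch described above.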

And while my use case might be unique :), I don't think the need for custom templates or different prompting 'templates' is (I've seen quite a few such use cases).

So, I'm wondering whether custom templates are possible for vision fine-tuning in unsloth, or even no template at all: masking all tokens up to a certain point in the text, then fine-tuning only on the model's predictions for the remaining tokens (e.g. here, everything after 'Output'). I'm quite new to unsloth, so apologies if this is answered somewhere already!
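To make the "no template" idea concrete, here is a minimal sketch of the masking step, assuming the labels are a copy of the input token ids and that a label of -100 is skipped by the loss (the default `ignore_index` of PyTorch's cross-entropy loss). The helper names and the token ids in the usage example are made up for illustration; this is not an unsloth API, and a real tokenizer may encode the 'Output' marker differently depending on the surrounding text.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def find_last_subsequence(sequence, pattern):
    """Return the start index of the last occurrence of pattern, or -1."""
    n, m = len(sequence), len(pattern)
    if m == 0:
        return -1
    for start in range(n - m, -1, -1):
        if sequence[start:start + m] == pattern:
            return start
    return -1


def mask_prompt_labels(input_ids, marker_ids):
    """Copy input_ids into labels and mask everything up to and including
    the marker, so only the tokens after it contribute to the loss."""
    labels = list(input_ids)
    start = find_last_subsequence(labels, list(marker_ids))
    if start == -1:
        # Marker not found: mask the whole sample rather than train on the prompt.
        return [IGNORE_INDEX] * len(labels)
    cutoff = start + len(marker_ids)
    labels[:cutoff] = [IGNORE_INDEX] * cutoff
    return labels


# Usage with made-up token ids: [7, 8] stands in for the tokenized 'Output' marker.
labels = mask_prompt_labels([5, 6, 7, 8, 9, 10], marker_ids=[7, 8])
print(labels)  # [-100, -100, -100, -100, 9, 10]
```

Any image placeholder tokens that come before the marker would be masked along with the rest of the prompt, so the loss would only cover the 'next thought' tokens.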

(And thank you for the great work you guys do here)

danielhanchen · 2024-11-25T11:03:08Z

Yes, custom templates are allowed (or pure text). Interesting idea as well!

I haven't provided an example yet, but Pixtral also allows multi-image inputs (unlike other models), so it would be pretty cool to see it working. For example, https://huggingface.co/mistral-community/pixtral-12b has some inference code that shows how you can prompt the model dynamically without following a fixed template.

Any-Winter-4079 changed the title ~~Can we use a custom chat template for vision fine-tuning?~~ Can we use a custom chat template (or no template at all) for vision fine-tuning? Nov 24, 2024
