To format the dataset, all vision fine-tuning tasks should be structured as follows:

```python
[
    {
        "role": "user",
        "content": [{"type": "text", "text": Q}, {"type": "image", "image": image}],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": A}],
    },
]
```
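As a concrete sketch, a single training sample in that format could be assembled like this (the `make_sample` helper and its arguments are hypothetical, just for illustration; in practice `image` would be e.g. a PIL image rather than a string):

```python
# Hypothetical helper: build one conversation in the user/assistant
# vision fine-tuning format shown above.
def make_sample(question, image, answer):
    """Return one conversation: a user turn with text + image, then an
    assistant turn with the text answer."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image", "image": image},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": answer}],
        },
    ]

# Stand-in string where a real PIL image would normally go.
sample = make_sample("What is in this picture?", "<PIL.Image here>", "A robot.")
```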
For my use case, for instance, I have a robot that I'm thinking of fine-tuning using:
- 'Vision'
- 'Hearing'
- 'Short-term thoughts'
- 'Long-term memories'

as input, and the next thought as output.
Since most of the time the robot thinks by itself (i.e. there is no user interaction; a user only occasionally interacts with it via 'Vision' and 'Hearing'), using user-assistant turns is not ideal.
I understand I could force Vision, Hearing, etc. under 'role': 'user', but that's not ideal either, because in reality the robot is mostly interacting with its own thoughts (and speaking to itself in the first person), so it's just confusing.
And while my use case might be unique :), I think custom templates / different prompting formats are not; I've seen quite a few similar use cases.
So, I'm wondering whether custom templates (or no template at all: masking all tokens up to a certain point in the text and then fine-tuning on the model's predictions for the last tokens, e.g. here everything after 'Output') are possible for vision fine-tuning in Unsloth. I'm quite new to Unsloth, so apologies if this is answered somewhere already!
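The masking idea can be sketched independently of any library: tokenize the full text, then set the labels for everything up to the 'Output' marker to an ignore index (`-100` is the convention PyTorch's cross-entropy loss skips), so the loss is only computed on the final tokens. A minimal illustration with a toy whitespace "tokenizer" (the marker string and stand-in token ids are assumptions for the example):

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss

def mask_labels(token_ids, tokens, marker="Output:"):
    """Copy token_ids into labels, masking everything up to and including
    the marker so only the completion contributes to the loss."""
    labels = list(token_ids)
    cut = tokens.index(marker) + 1  # position just after the marker
    for i in range(cut):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: whitespace tokenization, sequential stand-in ids.
text = "Vision: empty room Hearing: silence Output: I should recharge"
tokens = text.split()
token_ids = list(range(len(tokens)))
labels = mask_labels(token_ids, tokens)
# Everything before and including "Output:" is now -100; only the
# completion ("I should recharge") keeps its ids and is trained on.
```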
(And thank you for the great work you guys do here)
Any-Winter-4079 changed the title from "Can we use a custom chat template for vision fine-tuning?" to "Can we use a custom chat template (or no template at all) for vision fine-tuning?" on Nov 24, 2024.
Yes, custom templates are allowed (or pure text) - interesting idea as well!
I haven't provided an example yet, but Pixtral also allows multi-image inputs (unlike other models), so it would be pretty cool to see it working. For example, https://huggingface.co/mistral-community/pixtral-12b has some inference code that shows how you can prompt the model dynamically without following a fixed template.
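Prompting without a fixed template can then be as simple as concatenating the channels yourself before tokenization. A hypothetical formatter for the robot's inputs (the section names and function are made up for illustration, not part of any library API):

```python
def format_robot_prompt(vision, hearing, thoughts, memories):
    """Assemble the robot's input channels into one free-form prompt
    ending at the 'Output:' marker, after which the model completes."""
    return (
        f"Vision: {vision}\n"
        f"Hearing: {hearing}\n"
        f"Short-term thoughts: {thoughts}\n"
        f"Long-term memories: {memories}\n"
        "Output:"
    )

prompt = format_robot_prompt(
    "a door", "footsteps", "someone is coming", "doors open inward"
)
```

Everything up to and including the trailing `Output:` would be masked as described above, so the model is only trained on the thought it generates after it.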