Can we use a custom chat template (or no template at all) for vision fine-tuning? #1331

Open
Any-Winter-4079 opened this issue Nov 24, 2024 · 1 comment

Any-Winter-4079 commented Nov 24, 2024

I've seen in the Pixtral-12B Colab notebook that:

To format the dataset, all vision finetuning tasks should be formatted as follows:

[
{ "role": "user",
  "content": [{"type": "text",  "text": Q}, {"type": "image", "image": image} ]
},
{ "role": "assistant",
  "content": [{"type": "text",  "text": A} ]
},
]

For my use case for instance, I have a robot which I'm thinking of fine-tuning using:

  • 'Vision'
  • 'Hearing'
  • 'Short-term thoughts'
  • and 'Long-term memories' as input.
  • And the next thought as output.
[Image: Screenshot 2024-11-24 at 13 59 26]

Since the robot thinks by itself most of the time (i.e. with no user interaction; a user only occasionally interacts with it via 'Vision' and 'Hearing'), using user-assistant turns is not ideal.

I understand I could force Vision, Hearing, etc. as 'role': 'user', but that's not ideal because in reality the robot is mostly interacting with its own thoughts (and speaking to itself in the first person), so it's just confusing.
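For reference, the workaround described above (everything forced into a single 'user' turn, with the next thought as the 'assistant' turn) could be sketched roughly as follows. The sample field names (`vision`, `hearing`, `short_term_thoughts`, `long_term_memories`, `next_thought`) and the `Output:` marker are hypothetical and only mirror the inputs listed above; the message layout follows the notebook format quoted earlier.

```python
from typing import Any


def robot_sample_to_messages(sample: dict[str, Any]) -> list[dict[str, Any]]:
    """Pack all robot inputs into one 'user' turn and the next thought into 'assistant'."""
    # Hypothetical field names; adapt them to the actual dataset columns.
    text = (
        f"Hearing: {sample['hearing']}\n"
        f"Short-term thoughts: {sample['short_term_thoughts']}\n"
        f"Long-term memories: {sample['long_term_memories']}\n"
        "Output:"
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "image", "image": sample["vision"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["next_thought"]}],
        },
    ]
```

This keeps the dataset compatible with the notebook's expected format, at the cost of the role mismatch mentioned above.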

And while my use case might be unique :), I think custom templates/different prompting 'templates' are not that rare (i.e. I've seen quite a few use cases).

So, I'm wondering if custom templates (or no template at all: masking all tokens up to a certain point in the text, then fine-tuning only on the model's predictions for the remaining tokens, e.g. here everything after 'Output') are possible for vision fine-tuning in unsloth. I'm quite new to unsloth, so apologies if this is answered somewhere already!
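To make the masking idea concrete, here is a minimal sketch in plain Python, independent of unsloth. It assumes the example is already tokenized and that the token ids of the 'Output' marker are known (`marker_ids` is a hypothetical input; in practice a marker can tokenize differently depending on the surrounding text, so this would need checking). Masked positions get the label -100, the value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's CrossEntropyLoss by default


def find_last(seq: list[int], sub: list[int]) -> int:
    """Return the start index of the last occurrence of sub in seq, or -1."""
    for start in range(len(seq) - len(sub), -1, -1):
        if seq[start:start + len(sub)] == sub:
            return start
    return -1


def mask_before_marker(input_ids: list[int], marker_ids: list[int]) -> list[int]:
    """Build labels that only train on tokens after the last marker occurrence."""
    start = find_last(input_ids, marker_ids)
    if start < 0:
        # Marker not found: mask everything so the example adds no loss.
        return [IGNORE_INDEX] * len(input_ids)
    cut = start + len(marker_ids)  # first position that should be trained on
    return [IGNORE_INDEX] * cut + list(input_ids[cut:])


# Toy ids: [9, 2] stands in for the tokenized 'Output' marker.
labels = mask_before_marker([5, 8, 9, 2, 7, 7, 3], marker_ids=[9, 2])
```

Since any image placeholder tokens sit before the marker, they would be masked along with the rest of the prompt.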

(And thank you for the great work you guys do here)

@Any-Winter-4079 Any-Winter-4079 changed the title Can we use a custom chat template for vision fine-tuning? Can we use a custom chat template (or no template at all) for vision fine-tuning? Nov 24, 2024
@danielhanchen
Contributor

Yes, custom templates (or pure text) are allowed. Interesting idea as well!

I haven't provided an example yet, but Pixtral also allows multi-image inputs (unlike other models), so it might be pretty cool to see it working. For example, https://huggingface.co/mistral-community/pixtral-12b has some inference code that shows how you can prompt the model dynamically without following a fixed template.
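As a rough illustration of prompting without chat roles, the sketch below builds a raw prompt string with one image placeholder per image. The `[IMG]` placeholder, the section names, and the `Output:` marker are assumptions made for illustration; whether and how a given processor expands such placeholders should be checked against the model card's inference code.

```python
def build_raw_prompt(
    sections: dict[str, str],
    num_images: int,
    image_token: str = "[IMG]",  # assumed placeholder; verify against the model card
) -> str:
    """Render named text sections plus image placeholders into one plain prompt."""
    lines = [f"{name}: {value}" for name, value in sections.items()]
    # One placeholder per image, so several images can share a single prompt.
    lines.append("Vision: " + image_token * num_images)
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_raw_prompt(
    {"Hearing": "silence", "Short-term thoughts": "I should look around."},
    num_images=2,
)
```

The resulting string would then be passed to the processor together with the matching list of images.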
