Support for multi-modal models #662
Comments
Here is what the transformers interface looks like:
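(For reference, a minimal sketch of the `transformers` multi-modal interface for a LLaVA-style model; the model id, image URL, and prompt template below are illustrative and not taken from the original comment.)

```python
# Minimal sketch of the transformers multi-modal (LLaVA) interface.
# Model id, prompt template, and image URL are only illustrative.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

# The processor bundles text tokenization and image preprocessing.
inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```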
I agree regarding the complexity the generator would need to manage. Here is what I propose:
Would love to know your thoughts!
Any updates on this thread?
Hey! I just wanted to know if multimodal models can be used with the connector being implemented in issue #728.
Yes, you should be able to use this with multimodal models!
For anyone who stumbles on this later, check out the relevant cookbook for working with vision models here.
Presentation of the new feature
There are more and more accessible multi-modal models out there, such as llava, and constrained generation applies to every auto-regressive text generation model regardless of its input.
Where does it fit in Outlines?
Maybe the most reasonable way would be to let users pass tuples `(prompt, image)` to the API functions and use `multipledispatch` to dispatch both on `model` and `prompt` (see the sketch below). Or create a new `MultimodalModel` class and only dispatch on the model type like we currently do.
We need to make sure users can't unknowingly shoot themselves in the foot; the `MultimodalModel` class would make this easy.
My main concern is that we might need to make `generator` more complex, or duplicate part of it.
Are you willing to open a PR?
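For illustration, here is a minimal sketch of the tuple-dispatch option, assuming hypothetical `TextModel` and `MultimodalModel` wrapper classes; this is not the actual Outlines API.

```python
# Rough sketch of dispatching on both the model and the prompt type.
# Class names and return values are hypothetical stand-ins.
from multipledispatch import dispatch


class TextModel:
    """Stand-in for the current text-only model wrappers."""


class MultimodalModel:
    """Stand-in for a llava-style multi-modal model wrapper."""


@dispatch(TextModel, str)
def generate(model, prompt):
    # Text-only path: the prompt is a plain string.
    return f"constrained text generation for {prompt!r}"


@dispatch(MultimodalModel, tuple)
def generate(model, prompt_and_image):
    # Multi-modal path: the user passes a (prompt, image) tuple; the image
    # would be handed to the model's processor before constrained decoding.
    prompt, image = prompt_and_image
    return f"constrained multi-modal generation for {prompt!r} with {image!r}"


print(generate(TextModel(), "Describe the scene."))
print(generate(MultimodalModel(), ("Describe the scene.", "<PIL image>")))
```

The alternative mentioned above, dispatching only on a dedicated `MultimodalModel` class, would look the same except that the prompt argument would not need the `tuple` constraint.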
Yes, although I'd appreciate it if someone else were willing to take the lead. Happy to help with the design.