Vision multimodal #369
hint=FieldHint.architecture,
)
# TODO: ====== Appropriate name?? ======
decoder: BlockSequenceConfig = Field(
Encoder
peft=self._peft,
)
# TODO: ====== Appropriate name?? ======
self.decoder = self._config.decoder.get_layer(
Encoder
peft=self._peft,
)
# TODO: ====== Hidden dim ======
self.adapter = self._config.adapter.get_layer(
I don't think we want to make the adapter part of the encoder, because the adapter's tensor shapes depend on the decoder. We also want to be able to mix and match existing pretrained encoders and decoders...
It's basically the same with every module: their shapes all need to match. I'm organizing the modules so they manage their internal hidden shapes, while input and output shapes are managed by the parent modules (via the hidden_dim argument), so in that case it makes sense to keep the adapter here.
The TODO refers to the MLP assuming matching input and output dimensions. That's an easy fix, but I haven't gotten to it yet.
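To make the shape ownership concrete, here is a minimal sketch of the idea, not the code in this PR. Apart from `get_layer`, `adapter`, and `hidden_dim`, every name below (`AdapterConfig`, `AdapterShapes`, `VisionEncoderConfig`, `intermediate_size`, `hidden_size`) is hypothetical, and the sketch only tracks weight shapes rather than building real layers. It assumes the adapter is a two-layer MLP whose output width is supplied by the parent, which is what lifting the "matching input and output dimensions" assumption would look like.

```python
from dataclasses import dataclass


@dataclass
class AdapterShapes:
    # Hypothetical stand-in for the adapter layer: it only records shapes.
    input_dim: int
    intermediate_size: int
    output_dim: int

    def weight_shapes(self) -> list[tuple[int, int]]:
        # Two-layer MLP: input -> intermediate -> output.
        return [
            (self.input_dim, self.intermediate_size),
            (self.intermediate_size, self.output_dim),
        ]


@dataclass
class AdapterConfig:
    # Hypothetical config: only the adapter's internal width is stored here.
    intermediate_size: int

    def get_layer(self, input_dim: int, output_dim: int) -> AdapterShapes:
        # Boundary shapes come from the caller, so they may differ.
        return AdapterShapes(input_dim, self.intermediate_size, output_dim)


@dataclass
class VisionEncoderConfig:
    # Hypothetical config: the encoder owns its internal hidden size.
    hidden_size: int
    adapter: AdapterConfig

    def get_layer(self, hidden_dim: int) -> AdapterShapes:
        # hidden_dim is passed down by the parent (the decoder width),
        # so the adapter can stay inside the encoder without hard-coding it.
        return self.adapter.get_layer(self.hidden_size, hidden_dim)


# Assumed example sizes: encoder width 1024, decoder width 2048.
config = VisionEncoderConfig(hidden_size=1024, adapter=AdapterConfig(intermediate_size=4096))
adapter = config.get_layer(hidden_dim=2048)
print(adapter.weight_shapes())  # [(1024, 4096), (4096, 2048)]
```

With this layout, swapping in a decoder of a different width only changes the `hidden_dim` passed by the parent; the encoder and adapter configs stay untouched.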
✨ Description
An attempt at integrating multimodal vision models into main. Still a lot of work to do...