
Experiment for a flat architecture? #50

Open
chokevin8 opened this issue Dec 17, 2024 · 3 comments


@chokevin8

Hi @johnnynunez @ahatamiz, thank you for your brilliant work! I just have a question: have you considered using a flat architecture rather than a hierarchical one, and would it be simple to implement by just modifying the code? I am particularly interested in using this for a self-supervised training application. Any input would be appreciated, thank you so much!

@ahatamiz
Collaborator

Hi @chokevin8! Yes, we have an internal version with a flat (or isotropic) architecture, but there are no plans to release it publicly.

The performance is comparable to (or better than) ViTs trained with even the most advanced techniques, such as DeiT III.

@chokevin8
Author

chokevin8 commented Dec 19, 2024

@ahatamiz Thanks for your quick response! That's impressive that the performance is equal to or better than SOTA ViTs. I'm aware you can't share much, but are there any hints you can drop on how to implement this? In other words, would the micro architecture have to change as well? (Obviously the macro architecture would have to change to make the backbone flat.) Lastly, have you ever tried self-supervised learning with this flat architecture? Thank you!

@ahatamiz
Collaborator

Hi @chokevin8, the implementation is quite easy. You can take the blocks used in stages 3/4 and build an isotropic model out of them without changing the resolution (simply replacing a ViT layout with this setup should work for constructing different model sizes).

But note that you need to maintain our strategy of dividing the layers: allocate the first half of the depth to MambaMixer blocks and the second half to self-attention blocks.

It should work for any type of training, including SSL.
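For concreteness, here is a minimal, self-contained PyTorch sketch of what such a flat layout could look like, based on the description above. The names `FlatModel`, `MambaMixerBlock`, and `AttentionBlock` are hypothetical, and `MambaMixerBlock` is only a placeholder token mixer; in a real implementation you would plug in the actual Mamba-based mixer block from stages 3/4 of the hierarchical model. The sketch only illustrates the layout: a single patch embedding, constant token resolution throughout, and the depth split into Mamba-mixer blocks (first half) and self-attention blocks (second half).

```python
# Sketch of a flat (isotropic) backbone. Assumed, not the official implementation.
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Standard transformer MLP."""
    def __init__(self, dim, ratio=4.0):
        super().__init__()
        hidden = int(dim * ratio)
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


class MambaMixerBlock(nn.Module):
    """Placeholder for a Mamba-based token mixer block (hypothetical stand-in).
    Swap self.mixer for the actual mixer used in stages 3/4 of the repo."""
    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = nn.Linear(dim, dim)  # placeholder token mixer
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = MLP(dim)

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x


class AttentionBlock(nn.Module):
    """Standard pre-norm self-attention block."""
    def __init__(self, dim, num_heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = MLP(dim)

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class FlatModel(nn.Module):
    """ViT-style isotropic backbone: one patch embedding, fixed resolution,
    first half of the depth Mamba-mixer blocks, second half self-attention."""
    def __init__(self, img_size=224, patch_size=16, dim=384, depth=12,
                 num_heads=6, num_classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        blocks = []
        for i in range(depth):
            if i < depth // 2:               # first half: Mamba mixer blocks
                blocks.append(MambaMixerBlock(dim))
            else:                            # second half: self-attention blocks
                blocks.append(AttentionBlock(dim, num_heads))
        self.blocks = nn.Sequential(*blocks)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)              # B, C, H/ps, W/ps
        x = x.flatten(2).transpose(1, 2)     # B, N, C tokens; resolution stays fixed
        x = self.blocks(x)
        x = self.norm(x).mean(dim=1)         # global average pooling over tokens
        return self.head(x)


if __name__ == "__main__":
    model = FlatModel()
    out = model(torch.randn(2, 3, 224, 224))
    print(out.shape)  # torch.Size([2, 1000])
```

Different model sizes would then just vary `dim`, `depth`, and `num_heads`, as with standard ViT configurations.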
