CoCa: fix MultimodalTransformer init + Mask CLS token at end of seq #551
Conversation
@gpucce thoughts? Is there some way init_parameters could've somehow been called?
No, I think it used the default ones. I think the VisionTransformer doesn't call it either? I mean, it calls it, but it does nothing.
But the text transformer does call it, and what also confuses me is how text_projection works, since it's initialized with torch.empty (open_clip/src/open_clip/transformer.py, line 674 at fb72f4d).
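(For anyone reading along, a minimal hypothetical snippet of why a bare torch.empty parameter is a problem; the names mirror the projection discussed above, but this is not the open_clip source.)

```python
import torch
from torch import nn

width, output_dim = 512, 512

# torch.empty allocates uninitialized memory, so the projection holds
# arbitrary values until something explicitly initializes it.
text_projection = nn.Parameter(torch.empty(width, output_dim))

# A typical fix, mirroring the text tower's scaled-normal init:
nn.init.normal_(text_projection, std=width ** -0.5)
```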
First fix the CI, then see my comment in #550.
…c/open_clip into fix_multimodal_transformer
@iejMac I added one more change that should make this ready for the tentative retraining.
Here is the run for this PR: https://wandb.ai//iejmac/open-clip/reports/CoCa-v2-fixes-comparison--Vmlldzo0NzY0ODIy
@rom1504 thoughts on this? These changes don't break current checkpoints, fix one issue, and actually initialize the MultimodalTransformer. I can try to start a run on Stability.
@rwightman @rom1504 @iejMac hi, I worked on this PR. As it is, it has a few changes in tests, adds transformers compat, and fixes the issues. This is the best working initialization I have found, checking only up to 18 epochs. If you have some time to look at it, that would be useful. @JeniaJitsev I used a bit of laionize for this, but hopefully it will have a positive effect on mammut too. This is the report of the first 18 epochs compared to the old CoCa run: https://wandb.ai/gpucce/coca_tests/reports/CoCa-V2-ViT-B-32---Vmlldzo1MDc4NTkz
@gpucce discussing here so I can possibly combine this with the #660 checks. This was days before my second child was born, so yeah, it got lost in the stack, but I did take a peek (and subsequently forgot).

The cls mask change: what does it do to existing trained CoCa models?

The weight init was commented out in the ViT because that's how OpenAI left it for the vision tower (and I thought we might try a different init some day); it relied on default PyTorch init. But there they used randn for all Parameter attributes. The empty is indeed a WTF.

I feel following the text encoder init is a better default than going fully with PyTorch defaults for the multimodal decoder though, right? Any reason why you wanted to comment it all out? Could we just add the call to init_parameters() and tweak the projection init if zeros is more desirable?
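(For reference, a hedged sketch of what "follow the text encoder init, with an optional zero projection" could look like for the multimodal decoder; the function name is illustrative and the attribute names follow open_clip conventions, but this is not the PR's exact diff.)

```python
import torch
from torch import nn

def init_multimodal_decoder(decoder: nn.Module, zero_init_projection: bool = False) -> None:
    """Illustrative text-encoder-style init for a multimodal decoder.

    Assumes open_clip-style attributes: decoder.width, decoder.layers,
    decoder.resblocks (each with .attn and .mlp.c_fc / .mlp.c_proj),
    and decoder.text_projection.
    """
    proj_std = (decoder.width ** -0.5) * ((2 * decoder.layers) ** -0.5)
    attn_std = decoder.width ** -0.5
    fc_std = (2 * decoder.width) ** -0.5
    for block in decoder.resblocks:
        nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
        nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
        nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
        nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)
    if decoder.text_projection is not None:
        if zero_init_projection:
            nn.init.zeros_(decoder.text_projection)
        else:
            nn.init.normal_(decoder.text_projection, std=decoder.width ** -0.5)
```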
The cls mask change appears not to affect the performance of existing models in either zero-shot classification or captioning. The .empty was my mistake from the final refactoring: I had an nn.Linear in the CoCa model and lost it while moving it into transformer.py. About the init, I was going along with vision, but indeed doing the same as for text could be better; I couldn't run more experiments, and even for the zero init it could be that a small enough randn is slightly better. If you decide on text init plus zeros, then I can run B-32 for some epochs as a perf test on your PR branch, so it works as a small sanity check of the PR's effect on the model.
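(For context, a rough hypothetical sketch of the "mask the CLS token at the end of the sequence" idea; build_cls_mask and pad_id here are illustrative, not the PR's diff. Padding positions are blocked while the CLS slot appended at the end stays attendable.)

```python
import torch

def build_cls_mask(text: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # text: (batch, seq_len) token ids; a learned CLS token is assumed to be
    # appended at the end, so the additive mask is (batch, L + 1, L + 1).
    keep = text != pad_id                                      # (B, L), True where not padding
    cls_keep = torch.ones(keep.shape[0], 1, dtype=torch.bool)  # CLS slot is always attendable
    keep = torch.cat([keep, cls_keep], dim=1)                  # (B, L + 1)
    additive = torch.zeros(keep.shape[0], keep.shape[1], keep.shape[1])
    additive.masked_fill_(~keep.unsqueeze(1), float("-inf"))   # block attention to pad keys
    return additive
```

In the forward pass this additive mask would typically be combined with the causal attention mask before self-attention, which is where masking the CLS position at the start versus the end of the sequence makes a difference.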