SigLIP-style loss for better DDP #3119
Hello!
For example, say you want to train with a per-device batch size of 16384 and you have 8 GPUs, where each GPU can handle 64 samples at a time before OOM-ing. Then you can use DDP with 8 processes, with CMNRL chunking the work into 64-sample mini-batches on each GPU. The total batch size will be 131072, and that many samples are collected at the start of the step. These are then split up into 8 subbatches of 16384 samples and divided across the 8 GPUs. Each GPU will process these with CMNRL as if it's the only GPU, and the only communication happens once all 8 are finished: when the gradients are averaged across all GPUs. Please let me know if I'm overlooking something!

Also, looking at your script - the inputs (i.e.

Another note: I like the idea of a dynamic scale by making it a Parameter. I experimented with this briefly, but to turn it into something "meaningful", I'd have to override the optimizer kwargs with a much higher learning rate here: sentence_transformers/trainer.py, lines 1206 to 1219 at commit 679ab5d.
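For reference, a minimal sketch of this setup with the current trainer API (the model, dataset, and output directory below are placeholders, not anything from this thread; launch with e.g. `torchrun --nproc_per_node=8 train.py` to get the 8 DDP processes):

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")  # placeholder base model
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train")  # placeholder data

# Each process collects 16384 samples per step; CMNRL chunks them into
# mini-batches of 64 that fit in memory, while the in-batch negatives still
# span all 16384 samples on that device. With 8 processes the global batch
# is 131072, and the only cross-device communication is the gradient averaging.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=64)
args = SentenceTransformerTrainingArguments(
    output_dir="cmnrl-ddp-sketch",
    per_device_train_batch_size=16384,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```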
I think your explanation is right; CMNRL + DDP will basically do plain DDP (plus accumulating gradients within each device under the hood) and won't share negatives. The motivation of the issue is to enable this sharing of negatives in a DDP setup.

Here's what I mean. SigLIP is not exactly plain DDP: it allows each anchor to efficiently see all other negatives in the global batch (see Figure 1 and Section 3.3 of SigLIP), i.e., each anchor on a device is also scored against the candidates held by every other device, not just its own. Of course, we can achieve seeing as many negatives as we want by accumulating mini-batches in CMNRL, but that accumulation happens in sequence instead of in parallel. Does this motivation sound right?

Now that I'm writing this out, though, I realized that SigLIP still has to pass embeddings between devices every step. I'm now less convinced about it / will probably happily stick with CMNRL for now :-)
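To make this concrete, here's a rough single-process simulation (my own sketch, not the SigLIP implementation) of how the sigmoid loss decomposes into independent per-chunk terms, which is what lets anchors on one device be scored against candidate chunks from every device:

```python
# Because the sigmoid loss has no softmax over the whole batch, the global loss
# is just a sum of independent per-chunk terms, so anchors on one "device" can
# be scored against candidate chunks from every "device", one chunk at a time.
import torch
import torch.nn.functional as F

def sigmoid_chunk_loss(anchors, candidates, positive_mask, t=10.0, b=-10.0):
    """SigLIP-style loss for one (anchor chunk, candidate chunk) pair.
    positive_mask[i, j] = 1 if candidates[j] is a positive for anchors[i]."""
    logits = t * anchors @ candidates.T + b
    labels = 2 * positive_mask - 1          # +1 for positives, -1 for negatives
    return -F.logsigmoid(labels * logits).sum()

# Pretend we have 4 devices, each holding 8 (anchor, positive) embedding pairs.
torch.manual_seed(0)
device_chunks = [
    (F.normalize(torch.randn(8, 16), dim=-1), F.normalize(torch.randn(8, 16), dim=-1))
    for _ in range(4)
]

total = torch.tensor(0.0)
for i, (anchors, _) in enumerate(device_chunks):         # anchors stay put
    for j, (_, candidates) in enumerate(device_chunks):  # candidate chunks rotate
        mask = torch.eye(8) if i == j else torch.zeros(8, 8)
        total = total + sigmoid_chunk_loss(anchors, candidates, mask)
print(total / (4 * 8))   # every anchor has seen all 4 * 8 candidates
```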
The script uses
Good to know! That's something I missed in my experiment; the
The motivation is indeed sound: sharing negatives allows for bigger batches, but I think it does not necessarily make sense to use cross-device negatives if we can also arbitrarily increase the batch size per device. All it really does is introduce cross-device communication, right?
Oops, I totally missed that. I was indeed stuck on the cross-device framing.

I think there is merit in providing more elaboration in the documentation somewhere (either the Distributed Training section or the CMNRL API reference) about really large batch size cases. I see a lot of papers use cross-device batches to try and increase their batch size, not realising that the same effective batch size can be reached on a single device. So researchers seem to look for cross-device solutions in Sentence Transformers too, not realising that we purposefully don't support it (as I think CMNRL is an equivalent or superior option).
It does, but SigLIP's cross-device communication is relatively efficient. SigLIP shifts embeddings from each device to its neighbor ("collective permute" in the paper) instead of doing all-gathers. I believe this shift has to happen once for every other device in the group, rather than as one big gather.

Overall, I think my concerns are mostly washed away. Thank you for the discussion!
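As a rough sketch of what that neighbor shift could look like in PyTorch (my own sketch, assuming an already initialised process group; not SigLIP's actual implementation, which also handles gradient flow through the permute):

```python
# Each rank sends its candidate embeddings to rank + 1 and receives the chunk
# from rank - 1, repeated world_size - 1 times, instead of one all-gather.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def ring_shift(tensor: torch.Tensor) -> torch.Tensor:
    """Send `tensor` to the next rank and receive the previous rank's tensor."""
    rank, world = dist.get_rank(), dist.get_world_size()
    recv_buf = torch.empty_like(tensor)
    ops = [
        dist.P2POp(dist.isend, tensor.contiguous(), (rank + 1) % world),
        dist.P2POp(dist.irecv, recv_buf, (rank - 1) % world),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_buf

def chunk_loss(anchors, candidates, diagonal_positive, t=10.0, b=-10.0):
    logits = t * anchors @ candidates.T + b
    if diagonal_positive:
        labels = 2 * torch.eye(len(anchors), device=anchors.device) - 1
    else:
        labels = -torch.ones(len(anchors), len(candidates), device=anchors.device)
    return -F.logsigmoid(labels * logits).sum()

def siglip_style_ddp_loss(anchors, candidates):
    # Local chunk first: the diagonal pairs are the positives.
    loss = chunk_loss(anchors, candidates, diagonal_positive=True)
    shifted = candidates
    for _ in range(dist.get_world_size() - 1):   # world_size - 1 shifts in total
        shifted = ring_shift(shifted)
        # Everything shifted in from other ranks is a negative for local anchors.
        loss = loss + chunk_loss(anchors, shifted, diagonal_positive=False)
    # Note: this forward-only sketch ignores how gradients flow back through the
    # send/recv; a real implementation needs a differentiable permute.
    return loss
```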
Another application of the sigmoid/multi-label loss is that it seamlessly supports multi-positive and multi-negative data. (Pretty sure both positives and negatives can come in variable numbers.) I'm not yet certain about the benefits of this over unfolding the positives and feeding them to MNRL, but it seems like a potential use case. Maybe it gets higher training throughput. What do you think?

Update: I'm investigating this in https://github.com/kddubey/mpnrl
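For illustration, a rough sketch (my own, assuming a precomputed 0/1 relevance matrix; none of this is an existing Sentence Transformers API) of how a sigmoid loss handles a variable number of positives per anchor:

```python
import torch
import torch.nn.functional as F

def multilabel_sigmoid_loss(anchor_emb, cand_emb, relevance, scale=20.0, bias=0.0):
    """relevance[i, j] = 1 if candidate j is a positive for anchor i, else 0.
    Each (anchor, candidate) pair is an independent binary classification,
    so any number of positives/negatives per anchor is fine."""
    logits = scale * anchor_emb @ cand_emb.T + bias
    return F.binary_cross_entropy_with_logits(logits, relevance)

# Anchor 0 has two positives (candidates 0 and 2); anchor 1 has one (candidate 1).
anchors = F.normalize(torch.randn(2, 16), dim=-1)
cands = F.normalize(torch.randn(3, 16), dim=-1)
relevance = torch.tensor([[1., 0., 1.],
                          [0., 1., 0.]])
print(multilabel_sigmoid_loss(anchors, cands, relevance))
```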
Hello,
SigLIP demonstrates that formulating contrastive learning as a collection of independent classifications works well.
One benefit is that the loss easily allows training with more negatives—easily b/c every summand in the loss is independent of the rest. There's no softmax/normalization like there is in MNRL.
CachedMultipleNegativesRankingLoss already enables large batch sizes for single-device training. But for DDP, IIUC, any implementation of MNRL requires more communication overhead than SigLIP b/c of the softmax.

The SigLIP paper studies image-text data. To see if their sigmoid-style loss works for text data, I ran a tiny experiment here. It demonstrates that the performance is on par with MNRL on STS. (The notebook doesn't implement the actual, distributed SigLIP training scheme; it just checks that sigmoid instead of softmax works well for STS.)
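For reference, a minimal sketch of the difference between the two objectives on the same in-batch score matrix (my own illustration, not the notebook's code):

```python
import torch
import torch.nn.functional as F

def mnrl_loss(scores):
    # Softmax over each row: every anchor's loss depends on *all* candidates,
    # which is what forces cross-device gathers in a distributed setting.
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores, labels)

def siglip_style_loss(scores, bias=-10.0):
    # Each (i, j) cell is an independent binary classification, so the total
    # loss is a plain sum of per-pair terms with no batch-wide normalization.
    labels = 2 * torch.eye(scores.size(0)) - 1   # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * (scores + bias)).sum() / scores.size(0)

scores = 20.0 * F.normalize(torch.randn(8, 16), dim=-1) @ F.normalize(torch.randn(8, 16), dim=-1).T
print(mnrl_loss(scores), siglip_style_loss(scores))
```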
I'm wondering if you or others are interested in incorporating a SigLIP-style training scheme into SentenceTransformers.
Thanks!