Question about training with BatchHardTripletLoss #636
Hi @tide90 BatchSemiHardTripletLoss is actually a bit more complicated: This is then optimized so that (anchor, positive) are close and (anchor, negative) are far apart in vector space.
Could you be more specific? I only know the case of training a siamese network (shared weights), like in TensorFlow. I do not see how training happens if you compute two embeddings but feed only one input. I understand the BatchHard loss better, but I do not know what the underlying model architecture is. This loss expects a single input and computes different pairs per batch on the fly. That is understandable. Although you obviously have the same "model training" usage for different types of inputs, as with Quora vs. bi-encoder (using 2 vs. 1 input). So in TensorFlow you would assume having different model architectures. EDIT: originally I always had siamese networks with contrastive loss in mind. And you use this image in your paper as well. So maybe my mental image of a siamese network is wrong?
Yes, the image does not match for the BatchHard loss. For a nice write-up on triplet loss and batch hard loss:
Sorry, but maybe it is too trivial. I feel you don't go into my points above.
Is training with BatchHardTripletLoss as simple as just using the example below from the docs (also shown in the first post above)? I just need to feed one sentence and its corresponding class label? Since it automatically computes the triplets within each batch in a more intelligent way, I do not have to take care of grouping them on my own like with **MultipleNegativesRankingLoss**? #641
An example is here: You should ensure that each batch has at least two examples for every included class label. This can be realized, again, via a PyTorch data sampler that constructs your batches with the needed properties. Such a sampler is implemented here:
@nreimers Thanks! I think this is exactly what was recently asked in other issues about MultipleNegativesRankingLoss. And this sampler is a good starting point for this loss? And this code example has been out there since Aug 04. In the BatchHard example: why do you use a triplet generator there? It seems to be only for the dev/test cases. But training is without giving triplets? This goes in the same direction as the issue author. I also feel a little confused about this, because from
@datistiquo The only differences are how the losses are computed. And how the loss is computed depends on your available training data and the properties it has. So based on what labeled data you have, you have to choose the right loss. BatchHard generates the triplets online (as described in the above blog post). So there is no need to generate triplets yourself; the loss will look into the batch and create all possible triplets from it. For evaluation, however, we want to see how well it works on specific triplets. So we create some fixed triplets and evaluate the model on them.
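For illustration, here is a rough NumPy sketch of the batch-hard mining step the blog post describes (hardest positive = farthest same-label example, hardest negative = closest different-label example, per anchor). This is a simplified stand-in, not the library's implementation, and the function name is made up:

```python
import numpy as np

def batch_hard_triplet_margins(embeddings, labels, margin=1.0):
    """Compute the mean batch-hard triplet loss over one batch of
    embeddings, mining triplets online from the labels alone."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # pairwise euclidean distance matrix for the batch
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    same = labels[:, None] == labels[None, :]

    losses = []
    for a in range(len(labels)):
        pos_mask = same[a].copy()
        pos_mask[a] = False          # an anchor is not its own positive
        neg_mask = ~same[a]
        if not pos_mask.any() or not neg_mask.any():
            continue                 # anchor needs a positive and a negative
        hardest_pos = dists[a][pos_mask].max()   # farthest same-label example
        hardest_neg = dists[a][neg_mask].min()   # closest other-label example
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return float(np.mean(losses))
```

Note that each anchor needs at least one other example with its label in the batch, which is exactly why the batch construction discussed above matters.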
Hey @nreimers, in the batch-hard example you use SentenceLabelDataset but also showed this LabelSampler. Are both the same? I thought the triplets were computed within the loss function and not via the dataset? Also, I assume that if I comment out the loss used in this example and use one of the other (commented-out below) losses, no changes would be needed?
In the SentenceLabelDataset you have in the getitem
@nreimers. I want to get back to the issue with training, as also mentioned by @datistiquo. In
I cannot find out why and how the model training depends on the kind of training data. When using different losses, I have a kind of siamese or triplet or just a single network pass (for the BatchTripletLosses). Where is this happening? I think this is also mentioned in the second-to-last post about the input depending on the various losses but having the same syntax for training. I would also love to know:
Is it correct that the dataloader uses the SentenceLabelDataset object to generate the batches on the fly while training? So is this a preselection to have only some triplets in each batch before calculating the actual triplets? Thanks!
@tide90 The relevant part is in the loss functions. They define how the loss is computed and which sentences are compared. The SentenceLabelDataset and the LabelSampler are rather old, outdated and complicated scripts.

In the BatchHard example, provide_positive and provide_negative are both False. In that case, SentenceLabelDataset just returns the InputExample without any changes. So it is identical to a list of InputExample and returning an element at a specific position. But the SentenceLabelDataset sorts the elements so that examples with the same label are right next to each other, e.g.:

The LabelSampler then ensures that each batch has multiple samples with the same label. If an example with e.g. label 1 is picked, it will pick other examples also with label 1. The implementation is rather complicated and stems from a very old version of sentence-transformers. A more efficient implementation would be possible (especially with the upcoming version 0.4.1).

For more details on datasets, dataloaders and data samplers, I highly recommend having a look at this article that explains the fundamentals:
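The sorting step above can be illustrated with a few lines of plain Python (the example sentences and labels here are made up, and this is only a sketch of the idea, not the dataset's code):

```python
from itertools import groupby

# hypothetical (text, label) training examples
examples = [
    ("How do I reset my password?", 1),
    ("What is the refund policy?", 2),
    ("Password reset link not working", 1),
    ("Can I get my money back?", 2),
]

# sort so that examples with the same label sit right next to each other,
# which is what the old SentenceLabelDataset did internally
examples.sort(key=lambda ex: ex[1])

for label, group in groupby(examples, key=lambda ex: ex[1]):
    print(label, [text for text, _ in group])
```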
I understand a Sampler as something which takes the batch (from the dataloader) and yields one example sequentially, with some chosen operations (like randomly picking an example). I got the impression from the SentenceTransformers framework that everything is implemented (otherwise there wouldn't be all these losses). So this framework needs to be adapted with my own dataloaders etc.? But I would assume there are standard examples for each loss (dataset, dataloaders) to be able to use them properly. A basic example like the one above is good, but it now makes it difficult to apply the other batch-hard losses because of switching provide_negative or provide_positive (a comment in the code on when to do what would be good). A comment or hint especially for the MultipleNegativesRankingLoss should be there. I still don't understand why you have this:
The loss takes as input just the examples, and calculates the triplets.
Did you interchange the words SentenceLabelDataset and LabelSampler?
So, using this loss I need to use this Sampler? But this Sampler is not used in the dataset? Very confusing.
Actually, I thought that using one of these BatchTripletLosses I would not need any advanced dataloader/dataset, because the loss calculates everything that is needed. The only thing I maybe need is to ensure at least two positive examples in each batch?
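That matches the idea of online triplet generation: from a batch of plain (sentence, label) examples, valid triplets can be enumerated from the labels alone, which is also why at least two examples per label are needed. A toy sketch (hypothetical helper, not library code):

```python
from itertools import permutations

def all_valid_triplets(labels):
    """Enumerate every (anchor, positive, negative) index triple a
    batch-triplet loss could form from a batch of labels alone."""
    triplets = []
    for a, p in permutations(range(len(labels)), 2):
        if labels[a] != labels[p]:
            continue  # positive must share the anchor's label
        for n in range(len(labels)):
            if labels[n] != labels[a]:
                triplets.append((a, p, n))  # negative has a different label
    return triplets

# a batch with two label-0 examples and one label-1 example yields 2 triplets
print(all_valid_triplets([0, 0, 1]))

# a batch where every label occurs only once yields no triplets at all
print(all_valid_triplets([0, 1, 2]))
```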
@tide90 The SentenceLabelDataset has been substantially reworked: And the example was also updated: If you use that dataset, it is ensured that each batch has at least two examples from each label class. There is no longer a need for a sampler, and it can quite easily be done using the dataset class from torch.
@nreimers Thank you. I am testing it right now. So, you train all these batch-hard losses with single examples and labels, like normal text classification. Can I use such a trained model to do semantic matching via the sentence encodings and cosine similarity (see below)? I have the IR evaluator (like here), where I compute the similarity between two sentences. If I train with this loss and the euclidean distance, should I also use the euclidean distance on the sentence embeddings (for the similarity between two sentences)? I guess that is somehow important, right? In your IR evaluator you use cosine similarity by default. So you should know this before combining the batch-hard losses with the IR evaluator. Also, the question is whether cosine similarity is even suitable if you trained via cosine distance. The loss and its distance function shape the embeddings during training, so you should use the same distance at evaluation for the embeddings to yield a meaningful similarity. Do you understand what I mean?
@nreimers. I would assume that you should use the same distance metric as in your loss, since the vectors are trained via this metric. So using this metric in your evaluation/tests is better. What do you think about evaluating embeddings using cosine similarity even though the loss uses cosine distance? I think the cosine similarity score should capture the same information as the cosine distance, as it is just a simple transformation.
@nreimers Was my question clear, or is it too trivial? :-) I think intuitively you need the same metric for judging the semantics of an embedding as was used during training in the loss...
Cosine distance and cosine similarity are basically the same. If vectors have limited length, e.g. all vector lengths are below a threshold, then cosine similarity and euclidean distance are more or less the same (up to some factors).
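This relationship can be checked numerically: cosine distance is just 1 − cosine similarity, and on unit-normalized vectors the squared euclidean distance equals 2 − 2·cosine similarity (the vectors below are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

# cosine similarity and cosine distance differ only by "1 - ..."
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dist = 1.0 - cos_sim

# after normalizing to unit length, squared euclidean distance
# is an affine function of cosine similarity: ||an - bn||^2 = 2 - 2*cos_sim
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
eucl_sq = np.sum((an - bn) ** 2)

print(cos_sim, cos_dist, eucl_sq)
```

So rankings by cosine similarity, cosine distance, and euclidean distance on normalized vectors all agree, which is why the choice matters less than it might seem.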
Dear all, I have just started using the triplet losses for my research. However, since I have not studied much PyTorch, I have the next two queries: i) How can I see and store the triplets that are formed in the testing phase? For example, in the training_batch_hard_trec.py script, what should I add there for saving the test set based on triplets? Thanks for your time.
@nreimers thanks for the answer. Could you help me with the command for saving the test set? Thanks again for your time.
https://stackoverflow.com/questions/11218477/how-can-i-use-pickle-to-save-a-dict This works with any other data type in Python as well.
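For example (the triplets and filename here are made up; any picklable Python object works the same way):

```python
import pickle

# hypothetical test triplets as (anchor, positive, negative) strings
test_triplets = [
    ("how to reset password", "password reset steps", "refund policy"),
    ("opening hours", "when are you open", "shipping costs"),
]

# serialize the list to disk ...
with open("test_triplets.pkl", "wb") as f:
    pickle.dump(test_triplets, f)

# ... and load it back later, unchanged
with open("test_triplets.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == test_triplets)
```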
Maybe it is a naive question (as I am not a native PyTorch user).
When training as in the example shown here (for the above-mentioned loss):
how is a siamese model trained where I have two inputs? Because you are using a SentenceTransformer (which maps a single input to an output). Also, in your bi-encoder example you build a SentenceTransformer from scratch.
I just wonder how training in a siamese manner happens?
In my understanding, SentenceTransformer is a siamese bi-encoder (like in your paper).
Otherwise in your Quora example: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/quora_duplicate_questions/training_multi-task-learning.py
also a SentenceTransformer model is trained and gets the two inputs for sentence pairs. I wonder where and when the model "knows" how to fit depending on the number of inputs? I feel I am missing something. When is a "siamese" model trained, and when a "single" model with one input?