'scale' hyperparameter in MultipleNegativesRankingLoss #3054

Open
gnatesan opened this issue Nov 14, 2024 · 7 comments
Labels: question (Further information is requested)

Comments

gnatesan commented Nov 14, 2024

I am looking through the MultipleNegativesRankingLoss.py code and I have a question about the 'scale' hyperparameter. Also known as the 'temperature', the scale is used to stretch or compress the range of output values from the similarity function. A larger scale creates a greater distinction between positive and negative examples in terms of similarity score differences. The line below shows how the scale is used in the forward function of the loss.

scores = self.similarity_fct(embeddings_a, embeddings_b) * self.scale

Currently, the scale defaults to 20 when cosine similarity is used as the similarity function.

Why was 20 selected as the scale when using cosine similarity on the embeddings? Is this the optimal scale value for cosine similarity? Would this hyperparameter need to be tuned during fine-tuning?
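
For context, the scale (and the similarity function) can be passed when constructing the loss. A minimal sketch, assuming the current sentence-transformers API where the constructor accepts scale and similarity_fct:

from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Default: cosine similarity scaled by 20, i.e. a temperature of 1/20 = 0.05
loss = losses.MultipleNegativesRankingLoss(model)

# Explicitly passing a different scale to experiment with it
loss_low_scale = losses.MultipleNegativesRankingLoss(
    model, scale=10.0, similarity_fct=util.cos_sim
)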

tomaarsen (Collaborator) commented Nov 14, 2024

Hello!

I'm not actually super sure about the origin of this parameter; Nils Reimers added it before I took over. My understanding is that the scale is the inverse of the temperature, i.e. a scale of 20 corresponds to a temperature of 0.05. Perhaps 20 was chosen because 0.05 is a common temperature in InfoNCE.

  • A lower temperature (i.e. a higher scale) in InfoNCE/in-batch negatives loss should result in a sharper focus on the positive example.
  • A higher temperature (i.e. a lower scale) in InfoNCE/in-batch negatives loss should result in a more general distribution over the positive and negative examples (the small softmax sketch below illustrates this).
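
To make that concrete, here is a small sketch with made-up cosine similarities (one positive, three negatives) showing how the softmax distribution sharpens as the scale grows:

import torch

# Made-up cosine similarities: positive first, then three negatives
similarities = torch.tensor([0.70, 0.40, 0.35, 0.30])

for scale in [1, 5, 20, 50]:
    probs = torch.softmax(similarities * scale, dim=0)
    print(f"scale={scale:>2}: {[round(p, 3) for p in probs.tolist()]}")
# Low scale (high temperature): the distribution stays close to uniform.
# High scale (low temperature): nearly all probability mass lands on the positive.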

Here's an example script of manually going through the loss:

from torch import nn
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Let's take 1 sample and send it through our loss, except now "manually"
anchor = "is toprol xl the same as metoprolol?"
positive = "Metoprolol succinate is also known by the brand name Toprol XL. It is the extended-release form of metoprolol. Metoprolol succinate is approved to treat high blood pressure, chronic chest pain, and congestive heart failure."
negative_1 = "The Are You Experienced album was apparently mastered from the original stereo UK master tapes (according to Steve Hoffman - one of the very few who has heard both the master tapes and the CDs produced over the years). ... The CD booklets were a little sparse, but at least they stayed true to the album's original design."
negative_2 = "Matryoshka dolls are made of wood from lime, balsa, alder, aspen, and birch trees; lime is probably the most common wood type. ... After cutting, the trees are stripped of most of their bark, although a few inner rings of bark are left to bind the wood and keep it from splitting."
negative_3 = "The eyes are always the same size from birth to death. Baby eyes are proportionally larger than adult eyes, but they are still smaller."
# For now we assume that these negatives are in the same sample, so we train with 5 columns: anchor, positive, negative_1, negative_2, negative_3

# We now encode both the anchor, and the "candidate positives" out of which we want to find the real positive
anchor_embedding = model.encode(anchor)
candidate_embeddings = model.encode([positive, negative_1, negative_2, negative_3])
print(anchor_embedding.shape)
# (384,) a.k.a. 1 embedding of 384 dimensions
print(candidate_embeddings.shape)
# (4, 384) a.k.a. 4 embeddings of 384 dimensions

similarities = model.similarity(anchor_embedding, candidate_embeddings)
print(similarities)
# tensor([[0.7811, 0.0835, 0.0644, 0.0639]])
# a.k.a. the anchor is most similar to the positive, and not very similar to the 3 negatives

# Let's set up our loss
cross_entropy_loss = nn.CrossEntropyLoss()

# And we need a label, i.e. we need to know which of the 4 non-anchor embeddings is the positive one
# We can do this by setting label as the index of the true positive in the candidate_embeddings.
# In this case, the true positive is the first one, so the label is 0
label = 0

# And let's iterate over the scales to calculate the loss:
for scale in range(30):
    # Now we can calculate the loss
    loss = cross_entropy_loss(similarities * scale, torch.tensor([label]))
    print(f"Loss with scale {scale}: {loss.item():.4f}")

With these results:

Loss with scale 0: 1.3863
Loss with scale 1: 0.9059
Loss with scale 2: 0.5450
Loss with scale 3: 0.3046
Loss with scale 4: 0.1613
Loss with scale 5: 0.0825
Loss with scale 6: 0.0414
Loss with scale 7: 0.0206
Loss with scale 8: 0.0102
Loss with scale 9: 0.0050
Loss with scale 10: 0.0025
Loss with scale 11: 0.0012
Loss with scale 12: 0.0006
Loss with scale 13: 0.0003
Loss with scale 14: 0.0001
Loss with scale 15: 0.0001
Loss with scale 16: 0.0000
Loss with scale 17: 0.0000
Loss with scale 18: 0.0000
Loss with scale 19: 0.0000
Loss with scale 20: 0.0000
Loss with scale 21: 0.0000
Loss with scale 22: 0.0000
Loss with scale 23: 0.0000
Loss with scale 24: 0.0000
Loss with scale 25: 0.0000
Loss with scale 26: 0.0000
Loss with scale 27: 0.0000
Loss with scale 28: 0.0000
Loss with scale 29: 0.0000

Let's manually create some similarities and go through those:

from torch import nn
import torch

similarities_list = [
    torch.tensor([[0.7811, 0.0835, 0.0644, 0.0639]]),
    torch.tensor([[0.5842, 0.5243, 0.5351, 0.5124]]),
    torch.tensor([[0.4842, 0.5243, 0.5351, 0.5124]]),
    torch.tensor([[0.2424, 0.4243, 0.5382, 0.4244]]),
]
descriptions = [
    "Great similarity to positive",
    "Slightly more similarity to positive",
    "Slightly less similarity to positive",
    "Low similarity to positive",
]

# Let's set up our loss
cross_entropy_loss = nn.CrossEntropyLoss()

# And we need a label, i.e. we need to know which of the 4 non-anchor embeddings is the positive one
# We can do this by setting label as the index of the true positive in the candidate_embeddings.
# In this case, the true positive is the first one, so the label is 0
label = 0

for similarities, description in zip(similarities_list, descriptions):
    # And let's iterate over the scales to calculate the loss:
    print(description)
    for scale in range(30):
        # Now we can calculate the loss
        loss = cross_entropy_loss(similarities * scale, torch.tensor([label]))
        print(f"Loss with scale {scale}: {loss.item():.4f}")
    print()
With these results:

Great similarity to positive
Loss with scale 0: 1.3863
Loss with scale 1: 0.9059
Loss with scale 2: 0.5450
Loss with scale 3: 0.3046
Loss with scale 4: 0.1613
Loss with scale 5: 0.0825
Loss with scale 6: 0.0414
Loss with scale 7: 0.0206
Loss with scale 8: 0.0102
Loss with scale 9: 0.0050
Loss with scale 10: 0.0025
Loss with scale 11: 0.0012
Loss with scale 12: 0.0006
Loss with scale 13: 0.0003
Loss with scale 14: 0.0001
Loss with scale 15: 0.0001
Loss with scale 16: 0.0000
Loss with scale 17: 0.0000
Loss with scale 18: 0.0000
Loss with scale 19: 0.0000
Loss with scale 20: 0.0000
Loss with scale 21: 0.0000
Loss with scale 22: 0.0000
Loss with scale 23: 0.0000
Loss with scale 24: 0.0000
Loss with scale 25: 0.0000
Loss with scale 26: 0.0000
Loss with scale 27: 0.0000
Loss with scale 28: 0.0000
Loss with scale 29: 0.0000

Slightly more similarity to positive
Loss with scale 0: 1.3863
Loss with scale 1: 1.3415
Loss with scale 2: 1.2974
Loss with scale 3: 1.2541
Loss with scale 4: 1.2116
Loss with scale 5: 1.1700
Loss with scale 6: 1.1291
Loss with scale 7: 1.0891
Loss with scale 8: 1.0499
Loss with scale 9: 1.0116
Loss with scale 10: 0.9742
Loss with scale 11: 0.9377
Loss with scale 12: 0.9020
Loss with scale 13: 0.8673
Loss with scale 14: 0.8334
Loss with scale 15: 0.8005
Loss with scale 16: 0.7684
Loss with scale 17: 0.7373
Loss with scale 18: 0.7071
Loss with scale 19: 0.6777
Loss with scale 20: 0.6493
Loss with scale 21: 0.6218
Loss with scale 22: 0.5952
Loss with scale 23: 0.5694
Loss with scale 24: 0.5445
Loss with scale 25: 0.5205
Loss with scale 26: 0.4973
Loss with scale 27: 0.4750
Loss with scale 28: 0.4534
Loss with scale 29: 0.4327

Slightly less similarity to positive
Loss with scale 0: 1.3863
Loss with scale 1: 1.4163
Loss with scale 2: 1.4466
Loss with scale 3: 1.4773
Loss with scale 4: 1.5083
Loss with scale 5: 1.5397
Loss with scale 6: 1.5714
Loss with scale 7: 1.6035
Loss with scale 8: 1.6359
Loss with scale 9: 1.6686
Loss with scale 10: 1.7016
Loss with scale 11: 1.7349
Loss with scale 12: 1.7686
Loss with scale 13: 1.8026
Loss with scale 14: 1.8368
Loss with scale 15: 1.8714
Loss with scale 16: 1.9062
Loss with scale 17: 1.9413
Loss with scale 18: 1.9767
Loss with scale 19: 2.0124
Loss with scale 20: 2.0484
Loss with scale 21: 2.0846
Loss with scale 22: 2.1211
Loss with scale 23: 2.1578
Loss with scale 24: 2.1948
Loss with scale 25: 2.2320
Loss with scale 26: 2.2695
Loss with scale 27: 2.3072
Loss with scale 28: 2.3451
Loss with scale 29: 2.3833

Low similarity to positive
Loss with scale 0: 1.3863
Loss with scale 1: 1.5567
Loss with scale 2: 1.7378
Loss with scale 3: 1.9288
Loss with scale 4: 2.1289
Loss with scale 5: 2.3376
Loss with scale 6: 2.5539
Loss with scale 7: 2.7774
Loss with scale 8: 3.0073
Loss with scale 9: 3.2431
Loss with scale 10: 3.4842
Loss with scale 11: 3.7302
Loss with scale 12: 3.9807
Loss with scale 13: 4.2352
Loss with scale 14: 4.4934
Loss with scale 15: 4.7550
Loss with scale 16: 5.0197
Loss with scale 17: 5.2873
Loss with scale 18: 5.5575
Loss with scale 19: 5.8301
Loss with scale 20: 6.1049
Loss with scale 21: 6.3816
Loss with scale 22: 6.6602
Loss with scale 23: 6.9405
Loss with scale 24: 7.2223
Loss with scale 25: 7.5054
Loss with scale 26: 7.7898
Loss with scale 27: 8.0754
Loss with scale 28: 8.3619
Loss with scale 29: 8.6494

So: a higher scale is harsher when the performance is bad (last case), while it's softer when the performance is good (first case). Beyond that, a higher scale is softer when the positive is slightly better than the negatives, and harsher when the positive is slightly worse than the negatives.

@daegonYu once asked what'd happen if we set the scale to e.g. 50, and here it is:

Great similarity to positive
Loss with scale 50: 0.0000

Slightly more similarity to positive
Loss with scale 50: 0.1514

Slightly less similarity to positive
Loss with scale 50: 3.2294

Low similarity to positive
Loss with scale 50: 14.7967

What scale results in the best overall performance is an unanswered question. I think it would actually be really fascinating if someone set up a training script that trains e.g. 40 small models with different scale values, as I'm not sure whether the best performance would be around the default of 20.
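
For anyone who wants to run such a sweep, a rough sketch could look like the following, assuming the sentence-transformers v3 Trainer API and the sentence-transformers/all-nli triplet dataset (swap in your own model, data, and training arguments):

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# Small (anchor, positive, negative) dataset used purely as an example
train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]")

for scale in [1, 5, 10, 20, 30, 50]:
    model = SentenceTransformer("microsoft/mpnet-base")  # fresh model per run
    loss = losses.MultipleNegativesRankingLoss(model, scale=scale)

    args = SentenceTransformerTrainingArguments(
        output_dir=f"models/mnrl-scale-{scale}",
        num_train_epochs=1,
        per_device_train_batch_size=64,
    )
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        loss=loss,
    )
    trainer.train()
    # Evaluate each finished model on a retrieval or STS benchmark to compare scales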

  • Tom Aarsen

tomaarsen (Collaborator)

I started some training jobs to test the different scale values.

gnatesan (Author)

Ok, so I'm trying to understand this intuitively. If I am using a well-curated dataset for training where each anchor has a positive example and the negatives are drawn from the other positives in the batch (in-batch negatives), wouldn't it make sense to use a higher scale? In that setting performance should be good, with the positive example's similarity score higher than those of the negatives (assuming the model can produce such scores, since the in-batch negative sentences should be significantly different from the positive sentence for a given example).

tomaarsen (Collaborator)

In my opinion, the intuition is a tad hard to understand, and it really seems like the best approach for now is to just run some tests.

Speaking of which, these are the findings from my experiments yesterday:

[image: evaluation results for models trained with different scale values]

I suspect that the difference between the scale parameters shrinks a lot once you add more training data, but perhaps it's worthwhile to consider a higher scale under similar settings?

I'd consider testing with your data to see what works best for you.

  • Tom Aarsen

daegonYu (Contributor)

Thank you for sharing the good experimental results.

I saw the experimental results with different scale values above. What's interesting is that when the scale is 0, the loss is the same at 1.3863. This means that if you train with scale 0, you get the same loss value when training with data corresponding to "Great similarity to positive" and with data corresponding to "Low similarity to positive". I wonder if that means the model is not training properly.

In addition, I posted a GitHub issue (microsoft/unilm#1588) about Microsoft's E5 model. Can you tell me why the following phenomenon occurs? "The logits are calculated with cosine_similarity / t. Therefore, the logits will fall in [-100, 100] with t = 0.01 and [-50, 50] with t = 0.02, etc. However, this does not mean the learned cosine similarity will be in a wider range. On the contrary, the cosine similarity tends to concentrate as the temperature becomes lower."

I understand that the logits fall in [-100, 100], which means that a lower temperature lets the logits vary over a wider range, but I still do not understand why the cosine similarity tends to concentrate as the temperature becomes lower. Is this just an experimental result whose cause is unknown?

tomaarsen (Collaborator) commented Nov 18, 2024

I saw the experimental results with different scale values above. What's interesting is that when the scale is 0, the loss is the same at 1.3863. This means that if you train with scale 0, you get the same loss value when training with data corresponding to "Great similarity to positive" and with data corresponding to "Low similarity to positive". I wonder if that means the model is not training properly.

Indeed, it's because we do scores = self.similarity_fct(embeddings_a, embeddings_b) * self.scale. With a scale of 0, all scaled similarity scores are 0, so the embeddings don't actually matter and the model doesn't learn anything. That's also why the performance is so bad.
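
Concretely, with scale 0 every logit is 0, so the softmax is uniform over the 4 candidates and the cross-entropy loss is ln(4) ≈ 1.3863 regardless of the embeddings. A quick check:

import math
import torch
from torch import nn

# With scale = 0, every batch produces all-zero logits
zero_logits = torch.zeros(1, 4)
loss = nn.CrossEntropyLoss()(zero_logits, torch.tensor([0]))
print(loss.item(), math.log(4))  # both ≈ 1.3863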

As for the cosine similarity concentration: I'm actually not sure why this happens. My intuition, however, is that because a lower temperature (i.e. a higher scale) results in a stronger focus on the positive example, there is less pressure to keep the similarity scores of the negative samples low. Perhaps this means that the similarity to negative samples drifts higher, resulting in almost all similarity scores concentrating around the same point (e.g. 0.75 or so).

  • Tom Aarsen

@daegonYu
Copy link
Contributor

Thank you so much for your great advice. Your intuition has helped me a lot. Thanks!

tomaarsen added the question label on Nov 19, 2024