Abnormally low values for NanoBEIR benchmark #1627

Closed
minsik-ai opened this issue Dec 24, 2024 · 10 comments · Fixed by #1687
Labels
bug Something isn't working

Comments

@minsik-ai

Continuing from #1588

NanoBEIR performance on Touche2020 and NFCorpus is too low compared to reported values.

You can check out some of the values here: embeddings-benchmark/results#72

@isaac-chung
Collaborator

@minsik-ai could you please specify:

  1. which model you tried, the script and/or commands you used,
  2. the corresponding results file in that PR you linked, and
  3. what values (metrics) you're comparing

Thanks in advance!

@Samoed
Collaborator

Samoed commented Dec 24, 2024

The original blog only presents results for e5-mistral based models, and it's hard to evaluate because we don't know which prompts were used during testing. I think @ArthurCamara might be able to share some insights on how they evaluated models on NanoBEIR.

@Samoed
Collaborator

Samoed commented Dec 24, 2024

I've evaluated multilingual-e5-small on MTEB NanoBEIR and with Sentence Transformers (code). The scores below are nDCG@10; a minimal sketch of the two evaluation paths follows the lists below.

Task MTEB Sentence Transformers
NanoArguAna 0.44536 0.444486
NanoClimateFever 0.2222 0.30642
NanoDBPedia 0.17534 0.6053
NanoFever 0.80845 0.30642
NanoFiQA2018 0.34363 0.4430
NanoHotpotQA 0.56911 0.81012
NanoMSMARCO 0.62091 0.62091
NanoNFCorpus 0.05535 0.2885
NanoNQ 0.67664 0.68618
NanoQuora 0.90621 0.97279
NanoSCIDOCS 0.20826 0.34377
NanoSciFact 0.71129 0.72457
NanoTouche2020 0.19598 0.49540

Not matching results:

  • NanoClimateFever
  • NanoDBPedia
  • NanoFever
  • NanoFiQA2018
  • NanoHotpotQA
  • NanoNFCorpus
  • NanoQuora
  • NanoSCIDOCS
  • NanoTouche2020

Matching results:

  • NanoArguAna
  • NanoMSMARCO
  • NanoSciFact (diff 0.01)
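
For reference, here is a minimal sketch of the two evaluation paths being compared (assuming recent mteb and sentence-transformers releases; the Nano* task names, dataset names, and output handling are assumptions rather than the exact script that was run):

```python
# Hedged sketch of the two NanoBEIR evaluation paths compared in the table above.
# Task/dataset names and arguments are assumptions, not the exact script that was run.
import mteb
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Path 1: MTEB's Nano* retrieval tasks (assumed to follow the Nano<Dataset>Retrieval naming).
tasks = mteb.get_tasks(tasks=["NanoNFCorpusRetrieval", "NanoTouche2020Retrieval"])
mteb.MTEB(tasks=tasks).run(model, output_folder="results/mteb")

# Path 2: Sentence Transformers' built-in NanoBEIR evaluator (assumed dataset names).
evaluator = NanoBEIREvaluator(dataset_names=["nfcorpus", "touche2020"])
results = evaluator(model)
print(results)  # includes nDCG@10 per dataset plus aggregated scores
```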

@minsik-ai
Author

@Samoed's findings capture the main difference I've seen!
You can see NanoNFCorpus is in the 0.05 range for MTEB, compared to the 0.2 range for Sentence Transformers.
I've also run additional experiments with intfloat/e5-mistral-7b-instruct and have seen similar performance degradation.

KennethEnevoldsen added the bug label Dec 25, 2024
@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Dec 25, 2024

Hmm, the difference here seems so stark that I will label it as a bug. It might be worth excluding it from the registered benchmark until we have these results sorted out (comment it out in benchmarks.py).
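
For illustration only, here is a rough sketch of what such a registered benchmark entry might look like; the import path, field names, and Nano* task ids are assumptions, not mteb's actual definition. Excluding NanoBEIR would amount to commenting out (or removing) an entry like this in benchmarks.py:

```python
# Hedged sketch only, not mteb's real source. Roughly how a benchmark entry in
# mteb/benchmarks/benchmarks.py could look; field names and task ids are assumptions.
from mteb import Benchmark, get_tasks

NANOBEIR = Benchmark(
    name="NanoBEIR",  # hypothetical registry name
    tasks=get_tasks(tasks=["NanoNFCorpusRetrieval", "NanoTouche2020Retrieval"]),  # assumed task ids
    description="Small subsets of BEIR retrieval datasets for quick, approximate comparisons.",
)
# Commenting out the assignment above would hide NanoBEIR from the benchmark registry
# until the score discrepancy is resolved.
```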

@ArthurCamara

Quoting @Samoed: "The original blog only presents results for e5-mistral based models, and it's hard to evaluate because we don't know which prompts were used during testing. I think @ArthurCamara might be able to share some insights on how they evaluated models on NanoBEIR."

Hi there! We evaluated E5-Mistral with the same prompts as were used in the original paper (using the SentenceTransformers implementation, IIRC).
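
For illustration only, prompts can be attached to a Sentence Transformers model and selected at encode time; the instruction string and prompt names below are assumptions, not necessarily the exact prompts used in that evaluation:

```python
# Hedged sketch of passing an E5-Mistral-style instruction prompt through Sentence Transformers.
# The instruction text and prompt names are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "intfloat/e5-mistral-7b-instruct",
    prompts={
        "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ",
        "passage": "",  # documents are typically encoded without an instruction
    },
)

query_emb = model.encode(["what symptoms indicate iron deficiency?"], prompt_name="query")
doc_emb = model.encode(["Fatigue and pale skin are common signs of iron deficiency."], prompt_name="passage")
```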

Just a quick reminder: scores between the Nano and full versions of BEIR are not supposed to match, even in magnitude. NanoBEIR is meant to be a quick (and approximate) way to measure relative performance (i.e., if model A performs better than model B on NanoBEIR, it should also perform better on the full dataset).

I've been postponing a full benchmark evaluation across multiple models for a while now, as I'm busy with other work-related things, but I'm hoping I can get some time over the holidays to do it properly.

@minsik-ai
Author

minsik-ai commented Dec 25, 2024

(quoting @Samoed's comparison table and match/mismatch lists above)

Correct me if I'm wrong, but I think two NanoBEIR implementations (one from Sentence Transformers, one from MTEB) are being compared here! We are not comparing NanoBEIR to full BEIR. You can check the code to verify it.

@minsik-ai
Author

Do we have any traction on this issue? I plan to have a look at the SentenceTransformers code and spot any differences; meanwhile, I would appreciate any insights on why SentenceTransformers and MTEB differ.

@Samoed
Collaborator

Samoed commented Dec 31, 2024

For now I don't have any idea why the results are different, but I'll try to find the issue too.

@Samoed
Collaborator

Samoed commented Jan 2, 2025

Task MTEB Sentence Transformers #1687
NanoArguAna 0.44536 0.444486 0.44536
NanoClimateFever 0.2222 0.30642 0.30643
NanoDBPedia 0.17534 0.6053 0.60535
NanoFever 0.80845 0.8348586 0.83486
NanoFiQA2018 0.34363 0.4430 0.43529
NanoHotpotQA 0.56911 0.81012 0.81012
NanoMSMARCO 0.62091 0.62091 0.62091
NanoNFCorpus 0.05535 0.2885 0.2882
NanoNQ 0.67664 0.68618 0.68618
NanoQuora 0.90621 0.97279 0.9728
NanoSCIDOCS 0.20826 0.34377 0.34378
NanoSciFact 0.71129 0.72457 0.72458
NanoTouche2020 0.19598 0.49540 0.49541
