Abnormally low values for NanoBEIR benchmark #1627

Closed
minsik-ai opened this issue Dec 24, 2024 · 10 comments · Fixed by #1687
Labels
bug Something isn't working

Comments

@minsik-ai

Continuing from #1588

NanoBEIR performance on Touche2020 and NFCorpus is too low compared to reported values.

You can check out some of the values here: embeddings-benchmark/results#72

@isaac-chung
Collaborator

@minsik-ai could you please specify:

  1. which model you tried, the script and/or commands you used,
  2. the corresponding results file in that PR you linked, and
  3. what values (metrics) you're comparing

Thanks in advance!

@Samoed
Collaborator

Samoed commented Dec 24, 2024

The original blog only presents results for e5-mistral based models, and it's hard to evaluate because we don't know which prompts were used during testing. I think @ArthurCamara might be able to share some insights on how they evaluated models on NanoBEIR.

@Samoed
Collaborator

Samoed commented Dec 24, 2024

I've evaluated multilingual-e5-small on MTEB NanoBEIR and with Sentence Transformers (code). The scores below are nDCG@10; a minimal sketch of the two evaluation paths follows the lists below.

Task MTEB Sentence Transformers
NanoArguAna 0.44536 0.444486
NanoClimateFever 0.2222 0.30642
NanoDBPedia 0.17534 0.6053
NanoFever 0.80845 0.30642
NanoFiQA2018 0.34363 0.4430
NanoHotpotQA 0.56911 0.81012
NanoMSMARCO 0.62091 0.62091
NanoNFCorpus 0.05535 0.2885
NanoNQ 0.67664 0.68618
NanoQuora 0.90621 0.97279
NanoSCIDOCS 0.20826 0.34377
NanoSciFact 0.71129 0.72457
NanoTouche2020 0.19598 0.49540

Not matching results:

  • NanoClimateFever
  • NanoDBPedia
  • NanoFever
  • NanoFiQA2018
  • NanoHotpotQA
  • NanoNFCorpus
  • NanoQuora
  • NanoSCIDOCS
  • NanoTouche2020

Matching results:

  • NanoArguAna
  • NanoMSMARCO
  • NanoSciFact (diff 0.01)
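
For reference, here is a minimal sketch of the two evaluation paths being compared (assuming recent mteb and sentence-transformers releases; the Nano* task names, dataset names, and output handling are assumptions rather than the exact script that was run):

```python
# Hedged sketch of the two NanoBEIR evaluation paths compared in the table above.
# Task/dataset names and arguments are assumptions, not the exact script that was run.
import mteb
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Path 1: MTEB's Nano* retrieval tasks (assumed to follow the Nano<Dataset>Retrieval naming).
tasks = mteb.get_tasks(tasks=["NanoNFCorpusRetrieval", "NanoTouche2020Retrieval"])
mteb.MTEB(tasks=tasks).run(model, output_folder="results/mteb")

# Path 2: Sentence Transformers' built-in NanoBEIR evaluator (assumed dataset names).
evaluator = NanoBEIREvaluator(dataset_names=["nfcorpus", "touche2020"])
results = evaluator(model)
print(results)  # includes nDCG@10 per dataset plus aggregated scores
```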

@minsik-ai
Author

@Samoed's findings capture the main difference I've seen!
You can see NanoNFCorpus is in the 0.05 range for MTEB, compared to the 0.2 range for Sentence Transformers.
I've also run additional experiments with intfloat/e5-mistral-7b-instruct and have seen similar performance degradation.

KennethEnevoldsen added the bug label Dec 25, 2024
@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Dec 25, 2024

Hmm, the difference here seems so stark that I will label it as a bug. It might be worth excluding it from the registered benchmark until we have these results sorted out (comment it out in benchmarks.py).
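
For illustration only, here is a rough sketch of what such a registered benchmark entry might look like; the import path, field names, and Nano* task ids are assumptions, not mteb's actual definition. Excluding NanoBEIR would amount to commenting out (or removing) an entry like this in benchmarks.py:

```python
# Hedged sketch only, not mteb's real source. Roughly how a benchmark entry in
# mteb/benchmarks/benchmarks.py could look; field names and task ids are assumptions.
from mteb import Benchmark, get_tasks

NANOBEIR = Benchmark(
    name="NanoBEIR",  # hypothetical registry name
    tasks=get_tasks(tasks=["NanoNFCorpusRetrieval", "NanoTouche2020Retrieval"]),  # assumed task ids
    description="Small subsets of BEIR retrieval datasets for quick, approximate comparisons.",
)
# Commenting out the assignment above would hide NanoBEIR from the benchmark registry
# until the score discrepancy is resolved.
```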

@ArthurCamara

Quoting @Samoed: "The original blog only presents results for e5-mistral based models, and it's hard to evaluate because we don't know which prompts were used during testing. I think @ArthurCamara might be able to share some insights on how they evaluated models on NanoBEIR."

Hi there! We evaluated E5-Mistral with the same prompts as were used in the original paper (using the SentenceTransformers implementation, IIRC).
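
For illustration only, prompts can be attached to a Sentence Transformers model and selected at encode time; the instruction string and prompt names below are assumptions, not necessarily the exact prompts used in that evaluation:

```python
# Hedged sketch of passing an E5-Mistral-style instruction prompt through Sentence Transformers.
# The instruction text and prompt names are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "intfloat/e5-mistral-7b-instruct",
    prompts={
        "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ",
        "passage": "",  # documents are typically encoded without an instruction
    },
)

query_emb = model.encode(["what symptoms indicate iron deficiency?"], prompt_name="query")
doc_emb = model.encode(["Fatigue and pale skin are common signs of iron deficiency."], prompt_name="passage")
```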

Just a quick reminder: scores between the Nano and full versions of BEIR are not supposed to match, even in magnitude. NanoBEIR is meant to be a quick (and approximate) way to measure relative performance (i.e., if model A performs better than model B on NanoBEIR, it should also perform better on the full dataset).

I've been postponing a full benchmark evaluation across multiple models for a while now, as I'm busy with other work-related things, but I'm hoping I can get some time over the holidays to do it properly.

@minsik-ai
Author

minsik-ai commented Dec 25, 2024

(quoting @Samoed's comparison table and match/mismatch lists above)

Correct me if I'm wrong, but I think two NanoBEIR implementations (one from Sentence Transformers, one from MTEB) are being compared here! We are not comparing NanoBEIR to full BEIR. You can check the code to verify it.

@minsik-ai
Author

Do we have any traction on this issue? I plan to have a look at the SentenceTransformers code and spot any differences; meanwhile, I would appreciate any insights on why SentenceTransformers and MTEB differ.

@Samoed
Collaborator

Samoed commented Dec 31, 2024

For now I don't have any idea why the results are different, but I'll try to find the issue too.

@Samoed
Collaborator

Samoed commented Jan 2, 2025

Task MTEB Sentence Transformers #1687
NanoArguAna 0.44536 0.444486 0.44536
NanoClimateFever 0.2222 0.30642 0.30643
NanoDBPedia 0.17534 0.6053 0.60535
NanoFever 0.80845 0.8348586 0.83486
NanoFiQA2018 0.34363 0.4430 0.43529
NanoHotpotQA 0.56911 0.81012 0.81012
NanoMSMARCO 0.62091 0.62091 0.62091
NanoNFCorpus 0.05535 0.2885 0.2882
NanoNQ 0.67664 0.68618 0.68618
NanoQuora 0.90621 0.97279 0.9728
NanoSCIDOCS 0.20826 0.34377 0.34378
NanoSciFact 0.71129 0.72457 0.72458
NanoTouche2020 0.19598 0.49540 0.49541
