Leaderboard 2.0: Missing results #1571

Open
x-tabdeveloping opened this issue Dec 9, 2024 · 14 comments
Labels
leaderboard issues related to the leaderboard

Comments

@x-tabdeveloping
Collaborator

Some pretty essential results seem to be missing from the new leaderboard.
Here's a list of things that we should probably fix before releasing the leaderboard:

MTEB(Multilingual)

I have only looked into the models we promised to run in the review response; the problems might be more widespread.

  • openai/text-embedding-3-large
    • SprintDuplicateQuestions
  • openai/text-embedding-3-small
    • OpusparcusPC
    • SprintDuplicateQuestions
  • gte-Qwen2-7B-instruct
    • Core17InstructionRetrieval
    • HagridRetrieval
    • MIRACLRetrievalHardNegatives
    • News21InstructionRetrieval
    • Robust04InstructionRetrieval
    • WebLINXCandidatesReranking
  • BAAI/bge-large-en-v1.5
  • gritlm-8x7B
    • Core17InstructionRetrieval
    • HagridRetrieval
    • MIRACLRetrievalHardNegatives
    • Robust04InstructionRetrieval
    • WebLINXCandidatesReranking
  • salesforce/SFR-Embeddings-2R
    • Robust04InstructionRetrieval
    • News21InstructionRetrieval
    • Core17InstructionRetrieval
  • snowflake/arctic-embed-m-v1.5:
    • missing almost all tasks: This is a problem with all snowflake models
  • WhereIsAI/UAE-Large-V1:
    • missing almost all tasks
  • stella_en_1.5B_v5:
    • Core17InstructionRetrieval
    • News21InstructionRetrieval
    • Robust04InstructionRetrieval
  • stella_en_400M_v5
    • Core17InstructionRetrieval
    • News21InstructionRetrieval
    • Robust04InstructionRetrieval
  • intfloat/e5-base-v2
    • Core17InstructionRetrieval
    • News21InstructionRetrieval
    • Robust04InstructionRetrieval
  • intfloat/e5-large-v2
    • Core17InstructionRetrieval
  • intfloat/e5-small-v2
    • Core17InstructionRetrieval
    • News21InstructionRetrieval
    • Robust04InstructionRetrieval
  • jina/jina-embeddings-v3
  • mixedbread-ai/mxbai-embed-v1
    • BrazilianToxicTweetsClassification
    • CEDRClassification
    • Core17InstructionRetrieval
    • KorHateSpeechMLClassification
    • MalteseNewsClassification
    • MultiEURLEXMultilabelClassification
    • News21InstructionRetrieval
    • Robust04InstructionRetrieval
  • nomic-ai/nomic-embed-text-v1.5
    • BUCC.v2
    • BibleNLPBitextMining
    • BornholmBitextMining
    • BrazilianToxicTweetsClassification
    • CEDRClassification
    • Core17InstructionRetrieval
    • DiaBlaBitextMining
    • FloresBitextMining
    • IN22GenBitextMining
    • IndicGenBenchFloresBitextMining
    • KorHateSpeechMLClassification
    • MalteseNewsClassification
    • MultiEURLEXMultilabelClassification
    • NTREXBitextMining
    • News21InstructionRetrieval
    • NollySentiBitextMining
    • NorwegianCourtsBitextMining
    • NusaTranslationBitextMining
    • NusaXBitextMining
    • Robust04InstructionRetrieval
    • Tatoeba
  • NV-embed-v2: missing metadata (see "Add new models nvidia, gte, linq" #1436)

MTEB(eng, classic)

Problematic tasks:

  • CQADupstackProgrammersRetrieval (almost all models have NaNs)

Problematic models:

  • openai/text-embedding-3-large:
    • SprintDuplicateQuestions
    • SummEval
  • openai/text-embedding-3-small:
    • SprintDuplicateQuestions
    • SummEval
  • e5-mistral-7b-instruct:
    • STS22
  • multilingual-e5-large-instruct:
    • ArxivClusteringP2P
    • BiorxivClusteringP2P
    • BiorxivClusteringS2S
    • DBPedia
    • HotpotQA
    • MedrxivClusteringP2P
    • MedrxivClusteringS2S
  • snowflake/arctic-embed-m-v1.5:
    • AmazonCounterfactualClassification
    • ArguAna
    • MassiveIntentClassification
    • NFCorpus
    • SCIDOCS
    • SICK-R
    • STS12
    • STS13
    • STS14
    • STS15
    • STS16
    • STS17
    • SciFact
    • SprintDuplicateQuestions
    • TRECCOVID
    • ToxicConversationsClassification
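
The split above between "problematic tasks" (missing for nearly every model) and "problematic models" (missing a handful of tasks) could be computed from a score table along these lines. This is only a sketch: the function name `split_missing`, the 0.8 threshold, and the toy table are assumptions for illustration, not part of mteb.

```python
import pandas as pd


def split_missing(table: pd.DataFrame, threshold: float = 0.8):
    """Separate task-level gaps from model-level gaps.

    `table` has one row per model and one column per task; NaN marks a
    missing result. `threshold` is an arbitrary cut-off for this sketch.
    """
    missing = table.isna()
    # Fraction of models that lack each task
    task_missing = missing.mean(axis=0)
    problematic_tasks = task_missing[task_missing >= threshold].index.to_list()
    # For the remaining tasks, list what each individual model is missing
    rest = missing.drop(columns=problematic_tasks)
    problematic_models = {
        model: row[row].index.to_list()
        for model, row in rest.iterrows()
        if row.any()
    }
    return problematic_tasks, problematic_models


# Toy data (hypothetical names and scores, not real leaderboard results)
nan = float("nan")
toy = pd.DataFrame(
    {
        "TaskA": [0.5, 0.6, 0.7],
        "TaskB": [nan, nan, nan],
        "TaskC": [0.1, nan, 0.3],
    },
    index=["model-1", "model-2", "model-3"],
)
tasks, models = split_missing(toy)
print(tasks)
print(models)
```

Tasks above the threshold probably point to a loading or naming problem on the task side, while the per-model remainder is closer to a list of runs that still need to be done.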
@x-tabdeveloping
Collaborator Author

@Muennighoff Does anything pop out to you as something that has been run before but is missing? Also, can you run some or all of these?

@x-tabdeveloping
Collaborator Author

There are also some problematic tasks in MTEB(deu) and MTEB(fra); I will look into those too.

@x-tabdeveloping
Collaborator Author

MTEB(fra)

Quite a few individual results are missing for a lot of models; maybe something is wrong with the scraping or data loading.

Problematic tasks:

  • MLSUMClusteringP2P almost all models missing

MTEB(deu)

Problematic tasks:

  • TenKGnadClusteringP2P almost all models missing

In both benchmarks we're missing some tasks from 7B models.

@x-tabdeveloping
Collaborator Author

MTEB(eng, beta)

Problematic tasks:

  • HotpotQAHardNegatives majority of models missing
  • MedrxivClusteringS2S.v2 missing most notably for: stella, gte, SFR, GritLM8x7b, bge, UAE
  • StackExchangeClusteringP2P.v2 almost all missing
  • SummEvalSummarization.v2 almost all missing
  • TwentyNewsgroupsClustering.v2 almost all missing

@x-tabdeveloping
Collaborator Author

@Samoed @Muennighoff @KennethEnevoldsen I would really appreciate your help investigating and fixing these issues

@Muennighoff
Contributor

Great overview! Of course, I can run anything that is missing, as long as the model is loadable via mteb. Ideally, send it as a Python list of task & model pairs like here

@x-tabdeveloping
Collaborator Author

I'll compile you a list

@KennethEnevoldsen
Contributor

We might also include this in the list: embeddings-benchmark/results#65 (review)

As a side note, we would like to run these models [list in issue] on the remaining tasks in the MTEB(Medical) benchmark. However, we initially held off due to API cost constraints. Do you have access to credits with these providers that we could use for this purpose? Alternatively, would it be possible for you to run them on your side?

(I won't have much time to look into this as I am at the NeurIPS conference, but I will take a closer look once I get back.)

@x-tabdeveloping
Collaborator Author

Well, I have compiled a list of all task results that are missing, across all benchmarks in the new leaderboard, for all models that already show up in the leaderboard (i.e. they have metadata and are not missing every result on a benchmark).
The list is very long, but luckily the most important models are usually missing only one or two things. There also seem to be patterns; I'm wondering whether it has something to do with our versioning scheme for tasks.

I used the following script to get the missing results:

import json
from pathlib import Path

import pandas as pd
from tqdm import tqdm

import mteb
from mteb.leaderboard.table import scores_to_tables

benchmarks = mteb.get_benchmarks()

# Load every available result once, then reuse it for each benchmark
all_results = mteb.load_results()

# Per-benchmark results, with revisions joined and models filtered
results = {
    benchmark.name: benchmark.load_results(base_results=all_results)
    .join_revisions()
    .filter_models()
    for benchmark in tqdm(benchmarks, desc="Loading all benchmark results")
}


def to_pandas(gr_df) -> pd.DataFrame:
    """Convert a table returned by scores_to_tables into a pandas DataFrame."""
    cols = gr_df.value["headers"]
    data = gr_df.value["data"]
    return pd.DataFrame(data, columns=cols)


# Per-task score table for each benchmark (second table returned by
# scores_to_tables), indexed by model name
all_task_tables = {
    name: to_pandas(scores_to_tables(res.get_scores(format="long"))[1]).set_index(
        "Model"
    )
    for (name, res) in results.items()
}

# For each benchmark, record the tasks that have no score (NaN) for each model
missing_results = {}
for bench_name, table in all_task_tables.items():
    missing_results[bench_name] = {}
    for model_name, model_res in table.iterrows():
        nas = model_res.loc[model_res.isna()].index.to_list()
        if nas:
            missing_results[bench_name][model_name] = nas

And got this file:
missing_results.json

In the following format:

{
    "<benchmark_name>": {"<model_name>": ["task_name1", "task_name2", ...]}
}

We'll probably have to prioritize some models over others
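
One way to check the versioning hunch would be to count how often each task is missing and to separate versioned task names from unversioned ones. The sketch below assumes that versioned tasks end in `.v<number>` (as in `BUCC.v2`) and uses a toy dictionary in the format shown above; `missing_counts` is a hypothetical helper name, and the input is not the real file.

```python
import re
from collections import Counter

# Assumption for this sketch: versioned tasks end in ".v<number>", e.g. "BUCC.v2"
VERSION_SUFFIX = re.compile(r"\.v\d+$")


def missing_counts(missing_results: dict) -> tuple[Counter, Counter]:
    """Count, per task, how many (benchmark, model) entries lack it.

    Returns two counters: (versioned tasks, unversioned tasks).
    """
    versioned, unversioned = Counter(), Counter()
    for models in missing_results.values():
        for tasks in models.values():
            for task in tasks:
                target = versioned if VERSION_SUFFIX.search(task) else unversioned
                target[task] += 1
    return versioned, unversioned


# Toy input in the format shown above (hypothetical names)
example = {
    "bench-a": {"model-1": ["TaskX.v2", "TaskY"], "model-2": ["TaskX.v2"]},
    "bench-b": {"model-1": ["TaskY"]},
}
versioned, unversioned = missing_counts(example)
print(versioned.most_common(3))
print(unversioned.most_common(3))
```

If the versioned counter is dominated by a few tasks that are missing for nearly every model, that would support the idea that the gap comes from task naming rather than from runs that were never done.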

@x-tabdeveloping
Collaborator Author

Here it is for only the top 50 models; this should be a bit more reasonable to run:
missing_important.json

And here it is as a list of model-task pairs, as requested, @Muennighoff:
missing_model_task_list_important.json
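
For reference, going from the nested format to a flat list of model-task pairs, optionally restricted to a set of priority models, might look like the sketch below. The helper name `to_model_task_pairs`, the `(model, task)` tuple layout, and the toy input are assumptions; the exact layout of the attached files is not shown in this thread.

```python
def to_model_task_pairs(missing_results, keep_models=None):
    """Flatten {benchmark: {model: [tasks]}} into sorted, unique (model, task) pairs.

    If `keep_models` is given, only those models are included.
    """
    pairs = set()
    for models in missing_results.values():
        for model, tasks in models.items():
            if keep_models is not None and model not in keep_models:
                continue
            # A task shared by several benchmarks only needs to be run once
            pairs.update((model, task) for task in tasks)
    return sorted(pairs)


# Toy input (hypothetical names)
example = {
    "bench-a": {"model-1": ["TaskX", "TaskY"], "model-2": ["TaskX"]},
    "bench-b": {"model-1": ["TaskY"]},
}
all_pairs = to_model_task_pairs(example)
priority_pairs = to_model_task_pairs(example, keep_models={"model-2"})
print(all_pairs)
print(priority_pairs)
```

Deduplicating across benchmarks matters here because the same task can appear in more than one benchmark, so the flat list can be shorter than the nested one suggests.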

@isaac-chung isaac-chung added the leaderboard issues related to the leaderboard label Dec 18, 2024
@KennethEnevoldsen
Contributor

(see embeddings-benchmark/results#80)

@Muennighoff
Contributor

Muennighoff commented Dec 30, 2024

Sorry for the delay, somehow missed the message. Running all from your original message (#1571 (comment)) now. Looks like it is ~30K which should be fine. Will open a PR on the results repo once done.

Hope there are no bugs in the mteb implemented models 😁
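
A run of this size is easier to manage if the pairs are grouped by model, so that each model is loaded once and a failing model does not stop the rest. The following is a minimal sketch of that loop, not the script that was actually used: `run_missing` and the `run_fn(model, tasks)` callback are hypothetical, and in practice the callback would wrap whatever mteb evaluation call is used.

```python
from collections import defaultdict


def run_missing(pairs, run_fn):
    """Group (model, task) pairs by model and call run_fn once per model.

    Failures are collected and returned instead of aborting the whole run.
    """
    by_model = defaultdict(list)
    for model, task in pairs:
        by_model[model].append(task)

    failures = {}
    for model, tasks in by_model.items():
        try:
            run_fn(model, tasks)
        except Exception as exc:  # keep going; report the failure afterwards
            failures[model] = repr(exc)
    return failures


# Toy usage with a fake runner (hypothetical model and task names)
calls = []


def fake_run(model, tasks):
    if model == "broken-model":
        raise RuntimeError("could not load model")
    calls.append((model, tasks))


failures = run_missing(
    [("model-1", "TaskX"), ("broken-model", "TaskX"), ("model-1", "TaskY")],
    fake_run,
)
print(calls)
print(failures)
```

The returned `failures` mapping gives a natural starting point for opening one issue per failing model.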

@Muennighoff
Contributor

Opened a PR here: embeddings-benchmark/results#83 - Still pushing some results

hope we're not breaking github 😅


@Muennighoff
Contributor

Muennighoff commented Jan 1, 2025

I'm opening issues for all models/tasks that are failing and filtering them out in the meantime with the following addon to your script:

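file_path = "missing_results_9.json"
# Substrings of model names that are currently failing
skip_models = ["bm25s", "jasper", "bge-m3", "Zeta", "NoInstruct", "KaLM-embedding", "lodestone-base-4096-v1", "MiniCPM", "cai-stellaris-text", "e5-R-mistral-7b", "FollowIR", "Grit"]
# Tasks that are currently failing
skip_tasks = ["RuBQReranking", "STS22", "JSTS", "BUCC"]
# Drop a model's whole entry if its name matches a skipped model,
# or if its list of missing tasks contains any skipped task
missing_results_f = {
    k: {
        k2: v2
        for k2, v2 in v.items()
        if not (any(x in k2 for x in skip_models) or any(x in v2 for x in skip_tasks))
    }
    for k, v in missing_results.items()
}
with open(file_path, "w") as json_file:
    json.dump(missing_results_f, json_file, indent=4)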


4 participants