Leaderboard 2.0: Missing results #1571
Comments
@Muennighoff Does anything pop out to you as something that has been run before but is missing? Also, can you run some or all of these?
There are also some problematic tasks in MTEB(deu) and MTEB(fra); I will look into those too.
MTEB(fra)
Quite a few individual results are missing for a lot of models, maybe something wrong with the scraping or data loading.
Problematic tasks:
MTEB(deu)
Problematic tasks:
In both benchmarks we're missing some tasks from 7B models.
MTEB(eng, beta)
Problematic tasks:
@Samoed @Muennighoff @KennethEnevoldsen I would really appreciate your help investigating and fixing these issues
Great overview! Of course I can run anything that is missing, as long as the model is loadable via mteb; ideally it's a Python list of task & model pairs like here
I'll compile you a list
we might also include this in the list: embeddings-benchmark/results#65 (review)
(won't have too much time to look into this as I am at the NeurIPS conference, but I will take a closer look once I get back)
Well, I have compiled a list of all task results that are missing for all benchmarks in the new leaderboard + all models that already show up in the leaderboard (have metadata and don't miss all results on a benchmark). I used the following script to get the missing results:
```python
import json
from pathlib import Path

import pandas as pd
from tqdm import tqdm

import mteb
from mteb.leaderboard.table import scores_to_tables

benchmarks = mteb.get_benchmarks()
all_results = mteb.load_results()

# Per-benchmark results, with revisions joined and models filtered.
results = {
    benchmark.name: benchmark.load_results(base_results=all_results)
    .join_revisions()
    .filter_models()
    for benchmark in tqdm(benchmarks, desc="Loading all benchmark results")
}


def to_pandas(gr_df) -> pd.DataFrame:
    # Convert the leaderboard table component into a plain DataFrame.
    cols = gr_df.value["headers"]
    data = gr_df.value["data"]
    return pd.DataFrame(data, columns=cols)


# One per-task score table per benchmark, indexed by model name.
all_task_tables = {
    name: to_pandas(scores_to_tables(res.get_scores(format="long"))[1]).set_index(
        "Model"
    )
    for (name, res) in results.items()
}

# For every benchmark and model, collect the tasks whose score is NaN.
missing_results = {}
for bench_name, table in all_task_tables.items():
    missing_results[bench_name] = {}
    for model_name, model_res in table.iterrows():
        nas = model_res.loc[model_res.isna()].index.to_list()
        if nas:
            missing_results[bench_name][model_name] = nas
```
And got this file:
In the following format:
```
{
    "<benchmark_name>": {"<model_name>": ["task_name1", "task_name2", ...]}
}
```
We'll probably have to prioritize some models over others
Here it is for only the top 50 models; this should be a bit more reasonable to run:
And here it is as a list of model-task pairs, as requested @Muennighoff:
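A minimal sketch of how the nested `{benchmark: {model: [tasks]}}` mapping could be flattened into model-task pairs, optionally restricted to a subset of models. The function name `to_model_task_pairs`, the `keep_models` parameter, and the sample data are illustrative assumptions, not the script actually used in this thread.

```python
def to_model_task_pairs(missing_results, keep_models=None):
    """Flatten {benchmark: {model: [tasks]}} into sorted, unique (model, task) pairs."""
    pairs = set()
    for models in missing_results.values():
        for model_name, tasks in models.items():
            # Optionally restrict to a chosen subset of models (hypothetical parameter).
            if keep_models is not None and model_name not in keep_models:
                continue
            for task_name in tasks:
                # A task shared by several benchmarks only needs to run once per model.
                pairs.add((model_name, task_name))
    return sorted(pairs)


# Hypothetical sample data in the format described above.
example_missing = {
    "bench_a": {"model_x": ["TaskOne", "TaskTwo"], "model_y": ["TaskOne"]},
    "bench_b": {"model_x": ["TaskTwo", "TaskThree"]},
}
example_pairs = to_model_task_pairs(example_missing)
```

Collecting the pairs in a set means a task that is missing in several benchmarks is only listed once per model.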
Sorry for the delay, somehow missed the message. Running all from your original message (#1571 (comment)) now. Looks like it is ~30K which should be fine. Will open a PR on the results repo once done. Hope there are no bugs in the mteb implemented models 😁
Opened a PR here: embeddings-benchmark/results#83 - Still pushing some results, hope we're not breaking GitHub 😅
I'm opening issues for all models/tasks that are failing and filtering them out in the meantime with the below addon to your script
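One way such a filter could look, as a sketch only: the names `drop_failing` and `failing_pairs` and the sample data are hypothetical and are not the add-on referred to above. It assumes the `{benchmark: {model: [tasks]}}` format from the earlier script.

```python
def drop_failing(missing_results, failing_pairs):
    """Return a copy of missing_results without (model, task) pairs known to fail."""
    failing = set(failing_pairs)
    filtered = {}
    for bench_name, models in missing_results.items():
        kept_models = {}
        for model_name, tasks in models.items():
            kept = [t for t in tasks if (model_name, t) not in failing]
            if kept:  # drop models that have nothing left to run
                kept_models[model_name] = kept
        filtered[bench_name] = kept_models
    return filtered


# Hypothetical sample data, not real leaderboard results.
sample_missing = {"bench_a": {"model_x": ["TaskOne", "TaskTwo"], "model_y": ["TaskOne"]}}
sample_failing = [("model_y", "TaskOne"), ("model_x", "TaskTwo")]
sample_filtered = drop_failing(sample_missing, sample_failing)
```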
Some pretty essential results seem to be missing from the new leaderboard.
Here's a list of things that we should probably fix before releasing the leaderboard:
MTEB(Multilingual)
I have only looked into models we promised to run in the review response, the problems might be more widespread
MTEB(eng, classic)
Problematic tasks:
Problematic models:
- [ ] ArxivClusteringP2P
- [ ] BiorxivClusteringP2P
- [ ] BiorxivClusteringS2S
- [ ] DBPedia
- [ ] HotpotQA
- [ ] MedrxivClusteringP2P
- [ ] MedrxivClusteringS2S