Improve leaderboard 2.0 readability #1317

KennethEnevoldsen · 2024-10-24T11:44:15Z

A couple of comments for readability:

remove org from model name
and add reference link feat: Update metadata for all models #1316
multiply scores by 100 and keep one decimal, e.g. 78.1 (@orionw not sure if this also works for followIR?). I believe anything more precise does not translate to a meaningful performance difference
It might be ideal to move the "performance pr. task" to a separate table in a fold-down menu.
We could also add highlighting of highest scores using styling elements
Add ranks to summary table
Replace average rank with borda count and rank (Implement borda count and rank in leaderboard #1344 )

Originally posted by @KennethEnevoldsen in #1312 (comment)

x-tabdeveloping · 2024-10-24T11:45:31Z

By fold-down menu you mean an accordion right?

KennethEnevoldsen · 2024-10-24T11:52:31Z

It was my initial idea yes, but I suppose multiple things could work - tabs would also be an option:

x-tabdeveloping · 2024-10-24T11:55:36Z

That ain't dumb! might try that one then.

orionw · 2024-10-24T15:26:30Z

multiply scores by 100 and keep one decimal, e.g. 78.1 (@orionw not sure if this also works for followIR?)

It does work for FollowIR!

Also is the v2 leaderboard up somewhere or is this a picture from development?

x-tabdeveloping · 2024-10-24T16:39:54Z

It's still in development. I'm using the leaderboard_2, brnach for new changes. You can run it by:

from mteb.leaderboard import demo

demo.launch()

x-tabdeveloping · 2024-10-25T07:02:04Z

I can host a demo version on my HF profile btw if it's something we'd be interested in having @orionw

orionw · 2024-10-25T12:06:21Z

Ah, no problem @x-tabdeveloping! For some reason I misunderstood and thought it was already up. Thanks for the offer, but no need to add extra work during your development. It’s looking great already though! 🚀

x-tabdeveloping · 2024-10-28T07:29:27Z

Here's a demo of the current version: https://huggingface.co/spaces/kardosdrur/mmteb_leaderboard_demo

tomaarsen · 2024-10-29T22:32:13Z

Thanks for sharing the dev version!

Muennighoff · 2024-11-08T18:45:58Z

The leaderboard looks really amazing! Probably already planned but

some indication of contamination would be great as we discussed (maybe we just manually add it to the metadata for now where we know it (e.g. trained_on_{task_name}_{task_split}: true or training_datasets: [(Emotion, train), (Amazon, test), ...] or something else) and invite users to update the metadata via PR)
maybe adding some statistics, e.g. on the current LB we have the below at the bottom

Total Datasets: 213
Total Languages: 113
Total Scores: 88857
Total Models: 469

(could be auto-displayed per-benchmark when selecting a benchmark)

Maybe we want to link it with the arena somehow? (e.g. one dropdown option could be the arena and it links to the arena space; or we just have a banner at the bottom or top to motivate people to checkout the arena or similar; you probably have better ideas!)

x-tabdeveloping · 2024-11-11T07:37:10Z

@Muennighoff I'm on it!

x-tabdeveloping · 2024-11-11T07:50:17Z

Hey @Muennighoff what does Total scores mean?

Muennighoff · 2024-11-11T08:19:57Z

Total scores is the total number of scores i.e. how many numbers there are in the table. Maybe there's a better name for it 🤔

KennethEnevoldsen · 2024-11-11T09:13:35Z

Might be worth moving integration with Arena to a separate issue (It might work well with #1432). I think it might warrant some more discussion. To begin with we could also add it to the description of MTEB(eng, beta). Something like:

"English also has an arena-style benchmark for evaluating embeddings. You can check this out here".

x-tabdeveloping · 2024-11-11T14:50:45Z

I'm a bit stopped in my tracks because of glaring issues with Gradio's dataframes (1, 2). I have implemented the plot though, and will add overview info to the benchmarks' descriptions.

Muennighoff · 2024-11-27T17:23:57Z

New leaderboard looks great (https://huggingface.co/spaces/mteb/leaderboard_2_demo)!!

Some more feedback:

A lot of nan scores (e.g. for MTEB classic) - maybe making them empty instead?
Maybe task information should be sorted (either by name or task type)
I think the difference between the two means (Mean (Task), Mean (TaskType)) is not immediately clear - is it that one averages over task types first and then across the task type averages? Also not clear based on which the ranking is determined; maybe removing one of the means?
The column breaking is a bit odd e.g. how the S of STS lands on a newline
Some task <> benchmark associations seem wrong, e.g. for MTEB(fra) it shows non-French languages and datasets like Bornholm
Probably planned but still quite a few models missing; would be great to have them all! (e.g. only 42 models for MTEB classic vs 337 in the current leaderboard)

Muennighoff · 2024-11-27T17:28:18Z

For BRIGHT it seems like it does not have languages when selecting it and the BRIGHT Long version is gone. 🤔

Also if running more models can help in any way just let me know and I can run more!

Muennighoff · 2024-11-27T17:33:08Z

Also some results seem to not agree with each other between the old and new leaderboard e.g. the ranking and scores for the Law tab look quite different

x-tabdeveloping · 2024-11-27T17:44:20Z

Thanks for the feedback! I believe some of these issues have to do with the benchmarks.py file that we have in the package and not with the leaderboard itself, I'd recommend that we make issues of this and investigate.

Yet other things are problems with Gradio, about which I have filed issues a while ago and received no response. (for instance we have NaN's in one table and nothing in the other despite both of them being created the same way, this is because their pandas styling doesn't work as intended, or the column wrapping, I believe I can't do anything about this unless they fix it in the library).

Missing models are due to the fact that we lack metadata objects on these. We will implement quite a few of them soon, I will look into this.

Missing results are a strange one, since we should, in theory have results on the entirety of MTEB(classic), this will have to be investigated.

One mean is an overall mean the other one is mean over task type means. The ranks are calculated using borda counts, so neither of these are used for ranks. I can remove one of them or try to make it clearer qhat the difference is between them.

I'll make sure to sort tasks :)

Muennighoff · 2024-11-27T17:47:15Z

One mean is an overall mean the other one is mean over task type means. The ranks are calculated using borda counts, so neither of these are used for ranks. I can remove one of them or try to make it clearer qhat the difference is between them.

Maybe it'd be nice to explain the metric somewhere (e.g. when hovering over it, but not sure that is possible)

Samoed · 2024-11-27T18:34:42Z

Awesome work!

In MTEB(rus), almost all scores for retrieval are missing. I think this is because of MiracleRetrieval, which has the (MIRACL) suffix.

Missing models are due to the fact that we lack metadata objects on these. We will implement quite a few of them soon, I will look into this.

Do you mean the MTEB repository? Because I found models with model_meta in the results repository (e.g., NV-Embed-v2) that are not appearing on the leaderboard.

KennethEnevoldsen · 2024-11-27T19:23:25Z

Great feedback - I have added issues on most of the problems (to not miss any and allow us to distribute them among contributors)

Maybe it'd be nice to explain the metric somewhere (e.g. when hovering over it, but not sure that is possible)

completely agree I have added an outline in #1512 feel free to check them out.

In MTEB(rus), almost all scores for retrieval are missing. I think this is because of MiracleRetrieval, which has the (MIRACL) suffix.

@Samoed not sure what the problem is here will you add an issue on it?

Do you mean the MTEB repository? Because I found models with model_meta in the results repository (e.g., NV-Embed-v2) that are not appearing on the leaderboard.

we mean this (embedding-benchmark/mteb) repository, where there is not metadata for NV-Embed-v2 (we can add it though).

I am unsure if we should require metadata for a new model submission, but that is how it is currently.

KennethEnevoldsen · 2024-11-27T19:44:35Z

@Muennighoff where do you see the disagreement in law? (see #1516)

Samoed · 2024-11-27T20:09:07Z

@KennethEnevoldsen Now MiracleReranking has as main score ndcg_at_10, but previously it had NDCG@10(MIRACL)

KennethEnevoldsen · 2024-11-27T22:09:31Z

I believe this is due to a fix by @orionw, where he showed that there was no difference in scores

x-tabdeveloping · 2024-11-28T07:39:45Z

@Muennighoff I really wish we could do the hovering thing, but both me and Kenneth looked into this and it doesn't seem like it's possible to do with Gradio :(

x-tabdeveloping · 2024-11-28T07:40:55Z

Btw can we move this to a different thread (#1303 )? It's a bit weird to discuss everything, especially data related issues in a thread named Leaderboard 2.0 readability

KennethEnevoldsen · 2024-11-28T19:38:21Z

Will close this thread - let us move the discussions into the relevant subthreads)

x-tabdeveloping mentioned this issue Oct 24, 2024

WIP: Leaderboard UI improvements #1320

Merged

isaac-chung added the leaderboard issues related to the leaderboard label Nov 9, 2024

isaac-chung mentioned this issue Nov 9, 2024

Overview issue: Leaderboard 2.0 release #1405

Open

8 tasks

x-tabdeveloping mentioned this issue Nov 11, 2024

Leaderboard 2.0: added performance x n_parameters plot + more benchmark info #1437

Merged

KennethEnevoldsen mentioned this issue Nov 27, 2024

leaderboard v2.0: disagreement in scores between v2.0 and v1.0 #1516

Open

KennethEnevoldsen mentioned this issue Nov 27, 2024

leaderboard 2.0: Performance on subtasks (Bright retrieval) #1517

Open

KennethEnevoldsen closed this as completed Nov 28, 2024

Samoed mentioned this issue Dec 4, 2024

MIRACLRetrieval results are missing for most models. #1550

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve leaderboard 2.0 readability #1317

Improve leaderboard 2.0 readability #1317

KennethEnevoldsen commented Oct 24, 2024 •

edited by isaac-chung

Loading

x-tabdeveloping commented Oct 24, 2024 •

edited

Loading

KennethEnevoldsen commented Oct 24, 2024

x-tabdeveloping commented Oct 24, 2024

orionw commented Oct 24, 2024

x-tabdeveloping commented Oct 24, 2024

x-tabdeveloping commented Oct 25, 2024

orionw commented Oct 25, 2024

x-tabdeveloping commented Oct 28, 2024

tomaarsen commented Oct 29, 2024

Muennighoff commented Nov 8, 2024 •

edited

Loading

x-tabdeveloping commented Nov 11, 2024

x-tabdeveloping commented Nov 11, 2024

Muennighoff commented Nov 11, 2024

KennethEnevoldsen commented Nov 11, 2024

x-tabdeveloping commented Nov 11, 2024

Muennighoff commented Nov 27, 2024

Muennighoff commented Nov 27, 2024

Muennighoff commented Nov 27, 2024

x-tabdeveloping commented Nov 27, 2024

Muennighoff commented Nov 27, 2024

Samoed commented Nov 27, 2024

KennethEnevoldsen commented Nov 27, 2024 •

edited

Loading

KennethEnevoldsen commented Nov 27, 2024

Samoed commented Nov 27, 2024

KennethEnevoldsen commented Nov 27, 2024

x-tabdeveloping commented Nov 28, 2024

x-tabdeveloping commented Nov 28, 2024 •

edited

Loading

KennethEnevoldsen commented Nov 28, 2024

Improve leaderboard 2.0 readability #1317

Improve leaderboard 2.0 readability #1317

Comments

KennethEnevoldsen commented Oct 24, 2024 • edited by isaac-chung Loading

x-tabdeveloping commented Oct 24, 2024 • edited Loading

KennethEnevoldsen commented Oct 24, 2024

x-tabdeveloping commented Oct 24, 2024

orionw commented Oct 24, 2024

x-tabdeveloping commented Oct 24, 2024

x-tabdeveloping commented Oct 25, 2024

orionw commented Oct 25, 2024

x-tabdeveloping commented Oct 28, 2024

tomaarsen commented Oct 29, 2024

Muennighoff commented Nov 8, 2024 • edited Loading

x-tabdeveloping commented Nov 11, 2024

x-tabdeveloping commented Nov 11, 2024

Muennighoff commented Nov 11, 2024

KennethEnevoldsen commented Nov 11, 2024

x-tabdeveloping commented Nov 11, 2024

Muennighoff commented Nov 27, 2024

Muennighoff commented Nov 27, 2024

Muennighoff commented Nov 27, 2024

x-tabdeveloping commented Nov 27, 2024

Muennighoff commented Nov 27, 2024

Samoed commented Nov 27, 2024

KennethEnevoldsen commented Nov 27, 2024 • edited Loading

KennethEnevoldsen commented Nov 27, 2024

Samoed commented Nov 27, 2024

KennethEnevoldsen commented Nov 27, 2024

x-tabdeveloping commented Nov 28, 2024

x-tabdeveloping commented Nov 28, 2024 • edited Loading

KennethEnevoldsen commented Nov 28, 2024

KennethEnevoldsen commented Oct 24, 2024 •

edited by isaac-chung

Loading

x-tabdeveloping commented Oct 24, 2024 •

edited

Loading

Muennighoff commented Nov 8, 2024 •

edited

Loading

KennethEnevoldsen commented Nov 27, 2024 •

edited

Loading

x-tabdeveloping commented Nov 28, 2024 •

edited

Loading