Improve leaderboard 2.0 readability #1317
By fold-down menu you mean an accordion, right?
That ain't dumb! Might try that one then.
It does work for FollowIR! Also, is the v2 leaderboard up somewhere, or is this a picture from development?
It's still in development. I'm using the `leaderboard_2` branch for new changes. You can run it with:

```python
from mteb.leaderboard import demo

demo.launch()
```
I can host a demo version on my HF profile btw, if it's something we'd be interested in having @orionw
Ah, no problem @x-tabdeveloping! For some reason I misunderstood and thought it was already up. Thanks for the offer, but no need to add extra work during your development. It’s looking great already though! 🚀
Here's a demo of the current version: https://huggingface.co/spaces/kardosdrur/mmteb_leaderboard_demo
Thanks for sharing the dev version!
The leaderboard looks really amazing! Probably already planned, but
(could be auto-displayed per-benchmark when selecting a benchmark)
@Muennighoff I'm on it!
Hey @Muennighoff what does
Might be worth moving integration with Arena to a separate issue (it might work well with #1432); I think it warrants some more discussion. To begin with, we could also add it to the description of MTEB(eng, beta). Something like: "English also has an arena-style benchmark for evaluating embeddings. You can check this out here."
New leaderboard looks great (https://huggingface.co/spaces/mteb/leaderboard_2_demo)!! Some more feedback:
For BRIGHT, it seems like it does not have languages when selecting it, and the BRIGHT Long version is gone. 🤔 Also, if running more models can help in any way, just let me know and I can run more!
Also, some results seem to disagree between the old and new leaderboard, e.g. the ranking and scores for the Law tab look quite different.
Thanks for the feedback! I believe some of these issues have to do with the
Yet other things are problems with Gradio, about which I filed issues a while ago and have received no response. For instance, we have NaNs in one table and nothing in the other despite both being created the same way; this is because their pandas styling doesn't work as intended. The same goes for the column wrapping. I don't believe I can do anything about these unless they are fixed in the library.
Missing models are due to the fact that we lack metadata objects for them. We will implement quite a few of these soon; I will look into it.
Missing results are a strange one, since in theory we should have results on the entirety of MTEB(classic); this will have to be investigated.
One mean is an overall mean; the other is the mean over task-type means. The ranks are calculated using Borda counts, so neither of these is used for the ranking. I can remove one of them, or try to make it clearer what the difference between them is.
I'll make sure to sort tasks :)
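For anyone confused by the ranking discussion above, a Borda-count rank and the two kinds of means can be sketched roughly like this. This is a minimal illustration, not the leaderboard's actual implementation; the toy scores, task names, and task-type assignments are all made up:

```python
import pandas as pd

# Toy score table: rows are models, columns are tasks (made-up numbers).
scores = pd.DataFrame(
    {
        "TaskA": [0.9, 0.8, 0.7],
        "TaskB": [0.8, 0.6, 0.7],
        "TaskC": [0.1, 0.5, 0.4],
    },
    index=["model1", "model2", "model3"],
)

def borda_points(scores: pd.DataFrame) -> pd.Series:
    """Each task 'votes' by ranking the models; a model earns
    (n_models - rank) points per task, and the totals decide the order."""
    ranks = scores.rank(ascending=False, axis=0)  # rank 1 = best score on that task
    return (len(scores) - ranks).sum(axis=1).sort_values(ascending=False)

# The two "mean" columns that can disagree: overall mean across all tasks
# vs. mean of per-task-type means (task-type mapping here is hypothetical).
task_types = pd.Series({"TaskA": "Retrieval", "TaskB": "Retrieval", "TaskC": "STS"})
overall_mean = scores.mean(axis=1)
mean_of_type_means = scores.T.groupby(task_types).mean().T.mean(axis=1)

print(borda_points(scores))  # model1 ranks first despite model2's higher overall mean
```

Note how the three orderings need not agree, which is why showing two mean columns next to a Borda-based rank can confuse readers.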
Maybe it'd be nice to explain the metric somewhere (e.g. when hovering over it, but not sure that is possible).
Awesome work! In
Do you mean the MTEB repository? Because I found models with
Great feedback - I have added issues on most of the problems (to not miss any, and to allow us to distribute them among contributors).
Completely agree. I have added an outline in #1512; feel free to check it out.
@Samoed, not sure what the problem is here; will you add an issue on it?
We mean this (embedding-benchmark/mteb) repository, where there is no metadata for NV-Embed-v2 (we can add it though). I am unsure if we should require metadata for a new model submission, but that is how it currently works.
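For context, the metadata requirement discussed here boils down to supplying a small structured record per model. A rough stand-in sketch of the kind of fields involved (this is NOT mteb's actual metadata API; the class and field names below are assumptions, so check the embedding-benchmark/mteb repository for the real definition):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelRecord:
    """Hypothetical stand-in for a model metadata entry, not mteb's real class."""
    name: str                          # e.g. the HF model id, "nvidia/NV-Embed-v2"
    revision: str                      # commit hash or tag of the evaluated weights
    release_date: Optional[str] = None  # "YYYY-MM-DD", if known
    n_parameters: Optional[int] = None  # parameter count, if known

# A minimal record for the model mentioned above (illustrative values only).
record = ModelRecord(name="nvidia/NV-Embed-v2", revision="main")
print(record.name)
```

The point of requiring such a record at submission time is that the leaderboard can then display models without falling back on missing-value placeholders.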
@Muennighoff where do you see the disagreement in Law? (see #1516)
@KennethEnevoldsen Now
I believe this is due to a fix by @orionw, where he showed that there was no difference in scores.
@Muennighoff I really wish we could do the hovering thing, but both Kenneth and I looked into this, and it doesn't seem to be possible with Gradio :(
Btw, can we move this to a different thread (#1303)? It's a bit weird to discuss everything, especially data-related issues, in a thread named "Leaderboard 2.0 readability".
Will close this thread - let us move the discussions into the relevant subthreads.
A couple of comments for readability:
Originally posted by @KennethEnevoldsen in #1312 (comment)