standardize descriptive stats #1321

Closed
KennethEnevoldsen opened this issue Oct 24, 2024 · 2 comments · Fixed by #1375
Comments

@KennethEnevoldsen
Contributor

Currently, descriptive stats are quite inconsistent. This leads to problems, e.g. if we want to calculate the number of characters per task to estimate the number of compute tokens needed.

All of these calculations could be automated and already live in _calculate_metrics_from_split; however, they are not calculated for all datasets. It would be great to have a test that checks that these are calculated consistently across all tasks.
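A minimal sketch of what such a consistency check could look like. The key names in REQUIRED_KEYS and the helper find_inconsistent_tasks are hypothetical, chosen only for illustration; the actual output of _calculate_metrics_from_split may use different fields.

```python
# Hypothetical set of fields every task's descriptive stats should contain.
# The real fields produced by _calculate_metrics_from_split may differ.
REQUIRED_KEYS = {"num_samples", "average_text_length"}


def find_inconsistent_tasks(stats_by_task, required_keys=REQUIRED_KEYS):
    """Return {task_name: sorted list of missing keys} for incomplete tasks."""
    problems = {}
    for task_name, stats in stats_by_task.items():
        if stats is None:
            # No stats were computed at all for this task.
            problems[task_name] = sorted(required_keys)
            continue
        missing = set(required_keys) - set(stats)
        if missing:
            problems[task_name] = sorted(missing)
    return problems


# Made-up example data, not real task statistics.
example = {
    "TaskA": {"num_samples": 100, "average_text_length": 52.3},
    "TaskB": {"num_samples": 40},
    "TaskC": None,
}
```

A test could then assert that find_inconsistent_tasks returns an empty dict when run over all registered tasks, and print the offending task names otherwise.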

Additionally, this data is currently included in the metadata, which might not be ideal (it often requires copy-pasting, which can introduce errors). A solution could be to write it to a JSON file from which the data is fetched when needed. Tests can then fail if this cache is not complete.
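A rough sketch of the JSON cache idea, assuming one file that maps task names to their stats. The function names, the file layout, and the stats fields are all assumptions made for illustration, not an existing mteb API.

```python
import json
from pathlib import Path


def load_stats_cache(path):
    """Load the cache; return an empty dict if the file does not exist yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def save_stats_cache(path, cache):
    """Write the cache with sorted keys so diffs stay stable across runs."""
    with Path(path).open("w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2, sort_keys=True)


def missing_from_cache(task_names, cache):
    """Return the sorted names of tasks that have no entry in the cache."""
    return sorted(name for name in task_names if name not in cache)
```

A test can call missing_from_cache with the names of all tasks and fail when the returned list is non-empty, which keeps the cache complete without copy-pasting numbers into the metadata.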

@Samoed
Collaborator

Samoed commented Oct 24, 2024

I've been considering improvements to metadata_metrics as well.

A solution could be to write it to a JSON file from which the data is fetched when needed. Tests can then fail if this cache is not complete.

That's a great suggestion! Do you mean storing a JSON file with all the metadata directly in the mteb repository?

@KennethEnevoldsen
Contributor Author

Yep, shipped as part of the package.
