Currently, descriptive stats are quite inconsistent across tasks. This leads to problems, e.g. if we want to calculate the number of characters per task to estimate the number of compute tokens needed.
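For context, a minimal sketch of that kind of estimate (the characters-per-token ratio is a rough heuristic, not something fixed by mteb; the actual ratio depends on the tokenizer):

```python
from typing import Iterable

CHARS_PER_TOKEN = 4  # rough heuristic; the real ratio depends on the tokenizer


def estimate_tokens(texts: Iterable[str]) -> int:
    """Estimate compute tokens for a task from its total character count."""
    total_chars = sum(len(text) for text in texts)
    return total_chars // CHARS_PER_TOKEN


print(estimate_tokens(["What is descriptive statistics?", "A summary of data."]))
```

Without consistent character counts per task, an estimate like this can't be computed across the whole benchmark.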
All of these calculations could be automated and already exist in `_calculate_metrics_from_split`; however, they are not run for all datasets. It would be great to have a test that checks that these stats are calculated consistently across all tasks.
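A hedged sketch of such a test, assuming a recent mteb version exposing `get_tasks()`; the `descriptive_stats` metadata attribute and the `REQUIRED_KEYS` schema are assumptions for illustration, not mteb's confirmed API:

```python
import mteb  # assumes a version of mteb that exposes get_tasks()
import pytest

# Keys we expect each split's stats to contain; illustrative, not a confirmed schema.
REQUIRED_KEYS = {"n_samples", "avg_character_length"}


@pytest.mark.parametrize("task", mteb.get_tasks(), ids=lambda t: t.metadata.name)
def test_descriptive_stats_present(task):
    # `descriptive_stats` as a metadata attribute is an assumption here.
    stats = getattr(task.metadata, "descriptive_stats", None)
    assert stats, f"{task.metadata.name} has no descriptive stats"
    for split, split_stats in stats.items():
        missing = REQUIRED_KEYS - set(split_stats)
        assert not missing, f"{task.metadata.name}/{split} is missing {missing}"
```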
Additionally, this data is currently included in the metadata, which might not be ideal: it often requires copy-pasting, which can introduce errors. A solution could be to write the stats to a JSON file from which the data is fetched when needed. Tests could then fail if this cache is incomplete.
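A sketch of what that cache could look like (the file location and helper names are illustrative, not a proposed final design): stats are computed once, written to disk, and read back when needed, so a test can fail on any task whose entry is missing.

```python
import json
from pathlib import Path
from typing import Any, Optional

STATS_FILE = Path("descriptive_stats.json")  # assumed cache location


def save_stats(task_name: str, stats: dict[str, Any]) -> None:
    """Write one task's computed stats into the shared JSON cache."""
    cache = json.loads(STATS_FILE.read_text()) if STATS_FILE.exists() else {}
    cache[task_name] = stats
    STATS_FILE.write_text(json.dumps(cache, indent=2))


def load_stats(task_name: str) -> Optional[dict[str, Any]]:
    """Fetch a task's stats from the cache; None means the cache is not full."""
    if not STATS_FILE.exists():
        return None
    return json.loads(STATS_FILE.read_text()).get(task_name)
```

This would also remove the copy-paste step: the metadata could reference the cached values instead of duplicating them.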