Merge pull request #8 from ai-forever/updates_for_v110
Updates for v1.1.0
LSinev authored Jan 12, 2024
2 parents 3e40ea0 + 3849737 commit 4490a2f
Showing 114 changed files with 41,454 additions and 4,394 deletions.
29 changes: 29 additions & 0 deletions .gitignore
@@ -0,0 +1,29 @@

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Pyre type checker
.pyre/

# vim stuff
*.ropeproject
*.swp

# intellij
*.idea

# models
*.model
*.pt
*.pth
*.ckpt
*.bin
.s3_cache/

# MAC
.DS_Store

# DVC
.dvc/plots
105 changes: 62 additions & 43 deletions README.md
@@ -11,7 +11,7 @@
<img alt="License" src="https://img.shields.io/badge/License-MIT-yellow.svg">
</a>
<a href="https://github.com/ai-forever/MERA/releases">
<img alt="Release" src="https://img.shields.io/badge/release-v1.0.0-blue">
<img alt="Release" src="https://img.shields.io/badge/release-v1.1.0-blue">
</a>

</p>
@@ -21,56 +21,54 @@
</p>
</h2>


## About MERA

MERA benchmark brings together all industry and academic players in one place to study the capabilities of fundamental models, draw attention to AI problems, develop collaboration within the Russian Federation and in the international arena, and create an independent unified system for measuring all current models.
The MERA benchmark brings together all industry and academic players in one place to study the capabilities of fundamental models, draw attention to AI problems, develop collaboration within the Russian Federation and in the international arena, and create an independent unified system for measuring all current models. This repository is a customized version of the original [**Language Model Evaluation Harness**](https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.3.0) (**LM-Harness** `v0.3.0`).

Our contributions of this project are:
Our contributions to this project are:

- Instruction-based tasks available in 🤗HuggingFace dataset card [link](https://huggingface.co/datasets/ai-forever/MERA).
- LM-Harness evaluation code for models.
- Website of the benchmark with the [Leaderboard](https://mera.a-ai.ru/) and the scoring system inside.
- Instruction-based tasks available on 🤗 HuggingFace [dataset card](https://huggingface.co/datasets/ai-forever/MERA).
- Customized version of LM-Harness evaluation code for models (`v0.3.0`).
- Benchmark website with the [Leaderboard](https://mera.a-ai.ru/en/leaderboard) and the scoring submission system.
- Baselines of the open models and Human Benchmark.

`v1.0.0`

The MERA benchmark includes 21 text tasks (17 base tasks + 4 diagnostic tasks). See the table below for a complete list.

| Name | Task Name | Task Type | Test Size | N-shots | Metrics |
| --- | --- | --- | --- | --- | --- |
| BPS | bps | Code, Math | 1000 | 2 | acc |
| CheGeKa | chegeka | World Knowledge | 416 | 4 | f1 / em |
| LCS | lcs | Code, Math | 500 | 2 | acc |
| MathLogicQA | mathlogicqa | Math + Logic | 1143 | 5 | acc |
| MultiQ | multiq | Reasoning QA | 900 | 0 | f1 / em |
| PARus | parus | Common Sense | 500 | 0 | acc |
| RCB | rcb | NLI | 438 | 0 | f1_macro / acc |
| ruDetox | rudetox | Ethics | 1000 | 0 | sta, sim, fl, j |
| ruEthics | ruethics | Ethics | 645 | 0 | 5 mcc |
| ruHateSpeech | ruhatespeech | Ethics | 268 | 0 | acc |
| ruHHH | ruhhh | Ethics | 178 | 0 | acc |
| ruHumanEval | ruhumaneval | Math, Code, PLP | 164 | 0 | pass@k |
| ruMMLU | rummlu | Reasoning | 961 | 5 | acc |
| ruModAr | rumodar | Math, Logic | 6000 | 0 | acc |
| ruMultiAr | rumultiar | Math | 1024 | 5 | acc |
| ruOpenBookQA | ruopenbookqa | World Knowledge | 400 | 5 | f1_macro / acc |
| ruTiE | rutie | Reasoning, Dialogue Context, Memory | 430 | 0 | acc |
| ruWorldTree | ruworldtree | World Knowledge | 525 | 5 | f1_macro / acc |
| RWSD | rwsd | Reasoning | 260 | 0 | acc |
| SimpleAr | simplear | Math | 1000 | 5 | acc |
| USE | use | Exam | 900 | 0 | grade_norm |
| MathLogicQA | mathlogicqa | Math, Logic | 1143 | 5 | Acc |
| MultiQ | multiq | Reasoning | 900 | 0 | EM / F1 |
| PARus | parus | Common Sense | 500 | 0 | Acc |
| RCB | rcb | NLI | 438 | 0 | Acc / F1_macro |
| ruModAr | rumodar | Math, Logic | 6000 | 0 | Acc |
| ruMultiAr | rumultiar | Math | 1024 | 5 | Acc |
| ruOpenBookQA | ruopenbookqa | World Knowledge | 400 | 5 | Acc / F1_macro |
| ruTiE | rutie | Reasoning, Dialogue Context, Memory | 430 | 0 | Acc |
| ruWorldTree | ruworldtree | World Knowledge | 525 | 5 | Acc / F1_macro |
| RWSD | rwsd | Reasoning | 260 | 0 | Acc |
| SimpleAr | simplear | Math | 1000 | 5 | Acc |
| BPS | bps | Code, Math | 1000 | 2 | Acc |
| CheGeKa | chegeka | World Knowledge | 416 | 4 | EM / F1 |
| LCS | lcs | Code, Math | 500 | 2 | Acc |
| ruHumanEval | ruhumaneval | Code | 164 | 0 | Pass@k |
| ruMMLU | rummlu | Reasoning | 961 | 5 | Acc |
| USE | use | Exam | 900 | 0 | Grade_norm |
| ruDetox | rudetox | Ethics | 800 | 0 | J(STA, SIM, FL) |
| ruEthics | ruethics | Ethics | 1935 | 0 | 5 MCC |
| ruHateSpeech | ruhatespeech | Ethics | 265 | 0 | Acc |
| ruHHH | ruhhh | Ethics | 178 | 0 | Acc |

Our aim is to evaluate all the models:

- in the same scenarios;
- using the same metrics;
- with the same adaptation strategy (e.g., prompting);
- allowing for controlled and clear comparisons.
- thus providing the opportunity for controlled and clear comparisons.

**Only united**, with the **support of all the companies** that are creating the foundation models in our country and beyond we could design the fair and transparent leaderboards for the models evaluation.
MERA is a collaborative project created by a union of industry and academia with the **support of all the companies** that are creating foundation models, to ensure fair and transparent leaderboards for model evaluation.

*Our team and partners:*
*We express our gratitude to our team and partners:*

*SberDevices, Sber AI, Yandex, Skoltech AI, MTS AI, NRU HSE, Russian Academy of Sciences, etc.*

@@ -80,21 +78,24 @@ Our aim is to evaluate all the models:

The repository has the following structure:

- [`examples`](examples/instruction.ipynb) - the examples of loading and using data.
- [`humanbenchmarks`](humanbenchmarks/README.md) - materials and code for human evaluation.
- [`modules`](modules/scoring/README.md) - the examples of scoring scripts that are used on the website for scoring your submission.
- [`lm-evaluation-harness`](lm-evaluation-harness) - a framework for few-shot evaluation of language models.
- [`examples`](examples/instruction.ipynb): examples of loading and using the data.
- [`humanbenchmarks`](humanbenchmarks/README.md): materials and code for the human evaluation.
- [`modules`](modules/scoring/README.md): examples of the scoring scripts used on the website to score your submission.
- [`lm-evaluation-harness`](lm-evaluation-harness): a framework for few-shot evaluation of language models.


## Submit to MERA

- To see the datasets use the HuggingFace datasets interface. See the example of the datasets in the prepared Jupyter Notebook.
- To run your model on the all datasets please use the code of lm-harness. The result of the code is the archive in ZIP format for the submission.
- Register on the website and submit your the ZIP. The results will be available for you privately in the account.
## The process of submission is the following:
- to view the datasets, use the [HuggingFace preview](https://huggingface.co/datasets/ai-forever/MERA/viewer/ruethics) or run the prepared [instruction](https://github.com/ai-forever/MERA/blob/main/examples/instruction.ipynb) (a loading sketch is also shown after this list);
- clone the MERA benchmark [repository](https://github.com/ai-forever/MERA);
- to get the submission files, use the [shell script](https://github.com/ai-forever/MERA/blob/main/lm-evaluation-harness/README.md#run-full-benchmark-with-bash-script) and the provided customized **lm-harness** code (the actual model is not required for submission and evaluation);
- run your model on all the datasets using the lm-harness code: the result is a ZIP archive for the submission;
- register on the website;
- upload the submission files (ZIP) via the platform interface for the automatic assessment.
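
As an illustration, below is a minimal sketch of loading a single MERA task with the 🤗 `datasets` library. The config name (`parus`) and the split name are assumptions based on the task table above; the prepared notebook in [`examples`](examples/instruction.ipynb) remains the authoritative reference.

```python
from datasets import load_dataset

# Hypothetical example: load one MERA task by its config name.
# The config name ("parus") and split name ("test") are assumptions
# based on the task table above; see examples/instruction.ipynb for
# the official loading code.
parus = load_dataset("ai-forever/MERA", name="parus")

sample = parus["test"][0]
print(sample["instruction"])  # instruction template with placeholders
print(sample["inputs"])       # the task input(s)
```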

*Note that the evaluation result is displayed in the user's account and is kept **private**. Those who want to make their submission results public can use the "Publish" function. Once validation of the submission is approved, the model's overall score will be shown publicly.*
*The generation parameters, prompts, and few-shot/zero-shot settings are fixed. You can vary them for your own purposes, but if you want to submit your results to the public leaderboard, check that these parameters are the same and please add the logs. We have to be sure that the model evaluation scenarios are the same and reproducible.*

We provide the[sample submission](modules/scoring/examples) for you to check the format.
We provide the [sample submission](modules/scoring/examples) for you to check the format.

The whole MERA evaluation process is shown in the figure below:

@@ -104,4 +105,22 @@

📌 This is the first, text-only version of the benchmark. We plan to expand and develop it in the future with new tasks and multimodality.

Feel free to ask any questions regarding our work, write on email [email protected]. If you have ideas and new tasks feel free to suggest them, it’s important! If you see any bugs, or you know how to make the code better please suggest the fixes via pull-requests and issues in this official github 🤗 We will be glad to get the feedback in any way.
Feel free to ask any questions regarding our work by writing to [email protected]. If you have ideas and new tasks, feel free to suggest them, **it’s important!** If you see any bugs or know how to make the code better, please suggest fixes via pull requests and issues in this official GitHub repository 🤗. We will be glad to receive feedback in any form.


## Cite as

```
@misc{fenogenova2024mera,
title={{MERA}: A Comprehensive {LLM} Evaluation in {Russian}},
author={Alena Fenogenova and Artem Chervyakov and Nikita Martynov and Anastasia Kozlova and Maria Tikhonova and Albina Akhmetgareeva and Anton Emelyanov and Denis Shevelev and Pavel Lebedev and Leonid Sinev and Ulyana Isaeva and Katerina Kolomeytseva and Daniil Moskovskiy and Elizaveta Goncharova and Nikita Savushkin and Polina Mikhailova and Denis Dimitrov and Alexander Panchenko and Sergei Markov},
year={2024},
eprint={2401.04531},
url = {https://arxiv.org/abs/2401.04531},
eprinttype={arXiv},
archivePrefix={arXiv},
primaryClass={cs.CL},
journal={arXiv},
volume={2401.04531}
}
```
70 changes: 70 additions & 0 deletions docs/dataset_cards/en/bps.md
@@ -0,0 +1,70 @@
# **BPS**

## Task Description

The balanced sequence is an algorithmic task from [BIG-bench](https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/cs_algorithms/valid_parentheses). The primary purpose of this task is to measure language models' ability to learn CS algorithmic concepts like stacks, recursion, or dynamic programming.

Each example contains a parentheses sequence. The model's goal is to correctly predict whether the sequence is balanced (a minimal checker is sketched below).

An input string is valid if:

1. Open brackets are closed by brackets of the same type.
2. Open brackets are closed in the correct order.
3. Every closing bracket has a corresponding opening bracket of the same type.
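
To make the rules concrete, below is a minimal stack-based checker. This is an illustrative sketch only, not the benchmark's scoring code; it returns "1" for a balanced sequence and "0" otherwise, mirroring the format of the `outputs` field.

```python
# Illustrative sketch only, not the benchmark's own code.
# Tokens are separated by spaces, as in the dataset's "inputs" field.
PAIRS = {")": "(", "]": "[", "}": "{"}

def is_balanced(sequence: str) -> str:
    stack = []
    for token in sequence.split():
        if token in "([{":
            stack.append(token)
        elif token in PAIRS:
            if not stack or stack.pop() != PAIRS[token]:
                return "0"
    return "1" if not stack else "0"

print(is_balanced("( ) [ { } ]"))  # -> "1"
print(is_balanced("[ ] } {"))      # -> "0"
```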

**Keywords:** algorithms, numerical response, context length, parentheses, binary answer

**Authors:** Harsh Mehta, Behnam Neyshabur

### Motivation

Algorithms are a way to extrapolate examples and are some of the most concise descriptions of a pattern. In that sense, the ability of language models to learn them is a prominent measure of intelligence.

## Dataset Description

### Data Fields

- `instruction` is a string containing instructions for the task and information about the requirements for the model output format;
- `inputs` is an example of the parentheses sequence;
- `outputs` is a string containing the correct answer: “1” if the parentheses sequence is valid, “0” otherwise;
- `meta` is a dictionary containing meta information:
- `id` is an integer indicating the index of the example.

### Data Instances

Below is an example from the dataset:

```json
{
"instruction": "На вход подается последовательность скобок: \"{inputs}\"\nНеобходимо ответить сбалансирована ли данная последовательность. Если последовательность сбалансирована - выведите 1, иначе 0",
"inputs": "[ ] } { [ ] { ) [ } ) ) { ( ( ( ) ] } {",
"outputs": "0",
"meta": {
"id": 40
}
}
```

### Data Splits

The train set consists of 250 examples, and the test set includes 1000 examples.

### Prompts

Eight prompts of varying difficulty were created for this task. Example:

`"Проверьте, сбалансирована ли входная последовательность скобок.\n"{inputs}"\nВыведите 1, если да и 0 в противном случае. Сперва закрывающей скобкой своего типа должна закрываться последняя из открытых скобок, и лишь потом соответствующей закрывающей скобкой может закрываться та, что была открыта перед ней."`.

### Dataset Creation

Parentheses sequences of lengths 2, 4, 8, 12, and 20 were generated with the following length distribution: `{20: 0.336, 12: 0.26, 8: 0.24, 4: 0.14, 2: 0.024}` for the train set and `{20: 0.301, 12: 0.279, 8: 0.273, 4: 0.121, 2: 0.026}` for the test set.
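
The exact generation script is not included in this card; the sketch below only illustrates how sequence lengths could be sampled from the stated train distribution (the bracket content itself, and the balance of positive and negative examples, would require additional logic).

```python
import random

# Hypothetical illustration of the length sampling described above;
# not the authors' actual generation code.
LENGTHS = [20, 12, 8, 4, 2]
TRAIN_WEIGHTS = [0.336, 0.26, 0.24, 0.14, 0.024]
BRACKETS = "()[]{}"

def sample_sequence() -> str:
    length = random.choices(LENGTHS, weights=TRAIN_WEIGHTS, k=1)[0]
    return " ".join(random.choice(BRACKETS) for _ in range(length))

print(sample_sequence())
```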

## Evaluation

### Metrics

The task is evaluated using Accuracy.

### Human benchmark

The human benchmark is measured on a subset of size 100 (sampled with the same original distribution). The accuracy for this task is `1.0`.
81 changes: 81 additions & 0 deletions docs/dataset_cards/en/chegeka.md
@@ -0,0 +1,81 @@
# **CheGeKa**

## Task Description

CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK and belongs to the open-domain question-answering group of tasks. The dataset was created based on the [corresponding dataset](https://tape-benchmark.com/datasets.html#chegeka) from the TAPE benchmark [1].

**Keywords:** Reasoning, World Knowledge, Logic, Question-Answering, Open-Domain QA

**Authors:** Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova, Vladislav Mikhailov

### Motivation

The task can be considered the most challenging in terms of reasoning, knowledge, and logic: the QA pairs require a free-form response (no answer choices), and the correct answer is formed by a long chain of causal relationships between facts and associations.

## Dataset Description

### Data Fields

- `meta` is a dictionary containing meta-information about the example:
- `id` is the task ID;
- `author` is the author of the question;
- `tour_name` is the name of the game in which the question was used;
- `tour_link` is a link to the game in which the question was used (None for the test set);
- `instruction` is an instructional prompt specified for the current task;
- `inputs` is a dictionary containing the following input information:
- `text` is a text fragment with a question from the game “What? Where? When?”;
- `topic` is a string containing the category of the question;
- `outputs` is a string containing the correct answer to the question.

### Data Instances

Each instance in the dataset contains an instruction, a question, the topic of the question, the correct answer, and all the meta-information. Below is an example from the dataset:

```json
{
"instruction": "Вы участвуете в викторине “Что? Где? Когда?”. Внимательно прочитайте вопрос из категории \"{topic}\" и ответьте на него.\nВопрос: {text}\nВ качестве ответа запишите только ваш вариант без дополнительных объяснений.\nОтвет:",
"inputs": {
"text": "В корриде, кроме быка, он тоже играет одну из главных ролей.",
"topic": "\"ТОР\""
},
"outputs": "Тореадор",
"meta": {
"id": 7571,
"author": "Максим Стасюк",
"tour_name": "Своя игра. ШДК им. Рабиндраната Дебендранатовича Тагора",
"tour_link": "https://db.chgk.info/tour/tagor02"
}
}
```
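
Presumably, the final prompt is obtained by filling the placeholders of `instruction` with the fields of `inputs`; the sketch below shows that substitution (an assumption about the exact mechanics, which may differ in the evaluation code).

```python
# Assumed prompt construction: fill the placeholders in "instruction"
# with the fields of "inputs". The evaluation code may do this differently.
instruction = (
    "Вы участвуете в викторине “Что? Где? Когда?”. Внимательно прочитайте вопрос "
    "из категории \"{topic}\" и ответьте на него.\nВопрос: {text}\nВ качестве ответа "
    "запишите только ваш вариант без дополнительных объяснений.\nОтвет:"
)
inputs = {
    "text": "В корриде, кроме быка, он тоже играет одну из главных ролей.",
    "topic": "\"ТОР\"",
}

prompt = instruction.format(**inputs)
print(prompt)
```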

### Data Splits

The dataset consists of 29376 training examples (train set) and 416 test examples (test set).

### Prompts

We use four different prompts written in natural language for this task. An example of the prompt is given below:

`"Вы участвуете в викторине “Что? Где? Когда?”. Категория вопроса: {topic}\nВнимательно прочитайте вопрос и ответьте на него: {text}\nОтвет:"`.

### Dataset Creation

The dataset was created using the corresponding dataset from the TAPE benchmark [1], which is, in turn, based on the original corpus of the CheGeKa game introduced in [2].

## Evaluation

### Metrics

The dataset is evaluated via two metrics: F1-score and Exact Match (EM).
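
For illustration, one possible implementation of the two metrics is sketched below, in the spirit of SQuAD-style QA scoring; the normalization actually applied by the MERA scoring code (casing, punctuation, etc.) may differ.

```python
from collections import Counter

# Sketch of Exact Match and token-level F1; the normalization used by
# the official scoring code may differ.
def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Тореадор", "тореадор"), token_f1("тореадор", "Тореадор"))  # 1.0 1.0
```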

### Human Benchmark

The human benchmark was measured on the test set via a Yandex.Toloka project with an overlap of 3 reviewers per task.

The F1-score / Exact Match results are `0.719` / `0.645`, respectively.

## References

[1] Taktasheva, Ekaterina, et al. "TAPE: Assessing Few-shot Russian Language Understanding." Findings of the Association for Computational Linguistics: EMNLP 2022. 2022.

[2] Mikhalkova, Elena, and Alexander A. Khlyupin. "Russian Jeopardy! Data Set for Question-Answering Systems." Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022.
