add xstest to readme and changelog
aisi-inspect committed Sep 5, 2024
1 parent de43b7f commit ba03d98
Showing 2 changed files with 20 additions and 19 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -17,7 +17,7 @@
- Remove `chdir` option from `@tasks` (tasks now always chdir during execution).
- Do not process `.env` files in task directories (all required vars should be specified in the global `.env`).
- Only enable `strict` mode for OpenAI tool calls when all function parameters are required.
- Added [MMMU](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/mmmu), [CommonsenseQA](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/commonsense_qa) and [MMLU-Pro](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/mmlu_pro) benchmarks.
- Added [MMMU](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/mmmu), [CommonsenseQA](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/commonsense_qa), [MMLU-Pro](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/mmlu_pro), and [XSTest](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/xstest) benchmarks.
- Introduce two new scorers, `f1` (precision and recall in text matching) and `exact` (whether normalized text matches exactly).

## v0.3.25 (25 August 2024)
37 changes: 19 additions & 18 deletions benchmarks/README.md
@@ -2,21 +2,22 @@

This directory contains evals for several benchmarks. Datasets for evals are not embedded in the repository; they are downloaded either directly from their source URL or via Hugging Face datasets. To use Hugging Face datasets, please install the `datasets` package with `pip install datasets`.
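
For example, a minimal sketch of installing the optional dependency and running one of the evals below with the Inspect CLI (run from this directory; the model name is illustrative and any model available to you can be substituted):

```bash
# install support for Hugging Face datasets
pip install datasets

# run a benchmark task against a model of your choice
inspect eval arc.py --model openai/gpt-4o
```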

| Benchmark | Reference | Code | Dataset |
|---------------------|-----------------|----------------:|-----------------|
| MMLU: Measuring Massive Multitask Language Understanding | <https://arxiv.org/abs/2009.03300> | [mmlu.py](mmlu.py) | Download |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | <https://arxiv.org/abs/2009.03300> | [mmlu_pro.py](mmlu_pro/mmlu_pro.py) | Hugging Face |
| MATH: Measuring Mathematical Problem Solving With the MATH Dataset | <https://arxiv.org/abs/2103.03874> | [mathematics.py](mathematics/mathematics.py) | Download |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | <https://arxiv.org/abs/2311.12022> | [gpqa.py](gpqa.py) | Download |
| ARC: AI2 Reasoning Challenge | <https://arxiv.org/abs/1803.05457> | [arc.py](arc.py) | Hugging Face |
| GSM8K: Training Verifiers to Solve Math Word Problems | <https://arxiv.org/abs/2110.14168> | [gsm8k.py](gsm8k.py) | Hugging Face |
| HellaSwag: Can a Machine Really Finish Your Sentence? | <https://arxiv.org/abs/1905.07830> | [hellaswag.py](hellaswag.py) | Hugging Face |
| PIQA: Physical Interaction: Question Answering | <https://arxiv.org/abs/1911.11641> | [piqa.py](piqa.py) | Hugging Face |
| BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions | <https://arxiv.org/abs/1905.10044> | [boolq.py](boolq.py) | Hugging Face |
| TruthfulQA: Measuring How Models Mimic Human Falsehoods | <https://arxiv.org/abs/2109.07958v2> | [truthfulqa.py](truthfulqa.py) | Hugging Face |
| HumanEval: Evaluating Large Language Models Trained on Code | <https://arxiv.org/pdf/2107.03374> | [humaneval.py](humaneval/humaneval.py) | Hugging Face |
| DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs | <https://arxiv.org/pdf/1903.00161> | [drop.py](drop/drop.py) | Hugging Face |
| WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale | <https://arxiv.org/pdf/1907.10641> | [winogrande.py](winogrande/winogrande.py) | Hugging Face |
| RACE-H: A benchmark for testing reading comprehension and reasoning abilities of neural models | <https://arxiv.org/abs/1704.04683> | [race-h.py](race-h/race-h.py) | Hugging Face |
| MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark | <https://arxiv.org/abs/2311.16502> | [mmmu.py](mmmu/mmmu.py) | Hugging Face |
| CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | <https://arxiv.org/pdf/1811.00937v2> | [commonsense_qa.py](commonsense_qa/commonsense_qa.py) | Hugging Face |
| XSTest: A benchmark for identifying exaggerated safety behaviours in LLMs | <https://arxiv.org/abs/2308.01263> | [xstest.py](xstest/xstest.py) | Hugging Face |
