add xstest to readme and changelog
aisi-inspect committed Sep 5, 2024
1 parent de43b7f commit ba03d98
Showing 2 changed files with 20 additions and 19 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -17,7 +17,7 @@
- Remove `chdir` option from `@tasks` (tasks now always chdir during execution).
- Do not process `.env` files in task directories (all required vars should be specified in the global `.env`).
- Only enable `strict` mode for OpenAI tool calls when all function parameters are required.
- Added [MMMU](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/mmmu), [CommonsenseQA](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/commonsense_qa) and [MMLU-Pro](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/mmlu_pro) benchmarks.
- Added [MMMU](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/mmmu), [CommonsenseQA](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/commonsense_qa), [MMLU-Pro](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/mmlu_pro), and [XSTest](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/benchmarks/xstest) benchmarks.
- Introduce two new scorers, `f1` (precision and recall in text matching) and `exact` (whether normalized text matches exactly).

## v0.3.25 (25 August 2024)
37 changes: 19 additions & 18 deletions benchmarks/README.md
@@ -2,21 +2,22 @@

This directory contains evals for several benchmarks. Datasets for evals are not embedded in the repository; they are downloaded either directly from their source URL or via Hugging Face datasets. To use Hugging Face datasets, please install the `datasets` package with `pip install datasets`.
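
For example, a minimal sketch of installing the optional dependency and running one of the evals below with the Inspect CLI (run from this directory; the model name is illustrative and any model available to you can be substituted):

```bash
# install support for Hugging Face datasets
pip install datasets

# run a benchmark task against a model of your choice
inspect eval arc.py --model openai/gpt-4o
```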

| Benchmark | Reference | Code | Dataset |
|---------------------|-----------------|----------------:|-----------------|
| MMLU: Measuring Massive Multitask Language Understanding | <https://arxiv.org/abs/2009.03300> | [mmlu.py](mmlu.py) | Download |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | <https://arxiv.org/abs/2009.03300> | [mmlu_pro.py](mmlu_pro/mmlu_pro.py) | Hugging Face |
| MATH: Measuring Mathematical Problem Solving With the MATH Dataset | <https://arxiv.org/abs/2103.03874> | [mathematics.py](mathematics/mathematics.py) | Download |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | <https://arxiv.org/abs/2311.12022> | [gpqa.py](gpqa.py) | Download |
| ARC: AI2 Reasoning Challenge | <https://arxiv.org/abs/1803.05457> | [arc.py](arc.py) | Hugging Face |
| GSM8K: Training Verifiers to Solve Math Word Problems | <https://arxiv.org/abs/2110.14168> | [gsm8k.py](gsm8k.py) | Hugging Face |
| HellaSwag: Can a Machine Really Finish Your Sentence? | <https://arxiv.org/abs/1905.07830> | [hellaswag.py](hellaswag.py) | Hugging Face |
| PIQA: Physical Interaction: Question Answering | <https://arxiv.org/abs/1911.11641> | [piqa.py](piqa.py) | Hugging Face |
| BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions | <https://arxiv.org/abs/1905.10044> | [boolq.py](boolq.py) | Hugging Face |
| TruthfulQA: Measuring How Models Mimic Human Falsehoods | <https://arxiv.org/abs/2109.07958v2> | [truthfulqa.py](truthfulqa.py) | Hugging Face |
| HumanEval: Evaluating Large Language Models Trained on Code | <https://arxiv.org/pdf/2107.03374> | [humaneval.py](humaneval/humaneval.py) | Hugging Face |
| DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs | <https://arxiv.org/pdf/1903.00161> | [drop.py](drop/drop.py) | Hugging Face |
| WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale | <https://arxiv.org/pdf/1907.10641> | [winogrande.py](winogrande/winogrande.py) | Hugging Face |
| RACE-H: A benchmark for testing reading comprehension and reasoning abilities of neural models | <https://arxiv.org/abs/1704.04683> | [race-h.py](race-h/race-h.py) | Hugging Face |
| MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark | <https://arxiv.org/abs/2311.16502> | [mmmu.py](mmmu/mmmu.py) | Hugging Face |
| CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | <https://arxiv.org/pdf/1811.00937v2> | [commonsense_qa.py](commonsense_qa/commonsense_qa.py) | Hugging Face |
| XSTest: A benchmark for identifying exaggerated safety behaviours in LLMs | <https://arxiv.org/abs/2308.01263> | [xstest.py](xstest/xstest.py) | Hugging Face |
