Commit 140c3d6 (parent 671a239): 121 changed files with 3,514 additions and 1,811 deletions.
## Benchmarks

This directory contains evals for several benchmarks. Datasets for evals are not embedded in the repository but are instead stored using either Git-LFS or Hugging Face (see below for instructions on the additional requirements for reading datasets from these sources).

| Benchmark | Reference | Code | Dataset |
|------------------------------------------------------------------------|------------------------------------|---------------------------------:|--------------|
| MMLU: Measuring Massive Multitask Language Understanding | <https://arxiv.org/abs/2009.03300> | [mmlu.py](mmlu.py) | Git-LFS |
| MATH: Measuring Mathematical Problem Solving With the MATH Dataset | <https://arxiv.org/abs/2103.03874> | [mathematics.py](mathematics.py) | Git-LFS |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | <https://arxiv.org/abs/2311.12022> | [gpqa.py](gpqa.py) | Hugging Face |
| ARC: AI2 Reasoning Challenge | <https://arxiv.org/abs/1803.05457> | [arc.py](arc.py) | Hugging Face |
| GSM8K: Training Verifiers to Solve Math Word Problems | <https://arxiv.org/abs/2110.14168> | [gsm8k.py](gsm8k.py) | Hugging Face |
| HellaSwag: Can a Machine Really Finish Your Sentence? | <https://arxiv.org/abs/1905.07830> | [hellaswag.py](hellaswag.py) | Hugging Face |
| PIQA: Physical Interaction: Question Answering | <https://arxiv.org/abs/1911.11641> | [piqa.py](piqa.py) | Hugging Face |
| BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions | <https://arxiv.org/abs/1905.10044> | [boolq.py](boolq.py) | Hugging Face |

### Git-LFS Datasets

To use these datasets, first download and install [Git-LFS](https://git-lfs.com/). Once you have done this, switch to the repo source directory and run the following commands to sync the data from LFS:

``` bash
$ cd inspect_ai
$ git lfs fetch --all
$ git lfs pull
```

### Hugging Face Datasets

To use these datasets, you first need to install the **datasets** package:

``` bash
$ pip install datasets
```
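As a quick check that the **datasets** package is installed correctly, you can load one of these benchmarks directly. A minimal sketch, using the same `boolq` dataset path and record fields that the eval below relies on:

``` python
# Minimal sketch: confirm the datasets package can fetch a benchmark.
# The "boolq" path and the question/answer fields are the same ones
# boolq.py (below) reads from each record.
from datasets import load_dataset

ds = load_dataset("boolq", split="validation")
record = ds[0]
print(record["question"], "->", "Yes" if record["answer"] else "No")
```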
""" | ||
BoolQ | ||
Exploring the Surprising Difficulty of Natural Yes/No Questions | ||
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, | ||
Kristina Toutanova | ||
https://arxiv.org/abs/1905.10044 | ||
# Run against validations boolq dataset | ||
inspect eval boolq.py | ||
""" | ||
|
||
from inspect_ai import Task, task | ||
from inspect_ai.dataset import Sample, hf_dataset | ||
from inspect_ai.scorer import pattern | ||
from inspect_ai.solver import generate, prompt_template | ||
|
||
TEMPLATE = r""" | ||
Answer the following question with either Yes or No. Include nothing else in your response. | ||
Question: {prompt} | ||
""" | ||
|
||
|
||
def record_to_sample(record): | ||
if record["answer"]: | ||
target = "Yes" | ||
else: | ||
target = "No" | ||
|
||
return Sample(input=record["question"], target=target) | ||
|
||
|
||
@task | ||
def boolq(): | ||
dataset = hf_dataset( | ||
path="boolq", | ||
sample_fields=record_to_sample, | ||
split="validation", | ||
shuffle=True, | ||
) | ||
|
||
return Task( | ||
dataset=dataset, | ||
plan=[prompt_template(template=TEMPLATE), generate()], | ||
scorer=pattern(r"(Yes|No).?\Z"), | ||
) |
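Besides the `inspect eval boolq.py` CLI invocation shown in the docstring, a task like this can also be run from Python. A minimal sketch, assuming inspect_ai's top-level `eval()` function and an illustrative model name:

``` python
# Hedged sketch: run the BoolQ task programmatically rather than via
# the CLI. Assumes inspect_ai exposes a top-level eval() function;
# the model string is illustrative and must be one you have configured.
from inspect_ai import eval

from boolq import boolq

logs = eval(boolq(), model="openai/gpt-4")
```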
""" | ||
PIQA (Physical Interaction: Question Answering) | ||
Reasoning about Physical Commonsense in Natural Language | ||
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi | ||
https://arxiv.org/abs/1911.11641 | ||
# eval piqa validation set | ||
inspect eval piq.py | ||
""" | ||
|
||
from inspect_ai import Task, task | ||
from inspect_ai.dataset import Sample, hf_dataset | ||
from inspect_ai.scorer import answer | ||
from inspect_ai.solver import multiple_choice | ||
|
||
|
||
def record_to_sample(record): | ||
return Sample( | ||
input=record["goal"], | ||
target="A" if record["label"] == 0 else "B", | ||
choices=[record["sol1"], record["sol2"]], | ||
) | ||
|
||
|
||
TEMPLATE = r""" | ||
The entire content of your response should be of the following format: 'ANSWER: | ||
$LETTER' (without quotes) where LETTER is one of {letters}. | ||
Given either a question or a statement followed by two possible solutions | ||
labelled A and B, choose the most appropriate solution. If a question is given, | ||
the solutions answer the question. If a statement is given, the solutions | ||
explain how to achieve the statement. | ||
{question} | ||
{choices} | ||
""".strip() | ||
|
||
|
||
@task | ||
def piqa(): | ||
dataset = hf_dataset( | ||
path="piqa", | ||
sample_fields=record_to_sample, | ||
trust=True, | ||
split="validation", | ||
shuffle=True, | ||
) | ||
|
||
return Task( | ||
dataset=dataset, | ||
plan=[multiple_choice(template=TEMPLATE)], | ||
scorer=answer("letter"), | ||
) |
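To make the A/B mapping in `record_to_sample` concrete, here is a quick illustration on a hand-written record in the PIQA schema (the record itself is invented for the example):

``` python
# Illustrative only: a hand-written record in the PIQA schema
# (goal/sol1/sol2/label) run through record_to_sample from piqa.py.
from piqa import record_to_sample

record = {
    "goal": "How do I soften butter quickly?",
    "sol1": "Cut it into small pieces and leave it at room temperature.",
    "sol2": "Put it in the freezer for an hour.",
    "label": 0,  # 0 selects sol1, so the target letter is "A"
}

sample = record_to_sample(record)
print(sample.input)    # the goal text
print(sample.target)   # "A"
print(sample.choices)  # [sol1, sol2]
```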