improved examples and tutorial pages (#379)
* initial work on examples

* more work on examples

* core questions entered

* contributors

* more contributors

* move record_to_sample to end

* add demonstrates

* more work on examples

* div for example items

* some descriptions

* a bit of work on toc and headings

* add some descriptions

* add intercode

* mathematics

* add tutorial

* fix math thing

* reformat

* update paths

* link to tutorial and update examples

* remove whitespace

* use choice in hellaswag

* additional tutorial content

* improve math dataset tutorial

* intercode example

* add gdm example

* typography

---------

Co-authored-by: aisi-inspect <[email protected]>
jjallaire and aisi-inspect authored Sep 12, 2024
1 parent db843d6 commit 08eb7b6
Showing 18 changed files with 1,322 additions and 967 deletions.
14 changes: 7 additions & 7 deletions benchmarks/README.md
@@ -5,7 +5,7 @@ This directory contains evals for several benchmarks. Datasets for evals are not
| Benchmark | Reference | Code | Dataset |
|-----------------------------|--------------|--------------:|--------------|
| MMLU: Measuring Massive Multitask Language Understanding | <https://arxiv.org/abs/2009.03300> | [mmlu.py](mmlu.py) | Download |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | <https://arxiv.org/abs/2009.03300> | [mmlu_pro.py](mmlu_pro/mmlu_pro.py) | HuggingFace |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | <https://arxiv.org/abs/2406.01574> | [mmlu_pro.py](mmlu_pro/mmlu_pro.py) | HuggingFace |
| MATH: Measuring Mathematical Problem Solving With the MATH Dataset | <https://arxiv.org/abs/2103.03874> | [mathematics.py](mathematics/mathematics.py) | Download |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | <https://arxiv.org/abs/2311.12022> | [gpqa.py](gpqa/gpqa.py) | Download |
| ARC: AI2 Reasoning Challenge | <https://arxiv.org/abs/1803.05457> | [arc.py](arc/arc.py) | Hugging Face |
@@ -14,16 +14,16 @@ This directory contains evals for several benchmarks. Datasets for evals are not
| PIQA: Physical Interaction: Question Answering | <https://arxiv.org/abs/1911.11641> | [piqa.py](piqa.py) | Hugging Face |
| BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions | <https://arxiv.org/abs/1905.10044> | [boolq.py](boolq/boolq.py) | Hugging Face |
| TruthfulQA: Measuring How Models Mimic Human Falsehoods | <https://arxiv.org/abs/2109.07958v2> | [truthfulqa.py](truthfulqa.py) | Hugging Face |
| HumanEval: Evaluating Large Language Models Trained on Code | <https://arxiv.org/pdf/2107.03374> | [humaneval.py](humaneval/humaneval.py) | Hugging Face |
| DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs | <https://arxiv.org/pdf/1903.00161> | [drop.py](drop/drop.py) | Hugging Face |
| WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale | <https://arxiv.org/pdf/1907.10641> | [winogrande.py](winogrande/winogrande.py) | Hugging Face |
| HumanEval: Evaluating Large Language Models Trained on Code | <https://arxiv.org/abs/2107.03374> | [humaneval.py](humaneval/humaneval.py) | Hugging Face |
| DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs | <https://arxiv.org/abs/1903.00161> | [drop.py](drop/drop.py) | Hugging Face |
| WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale | <https://arxiv.org/abs/1907.10641> | [winogrande.py](winogrande/winogrande.py) | Hugging Face |
| RACE-H: A benchmark for testing reading comprehension and reasoning abilities of neural models. | <https://arxiv.org/abs/1704.04683> | [race-h.py](race-h/race-h.py) | Hugging Face |
| MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark. | <https://arxiv.org/abs/2311.16502> | [mmmu.py](mmmu/mmmu.py) | Hugging Face |
| CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | <https://arxiv.org/pdf/1811.00937v2> | [commonsense_qa.py](commonsense_qa/commonsense_qa.py) | Hugging Face |
| CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | <https://arxiv.org/abs/1811.00937> | [commonsense_qa.py](commonsense_qa/commonsense_qa.py) | Hugging Face |
| XSTest: A benchmark for identifying exaggerated safety behaviours in LLMs | <https://arxiv.org/abs/2308.01263> | [xstest.py](xstest/xstest.py) | Hugging Face |
| MathVista: Evaluating Mathematical Reasoning in Visual Contexts | <https://arxiv.org/abs/2310.02255> | [mathvista.py](mathvista/mathvista.py) | Hugging Face |
| SQuAD: A Reading Comprehension Benchmark requiring reasoning over Wikipedia articles | <https://arxiv.org/pdf/1806.03822> | [squad.py](squad/squad.py) | Hugging Face |
| IFEval: Instruction-Following Evaluation for Large Language Models | <https://arxiv.org/pdf/2311.07911> | [ifeval.py](ifeval/ifeval.py) | Hugging Face |
| AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models | <https://arxiv.org/pdf/2304.06364> | [agieval_en.py](agieval/agieval_en.py) | Download |
| IFEval: Instruction-Following Evaluation for Large Language Models | <https://arxiv.org/abs/2311.07911> | [ifeval.py](ifeval/ifeval.py) | Hugging Face |
| AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models | <https://arxiv.org/abs/2304.06364> | [agieval_en.py](agieval/agieval_en.py) | Download |
| PubMedQA: A Dataset for Biomedical Research Question Answering | <https://arxiv.org/abs/1909.06146> | [pubmedqa.py](pubmedqa/pubmedqa.py) | Hugging Face |
| MBPP: Mostly Basic Python Problems | <https://arxiv.org/abs/2108.07732> | [mbpp.py](mbpp/mbpp.py) | Hugging Face |
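For orientation, the sketch below shows one way a task from the table above might be run from Python using the `eval()` function that the updated hellaswag example imports. The import path, model name, and `limit` value are assumptions for illustration, not part of this commit.

```python
# Sketch only: run one of the benchmarks listed above from Python.
# The import path, model name, and limit are illustrative assumptions.
from inspect_ai import eval

from arc.arc import arc_easy  # assumes the benchmarks/ directory is on sys.path

# evaluate a small subset of samples as a quick smoke test
logs = eval(arc_easy(), model="openai/gpt-4", limit=10)
```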
32 changes: 16 additions & 16 deletions benchmarks/arc/arc.py
@@ -18,22 +18,6 @@
from inspect_ai.solver import multiple_choice


def record_to_sample(record):
# read the labels and text
choices = record["choices"]
choices = dict(zip(choices["label"], choices["text"]))

# determine the target then normalize to letter
answerKey = record["answerKey"]
target = list(choices.keys()).index(answerKey)
target = chr(ord("A") + int(target))

# return sample
return Sample(
input=record["question"], choices=list(choices.values()), target=target
)


def arc_task(dataset_name):
return Task(
dataset=hf_dataset(
@@ -55,3 +39,19 @@ def arc_easy():
@task
def arc_challenge():
return arc_task("ARC-Challenge")


def record_to_sample(record):
# read the labels and text
choices = record["choices"]
choices = dict(zip(choices["label"], choices["text"]))

# determine the target then normalize to letter
answerKey = record["answerKey"]
target = list(choices.keys()).index(answerKey)
target = chr(ord("A") + int(target))

# return sample
return Sample(
input=record["question"], choices=list(choices.values()), target=target
)
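The `hf_dataset(...)` call inside `arc_task()` is cut off by the hunk boundary above; the sketch below shows how `record_to_sample` is typically plugged in via `sample_fields`. The dataset path, subset name, and split are assumptions, not taken from this diff.

```python
# Sketch of the dataset wiring that the hunk above truncates; path, name,
# and split are assumptions based on the other benchmarks in this commit.
from inspect_ai.dataset import hf_dataset

dataset = hf_dataset(
    path="allenai/ai2_arc",          # assumed Hugging Face dataset id
    name="ARC-Easy",                 # assumed subset passed in by arc_task()
    split="test",                    # assumed split
    sample_fields=record_to_sample,  # maps each record to a Sample (defined above)
)
```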
18 changes: 9 additions & 9 deletions benchmarks/boolq/boolq.py
@@ -22,15 +22,6 @@
"""


def record_to_sample(record):
if record["answer"]:
target = "Yes"
else:
target = "No"

return Sample(input=record["question"], target=target)


@task
def boolq():
dataset = hf_dataset(
@@ -45,3 +36,12 @@ def boolq():
plan=[prompt_template(template=TEMPLATE), generate()],
scorer=pattern(r"(Yes|No).?\Z"),
)


def record_to_sample(record):
if record["answer"]:
target = "Yes"
else:
target = "No"

return Sample(input=record["question"], target=target)
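The `pattern(r"(Yes|No).?\Z")` scorer above expects the completion to end with "Yes" or "No", with at most one trailing character such as a period. The standalone regex check below illustrates what the expression matches; how the scorer compares the captured group to the target is not visible in this hunk and is not reproduced here.

```python
import re

# The same regular expression passed to pattern() above: a trailing "Yes"
# or "No", optionally followed by a single character (e.g. a period).
ANSWER_RE = r"(Yes|No).?\Z"

for completion in ["The passage supports this. Yes.", "No", "Maybe"]:
    m = re.search(ANSWER_RE, completion)
    print(completion, "->", m.group(1) if m else None)
# prints "Yes", "No", and None for the three completions
```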
32 changes: 16 additions & 16 deletions benchmarks/gpqa/gpqa.py
@@ -27,22 +27,6 @@
DEFAULT_EPOCHS = 4


# map records to inspect samples (note that target is always "A" in the
# dataset; we will shuffle the presentation of options to mitigate this)
def record_to_sample(record):
return Sample(
input=record["Question"],
choices=[
str(record["Correct Answer"]),
str(record["Incorrect Answer 1"]),
str(record["Incorrect Answer 2"]),
str(record["Incorrect Answer 3"]),
],
target="A",
id=record["Record ID"],
)


@task
def gpqa_diamond():
return Task(
@@ -57,3 +41,19 @@ def gpqa_diamond():
config=GenerateConfig(temperature=0.5),
epochs=DEFAULT_EPOCHS,
)


# map records to inspect samples (note that target is always "A" in the
# dataset; we will shuffle the presentation of options to mitigate this)
def record_to_sample(record):
return Sample(
input=record["Question"],
choices=[
str(record["Correct Answer"]),
str(record["Incorrect Answer 1"]),
str(record["Incorrect Answer 2"]),
str(record["Incorrect Answer 3"]),
],
target="A",
id=record["Record ID"],
)
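The comment above notes that the dataset always lists the correct answer first, so the task relies on shuffling the presented options. Below is a minimal sketch of how that shuffling might be requested, assuming the `shuffle=True` parameter that `multiple_choice()` is given in the truthfulqa.py hunk further down; whether `gpqa_diamond()` passes it the same way is not visible in this hunk.

```python
# Sketch only: ask the multiple_choice solver to shuffle the presented
# options so the correct answer is not always shown as "A". The shuffle
# parameter mirrors the multiple_choice(...) call in benchmarks/truthfulqa.py.
from inspect_ai.solver import multiple_choice

plan = [multiple_choice(shuffle=True)]
```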
35 changes: 17 additions & 18 deletions benchmarks/gsm8k.py
@@ -17,24 +17,6 @@
from inspect_ai.scorer import match
from inspect_ai.solver import generate, prompt_template, system_message


def record_to_sample(record):
DELIM = "####"
input = record["question"]
answer = record["answer"].split(DELIM)
target = answer.pop().strip()
reasoning = DELIM.join(answer)
return Sample(input=input, target=target, metadata={"reasoning": reasoning.strip()})


def sample_to_fewshot(sample):
return (
f"{sample.input}\n\nReasoning:\n"
+ f"{sample.metadata['reasoning']}\n\n"
+ f"ANSWER: {sample.target}"
)


# setup for problem + instructions for providing answer
MATH_PROMPT_TEMPLATE = """
Solve the following math problem step by step. The last line of your response should be of the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem.
@@ -79,3 +61,20 @@ def gsm8k(fewshot=10, fewshot_seed=42):
plan=plan,
scorer=match(numeric=True),
)


def record_to_sample(record):
DELIM = "####"
input = record["question"]
answer = record["answer"].split(DELIM)
target = answer.pop().strip()
reasoning = DELIM.join(answer)
return Sample(input=input, target=target, metadata={"reasoning": reasoning.strip()})


def sample_to_fewshot(sample):
return (
f"{sample.input}\n\nReasoning:\n"
+ f"{sample.metadata['reasoning']}\n\n"
+ f"ANSWER: {sample.target}"
)
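As a worked example of the `####` delimiter handling in `record_to_sample` above, the snippet below applies the same parsing steps to a made-up GSM8K-style record; the record text is illustrative, not taken from the dataset.

```python
# Illustrative record in the GSM8K answer format: worked reasoning,
# then "####", then the final numeric answer.
record = {
    "question": "Ali has 3 boxes of 4 apples. How many apples in total?",
    "answer": "3 boxes * 4 apples = <<3*4=12>>12 apples\n#### 12",
}

DELIM = "####"
answer = record["answer"].split(DELIM)
target = answer.pop().strip()           # "12"
reasoning = DELIM.join(answer).strip()  # the reasoning text before "####"

print(target)     # 12
print(reasoning)  # 3 boxes * 4 apples = <<3*4=12>>12 apples
```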
18 changes: 9 additions & 9 deletions benchmarks/hellaswag.py
@@ -15,15 +15,6 @@
"""


def record_to_sample(record):
return Sample(
input=record["ctx"],
target=chr(ord("A") + int(record["label"])),
choices=record["endings"],
metadata=dict(source_id=record["source_id"]),
)


@task
def hellaswag():
# dataset
@@ -41,3 +32,12 @@ def hellaswag():
plan=[system_message(SYSTEM_MESSAGE), multiple_choice()],
scorer=choice(),
)


def record_to_sample(record):
return Sample(
input=record["ctx"],
target=chr(ord("A") + int(record["label"])),
choices=record["endings"],
metadata=dict(source_id=record["source_id"]),
)
17 changes: 8 additions & 9 deletions benchmarks/piqa.py
@@ -14,15 +14,6 @@
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


def record_to_sample(record):
return Sample(
input=record["goal"],
target="A" if record["label"] == 0 else "B",
choices=[record["sol1"], record["sol2"]],
)


TEMPLATE = r"""
The entire content of your response should be of the following format: 'ANSWER:
$LETTER' (without quotes) where LETTER is one of {letters}.
@@ -53,3 +44,11 @@ def piqa():
plan=[multiple_choice(template=TEMPLATE)],
scorer=choice(),
)


def record_to_sample(record):
return Sample(
input=record["goal"],
target="A" if record["label"] == 0 else "B",
choices=[record["sol1"], record["sol2"]],
)
22 changes: 11 additions & 11 deletions benchmarks/truthfulqa.py
@@ -17,17 +17,6 @@
from inspect_ai.solver import multiple_choice


# The dataset uses a binary list for each target, where 1 indicates an answer is
# correct and 0 is incorrect. For example, if there are three options and the
# second is correct, the target would be [0, 1, 0].
#
# This function converts that to a list of letters corresponding to the correct
# answers, which allows us to use the `choice("letter")` scorer.
# e.g. [0, 1, 1] -> ["B", "C"]
def labels_to_positions(labels: list[int]) -> list[str]:
return [chr(ord("A") + i) for i, label in enumerate(labels) if label == 1]


@task
def truthfulqa(target="mc1"):
def record_to_sample(record):
@@ -58,3 +47,14 @@ def record_to_sample(record):
plan=[multiple_choice(multiple_correct=multiple_correct, shuffle=True)],
scorer=choice(),
)


# The dataset uses a binary list for each target, where 1 indicates an answer is
# correct and 0 is incorrect. For example, if there are three options and the
# second is correct, the target would be [0, 1, 0].
#
# This function converts that to a list of letters corresponding to the correct
# answers, which allows us to use the `choice("letter")` scorer.
# e.g. [0, 1, 1] -> ["B", "C"]
def labels_to_positions(labels: list[int]) -> list[str]:
return [chr(ord("A") + i) for i, label in enumerate(labels) if label == 1]
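A quick, self-contained check of `labels_to_positions` against the example given in the comment above:

```python
# Mirrors the function above: a 1 marks a correct option, and the
# function returns the corresponding letters.
def labels_to_positions(labels: list[int]) -> list[str]:
    return [chr(ord("A") + i) for i, label in enumerate(labels) if label == 1]

assert labels_to_positions([0, 1, 0]) == ["B"]
assert labels_to_positions([0, 1, 1]) == ["B", "C"]
```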
4 changes: 2 additions & 2 deletions docs/_examples/hellaswag.qmd
@@ -29,7 +29,7 @@ https://arxiv.org/abs/1905.07830
```{python}
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import answer
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice, system_message
SYSTEM_MESSAGE = """
@@ -73,7 +73,7 @@ def hellaswag():
system_message(SYSTEM_MESSAGE),
multiple_choice()
],
scorer=answer("letter"),
scorer=choice(),
)
```

9 changes: 5 additions & 4 deletions docs/_quarto.yml
@@ -16,15 +16,15 @@ book:
twitter-card:
title: "Inspect"
description: "Open-source framework for large language model evaluations"
image: images/inspect.png
image: /images/inspect.png
card-style: summary_large_image
open-graph:
title: "Inspect"
description: "Open-source framework for large language model evaluations"
image: images/inspect.png
image: /images/inspect.png
sidebar:
header: >
[![](images/aisi-logo.png){fig-alt="UK AI Safety Institute Website"}](https://www.gov.uk/government/organisations/ai-safety-institute)
[![](/images/aisi-logo.png){fig-alt="UK AI Safety Institute Website"}](https://www.gov.uk/government/organisations/ai-safety-institute)
page-footer:
left:
@@ -52,11 +52,12 @@ book:
- "index.qmd"
- part: "Basics"
chapters:
- tutorial.qmd
- workflow.qmd
- examples/index.qmd
- log-viewer.qmd
- text: "VS Code"
href: vscode.qmd
- examples.qmd

- part: "Components"
chapters:
