improved examples and tutorial pages (#379)
* initial work on examples

* more work on examples

* core questions entered

* contributors

* more contributors

* move record_to_sample to end

* add demonstrates

* more work on examples

* div for example items

* some descriptions

* a bit of work on toc and headings

* add some descriptions

* add intercode

* mathematics

* add tutorial

* fix math thing

* reformat

* update paths

* link to tutorial and update examples

* remove whitespace

* use choice in hellaswag

* additional tutorial content

* improve math dataset tutorial

* intercode example

* add gdm example

* typography

---------

Co-authored-by: aisi-inspect <[email protected]>
jjallaire and aisi-inspect authored Sep 12, 2024
1 parent db843d6 commit 08eb7b6
Showing 18 changed files with 1,322 additions and 967 deletions.
14 changes: 7 additions & 7 deletions benchmarks/README.md
@@ -5,7 +5,7 @@ This directory contains evals for several benchmarks. Datasets for evals are not
| Benchmark | Reference | Code | Dataset |
|-----------------------------|--------------|--------------:|--------------|
| MMLU: Measuring Massive Multitask Language Understanding | <https://arxiv.org/abs/2009.03300> | [mmlu.py](mmlu.py) | Download |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | <https://arxiv.org/abs/2009.03300> | [mmlu_pro.py](mmlu_pro/mmlu_pro.py) | HuggingFace |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | <https://arxiv.org/abs/2406.01574> | [mmlu_pro.py](mmlu_pro/mmlu_pro.py) | HuggingFace |
| MATH: Measuring Mathematical Problem Solving With the MATH Dataset | <https://arxiv.org/abs/2103.03874> | [mathematics.py](mathematics/mathematics.py) | Download |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | <https://arxiv.org/abs/2311.12022> | [gpqa.py](gpqa/gpqa.py) | Download |
| ARC: AI2 Reasoning Challenge | <https://arxiv.org/abs/1803.05457> | [arc.py](arc/arc.py) | Hugging Face |
@@ -14,16 +14,16 @@ This directory contains evals for several benchmarks. Datasets for evals are not
| PIQA: Physical Interaction: Question Answering | <https://arxiv.org/abs/1911.11641> | [piqa.py](piqa.py) | Hugging Face |
| BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions | <https://arxiv.org/abs/1905.10044> | [boolq.py](boolq/boolq.py) | Hugging Face |
| TruthfulQA: Measuring How Models Mimic Human Falsehoods | <https://arxiv.org/abs/2109.07958v2> | [truthfulqa.py](truthfulqa.py) | Hugging Face |
| HumanEval: Evaluating Large Language Models Trained on Code | <https://arxiv.org/pdf/2107.03374> | [humaneval.py](humaneval/humaneval.py) | Hugging Face |
| DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs | <https://arxiv.org/pdf/1903.00161> | [drop.py](drop/drop.py) | Hugging Face |
| WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale | <https://arxiv.org/pdf/1907.10641> | [winogrande.py](winogrande/winogrande.py) | Hugging Face |
| HumanEval: Evaluating Large Language Models Trained on Code | <https://arxiv.org/abs/2107.03374> | [humaneval.py](humaneval/humaneval.py) | Hugging Face |
| DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs | <https://arxiv.org/abs/1903.00161> | [drop.py](drop/drop.py) | Hugging Face |
| WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale | <https://arxiv.org/abs/1907.10641> | [winogrande.py](winogrande/winogrande.py) | Hugging Face |
| RACE-H: A benchmark for testing reading comprehension and reasoning abilities of neural models. | <https://arxiv.org/abs/1704.04683> | [race-h.py](race-h/race-h.py) | Hugging Face |
| MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark. | <https://arxiv.org/abs/2311.16502> | [mmmu.py](mmmu/mmmu.py) | Hugging Face |
| CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | <https://arxiv.org/pdf/1811.00937v2> | [commonsense_qa.py](commonsense_qa/commonsense_qa.py) | Hugging Face |
| CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | <https://arxiv.org/abs/1811.00937> | [commonsense_qa.py](commonsense_qa/commonsense_qa.py) | Hugging Face |
| XSTest: A benchmark for identifying exaggerated safety behaviours in LLMs | <https://arxiv.org/abs/2308.01263> | [xstest.py](xstest/xstest.py) | Hugging Face |
| MathVista: Evaluating Mathematical Reasoning in Visual Contexts | <https://arxiv.org/abs/2310.02255> | [mathvista.py](mathvista/mathvista.py) | Hugging Face |
| SQuAD: A Reading Comprehension Benchmark requiring reasoning over Wikipedia articles | <https://arxiv.org/pdf/1806.03822> | [squad.py](squad/squad.py) | Hugging Face |
| IFEval: Instruction-Following Evaluation for Large Language Models | <https://arxiv.org/pdf/2311.07911> | [ifeval.py](ifeval/ifeval.py) | Hugging Face |
| AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models | <https://arxiv.org/pdf/2304.06364> | [agieval_en.py](agieval/agieval_en.py) | Download |
| IFEval: Instruction-Following Evaluation for Large Language Models | <https://arxiv.org/abs/2311.07911> | [ifeval.py](ifeval/ifeval.py) | Hugging Face |
| AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models | <https://arxiv.org/abs/2304.06364> | [agieval_en.py](agieval/agieval_en.py) | Download |
| PubMedQA: A Dataset for Biomedical Research Question Answering | <https://arxiv.org/abs/1909.06146> | [pubmedqa.py](pubmedqa/pubmedqa.py) | Hugging Face |
| MBPP: Mostly Basic Python Problems | <https://arxiv.org/abs/2108.07732> | [mbpp.py](mbpp/mbpp.py) | Hugging Face |
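For orientation, the sketch below shows one way a task from the table above might be run from Python using the `eval()` function that the updated hellaswag example imports. The import path, model name, and `limit` value are assumptions for illustration, not part of this commit.

```python
# Sketch only: run one of the benchmarks listed above from Python.
# The import path, model name, and limit are illustrative assumptions.
from inspect_ai import eval

from arc.arc import arc_easy  # assumes the benchmarks/ directory is on sys.path

# evaluate a small subset of samples as a quick smoke test
logs = eval(arc_easy(), model="openai/gpt-4", limit=10)
```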
32 changes: 16 additions & 16 deletions benchmarks/arc/arc.py
@@ -18,22 +18,6 @@
from inspect_ai.solver import multiple_choice


def record_to_sample(record):
# read the labels and text
choices = record["choices"]
choices = dict(zip(choices["label"], choices["text"]))

# determine the target then normalize to letter
answerKey = record["answerKey"]
target = list(choices.keys()).index(answerKey)
target = chr(ord("A") + int(target))

# return sample
return Sample(
input=record["question"], choices=list(choices.values()), target=target
)


def arc_task(dataset_name):
return Task(
dataset=hf_dataset(
@@ -55,3 +39,19 @@ def arc_easy():
@task
def arc_challenge():
return arc_task("ARC-Challenge")


def record_to_sample(record):
# read the labels and text
choices = record["choices"]
choices = dict(zip(choices["label"], choices["text"]))

# determine the target then normalize to letter
answerKey = record["answerKey"]
target = list(choices.keys()).index(answerKey)
target = chr(ord("A") + int(target))

# return sample
return Sample(
input=record["question"], choices=list(choices.values()), target=target
)
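The `hf_dataset(...)` call inside `arc_task()` is cut off by the hunk boundary above; the sketch below shows how `record_to_sample` is typically plugged in via `sample_fields`. The dataset path, subset name, and split are assumptions, not taken from this diff.

```python
# Sketch of the dataset wiring that the hunk above truncates; path, name,
# and split are assumptions based on the other benchmarks in this commit.
from inspect_ai.dataset import hf_dataset

dataset = hf_dataset(
    path="allenai/ai2_arc",          # assumed Hugging Face dataset id
    name="ARC-Easy",                 # assumed subset passed in by arc_task()
    split="test",                    # assumed split
    sample_fields=record_to_sample,  # maps each record to a Sample (defined above)
)
```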
18 changes: 9 additions & 9 deletions benchmarks/boolq/boolq.py
@@ -22,15 +22,6 @@
"""


def record_to_sample(record):
if record["answer"]:
target = "Yes"
else:
target = "No"

return Sample(input=record["question"], target=target)


@task
def boolq():
dataset = hf_dataset(
@@ -45,3 +36,12 @@ def boolq():
plan=[prompt_template(template=TEMPLATE), generate()],
scorer=pattern(r"(Yes|No).?\Z"),
)


def record_to_sample(record):
if record["answer"]:
target = "Yes"
else:
target = "No"

return Sample(input=record["question"], target=target)
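The `pattern(r"(Yes|No).?\Z")` scorer above expects the completion to end with "Yes" or "No", with at most one trailing character such as a period. The standalone regex check below illustrates what the expression matches; how the scorer compares the captured group to the target is not visible in this hunk and is not reproduced here.

```python
import re

# The same regular expression passed to pattern() above: a trailing "Yes"
# or "No", optionally followed by a single character (e.g. a period).
ANSWER_RE = r"(Yes|No).?\Z"

for completion in ["The passage supports this. Yes.", "No", "Maybe"]:
    m = re.search(ANSWER_RE, completion)
    print(completion, "->", m.group(1) if m else None)
# prints "Yes", "No", and None for the three completions
```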
32 changes: 16 additions & 16 deletions benchmarks/gpqa/gpqa.py
@@ -27,22 +27,6 @@
DEFAULT_EPOCHS = 4


# map records to inspect samples (note that target is always "A" in the
# dataset; we will shuffle the presentation of options to mitigate this)
def record_to_sample(record):
return Sample(
input=record["Question"],
choices=[
str(record["Correct Answer"]),
str(record["Incorrect Answer 1"]),
str(record["Incorrect Answer 2"]),
str(record["Incorrect Answer 3"]),
],
target="A",
id=record["Record ID"],
)


@task
def gpqa_diamond():
return Task(
@@ -57,3 +41,19 @@ def gpqa_diamond():
config=GenerateConfig(temperature=0.5),
epochs=DEFAULT_EPOCHS,
)


# map records to inspect samples (note that target is always "A" in the
# dataset; we will shuffle the presentation of options to mitigate this)
def record_to_sample(record):
return Sample(
input=record["Question"],
choices=[
str(record["Correct Answer"]),
str(record["Incorrect Answer 1"]),
str(record["Incorrect Answer 2"]),
str(record["Incorrect Answer 3"]),
],
target="A",
id=record["Record ID"],
)
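The comment above notes that the dataset always lists the correct answer first, so the task relies on shuffling the presented options. Below is a minimal sketch of how that shuffling might be requested, assuming the `shuffle=True` parameter that `multiple_choice()` is given in the truthfulqa.py hunk further down; whether `gpqa_diamond()` passes it the same way is not visible in this hunk.

```python
# Sketch only: ask the multiple_choice solver to shuffle the presented
# options so the correct answer is not always shown as "A". The shuffle
# parameter mirrors the multiple_choice(...) call in benchmarks/truthfulqa.py.
from inspect_ai.solver import multiple_choice

plan = [multiple_choice(shuffle=True)]
```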
35 changes: 17 additions & 18 deletions benchmarks/gsm8k.py
@@ -17,24 +17,6 @@
from inspect_ai.scorer import match
from inspect_ai.solver import generate, prompt_template, system_message


def record_to_sample(record):
DELIM = "####"
input = record["question"]
answer = record["answer"].split(DELIM)
target = answer.pop().strip()
reasoning = DELIM.join(answer)
return Sample(input=input, target=target, metadata={"reasoning": reasoning.strip()})


def sample_to_fewshot(sample):
return (
f"{sample.input}\n\nReasoning:\n"
+ f"{sample.metadata['reasoning']}\n\n"
+ f"ANSWER: {sample.target}"
)


# setup for problem + instructions for providing answer
MATH_PROMPT_TEMPLATE = """
Solve the following math problem step by step. The last line of your response should be of the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem.
@@ -79,3 +61,20 @@ def gsm8k(fewshot=10, fewshot_seed=42):
plan=plan,
scorer=match(numeric=True),
)


def record_to_sample(record):
DELIM = "####"
input = record["question"]
answer = record["answer"].split(DELIM)
target = answer.pop().strip()
reasoning = DELIM.join(answer)
return Sample(input=input, target=target, metadata={"reasoning": reasoning.strip()})


def sample_to_fewshot(sample):
return (
f"{sample.input}\n\nReasoning:\n"
+ f"{sample.metadata['reasoning']}\n\n"
+ f"ANSWER: {sample.target}"
)
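As a worked example of the `####` delimiter handling in `record_to_sample` above, the snippet below applies the same parsing steps to a made-up GSM8K-style record; the record text is illustrative, not taken from the dataset.

```python
# Illustrative record in the GSM8K answer format: worked reasoning,
# then "####", then the final numeric answer.
record = {
    "question": "Ali has 3 boxes of 4 apples. How many apples in total?",
    "answer": "3 boxes * 4 apples = <<3*4=12>>12 apples\n#### 12",
}

DELIM = "####"
answer = record["answer"].split(DELIM)
target = answer.pop().strip()           # "12"
reasoning = DELIM.join(answer).strip()  # the reasoning text before "####"

print(target)     # 12
print(reasoning)  # 3 boxes * 4 apples = <<3*4=12>>12 apples
```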
18 changes: 9 additions & 9 deletions benchmarks/hellaswag.py
@@ -15,15 +15,6 @@
"""


def record_to_sample(record):
return Sample(
input=record["ctx"],
target=chr(ord("A") + int(record["label"])),
choices=record["endings"],
metadata=dict(source_id=record["source_id"]),
)


@task
def hellaswag():
# dataset
@@ -41,3 +32,12 @@ def hellaswag():
plan=[system_message(SYSTEM_MESSAGE), multiple_choice()],
scorer=choice(),
)


def record_to_sample(record):
return Sample(
input=record["ctx"],
target=chr(ord("A") + int(record["label"])),
choices=record["endings"],
metadata=dict(source_id=record["source_id"]),
)
17 changes: 8 additions & 9 deletions benchmarks/piqa.py
@@ -14,15 +14,6 @@
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


def record_to_sample(record):
return Sample(
input=record["goal"],
target="A" if record["label"] == 0 else "B",
choices=[record["sol1"], record["sol2"]],
)


TEMPLATE = r"""
The entire content of your response should be of the following format: 'ANSWER:
$LETTER' (without quotes) where LETTER is one of {letters}.
@@ -53,3 +44,11 @@ def piqa():
plan=[multiple_choice(template=TEMPLATE)],
scorer=choice(),
)


def record_to_sample(record):
return Sample(
input=record["goal"],
target="A" if record["label"] == 0 else "B",
choices=[record["sol1"], record["sol2"]],
)
22 changes: 11 additions & 11 deletions benchmarks/truthfulqa.py
@@ -17,17 +17,6 @@
from inspect_ai.solver import multiple_choice


# The dataset uses a binary list for each target, where 1 indicates an answer is
# correct and 0 is incorrect. For example, if there are three options and the
# second is correct, the target would be [0, 1, 0].
#
# This function converts that to a list of letters corresponding to the correct
# answers, which allows us to use the `choice("letter")` scorer.
# e.g. [0, 1, 1] -> ["B", "C"]
def labels_to_positions(labels: list[int]) -> list[str]:
return [chr(ord("A") + i) for i, label in enumerate(labels) if label == 1]


@task
def truthfulqa(target="mc1"):
def record_to_sample(record):
@@ -58,3 +47,14 @@ def record_to_sample(record):
plan=[multiple_choice(multiple_correct=multiple_correct, shuffle=True)],
scorer=choice(),
)


# The dataset uses a binary list for each target, where 1 indicates an answer is
# correct and 0 is incorrect. For example, if there are three options and the
# second is correct, the target would be [0, 1, 0].
#
# This function converts that to a list of letters corresponding to the correct
# answers, which allows us to use the `choice("letter")` scorer.
# e.g. [0, 1, 1] -> ["B", "C"]
def labels_to_positions(labels: list[int]) -> list[str]:
return [chr(ord("A") + i) for i, label in enumerate(labels) if label == 1]
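A quick, self-contained check of `labels_to_positions` against the example given in the comment above:

```python
# Mirrors the function above: a 1 marks a correct option, and the
# function returns the corresponding letters.
def labels_to_positions(labels: list[int]) -> list[str]:
    return [chr(ord("A") + i) for i, label in enumerate(labels) if label == 1]

assert labels_to_positions([0, 1, 0]) == ["B"]
assert labels_to_positions([0, 1, 1]) == ["B", "C"]
```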
4 changes: 2 additions & 2 deletions docs/_examples/hellaswag.qmd
@@ -29,7 +29,7 @@ https://arxiv.org/abs/1905.07830
```{python}
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import answer
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice, system_message
SYSTEM_MESSAGE = """
@@ -73,7 +73,7 @@ def hellaswag():
system_message(SYSTEM_MESSAGE),
multiple_choice()
],
scorer=answer("letter"),
scorer=choice(),
)
```

9 changes: 5 additions & 4 deletions docs/_quarto.yml
@@ -16,15 +16,15 @@ book:
twitter-card:
title: "Inspect"
description: "Open-source framework for large language model evaluations"
image: images/inspect.png
image: /images/inspect.png
card-style: summary_large_image
open-graph:
title: "Inspect"
description: "Open-source framework for large language model evaluations"
image: images/inspect.png
image: /images/inspect.png
sidebar:
header: >
[![](images/aisi-logo.png){fig-alt="UK AI Safety Institute Website"}](https://www.gov.uk/government/organisations/ai-safety-institute)
[![](/images/aisi-logo.png){fig-alt="UK AI Safety Institute Website"}](https://www.gov.uk/government/organisations/ai-safety-institute)
page-footer:
left:
@@ -52,11 +52,12 @@ book:
- "index.qmd"
- part: "Basics"
chapters:
- tutorial.qmd
- workflow.qmd
- examples/index.qmd
- log-viewer.qmd
- text: "VS Code"
href: vscode.qmd
- examples.qmd

- part: "Components"
chapters:
