release v0.3.10
aisi-inspect committed May 29, 2024
1 parent 671a239 commit 140c3d6
Showing 121 changed files with 3,514 additions and 1,811 deletions.
23 changes: 21 additions & 2 deletions .github/workflows/build.yml
@@ -20,7 +20,26 @@ jobs:
- name: Lint and format with Ruff
uses: chartboost/ruff-action@v1

build:
mypy:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10", "3.11"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install .[dev]
- name: Run mypy
run: |
mypy --exclude tests/test_package src tests
test:
runs-on: ubuntu-latest
strategy:
matrix:
@@ -38,7 +57,7 @@ jobs:
python -m pip install .[dev]
- name: Test with pytest
run: |
pytest -rA -x --doctest-modules --color=yes --cov=inspect_ai
pytest -rA --doctest-modules --color=yes --cov=inspect_ai
package:
name: Build & inspect the package.
8 changes: 7 additions & 1 deletion .github/workflows/vscode.yml
@@ -2,14 +2,20 @@ on:
push:
tags:
- "v[0-9]*"
paths:
- "tools/vscode/**"
- ".github/workflows/vscode.yml"
branches:
- "main"
pull_request:
branches:
- "main"
paths:
- "tools/vscode/**"
- ".github/workflows/vscode.yml"
workflow_dispatch:

name: Deploy Extension
name: Build VS Code Ext
jobs:
deploy:
runs-on: ubuntu-latest
21 changes: 20 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,24 @@
# Changelog

## v0.3.10 (29 May 2024)

- **BREAKING:** The `pattern` scorer has been modified to match against any (or all) regex match groups. This replaces the previous behaviour, which, when there was more than one group, matched only the second group.
- Improved performance for Inspect View on very large datasets (virtualized sample list).
- ToolChoice `any` option to indicate the model should use at least one tool (supported by Anthropic and Mistral, mapped to `auto` for OpenAI).
- Tool calls can now return a simple scalar or `list[ContentText | ContentImage]`
- Support for updated Anthropic tools beta (tool_choice and image tool results)
- Report tool_error back to the model if it provides invalid JSON for tool call arguments (formerly this halted the entire eval with an error).
- New `max_samples` option to control how many samples are run in parallel (still defaults to running all samples in parallel).
- Add `boolq.py` benchmark.
- Add `piqa.py` benchmark.
- View: Improved markdown rendering (properly escape reference links)
- Improved typing for example_dataset function.
- Setuptools entry point for loading custom model extensions.
- Break optional `tuple` return out of `ToolResult` type.
- Bugfix: always read original sample message(s) for `TaskState.input_text`.
- Bugfix: remove write counter from log (could have resulted in incomplete/invalid logs propagating to the viewer)
- Bugfix: handle task names that include spaces in log viewer.

## v0.3.9 (14 May 2024)

- Add `ollama` local model provider.
@@ -49,7 +68,7 @@
- `Score` now has an optional `answer` field, which denotes the answer text extracted from model output.
- Accuracy metrics now take an optional `ValueToFloat` function for customizing how textual values are mapped to float.
- Made `model_graded_qa` more flexible with separate `instruction` template and `grade_pattern`, as well as providing `partial_credit` as an option.
- Modify the default templates for `chain_of_thought()` and `self_critique()` to instruct the model to reply with `ANSWER: $ANSWER` at the end on its own line.
- Modify the default templates for `chain_of_thought()` and `self_critique()` to instruct the model to reply with `ANSWER: $ANSWER` at the end on its own line.
- Improved numeric extraction for `match(numeric=True)` (better currency and decimal handling).
- Improve `answer()` patterns so that they detect letter and word answers both within and at the end of model output.
- `Plan` now has an optional `cleanup` function which can be used to free per-sample resources (e.g. Docker containers) even in the case of an evaluation error.
39 changes: 23 additions & 16 deletions benchmarks/README.md
@@ -1,26 +1,33 @@
## Benchmarks

This directory contains evals for several benchmarks, including:
This directory contains evals for several benchmarks. Datasets for evals are not embedded in the repository but are rather stored using either Git-LFS or Hugging Face (see below for instructions on the additional requirements for reading datasets from these sources).

| Benchmark | Reference | Code |
|------------------------|------------------------|-----------------------:|
| MMLU: Measuring Massive Multitask Language Understanding | <https://arxiv.org/abs/2009.03300> | [mmlu.py](mmlu.py) |
| MATH: Measuring Mathematical Problem Solving With the MATH Dataset | <https://arxiv.org/abs/2103.03874> | [mathematics.py](mathematics.py) |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | <https://arxiv.org/abs/2311.12022> | [gpqa.py](gpqa.py) |
| ARC: AI2 Reasoning Challenge | <https://arxiv.org/abs/1803.05457> | [arc.py](arc.py) |
| GSM8K: Training Verifiers to Solve Math Word Problems | <https://arxiv.org/abs/2110.14168> | [gsm8k.py](gsm8k.py) |
| HellaSwag: Can a Machine Really Finish Your Sentence? | <https://arxiv.org/abs/1905.07830> | [hellaswag.py](hellaswag.py) |
| Benchmark | Reference | Code | Dataset |
|------------------------------------------------------------------------|------------------------------------|---------------------------------:|--------------|
| MMLU: Measuring Massive Multitask Language Understanding | <https://arxiv.org/abs/2009.03300> | [mmlu.py](mmlu.py) | Git-LFS |
| MATH: Measuring Mathematical Problem Solving With the MATH Dataset | <https://arxiv.org/abs/2103.03874> | [mathematics.py](mathematics.py) | Git-LFS |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | <https://arxiv.org/abs/2311.12022> | [gpqa.py](gpqa.py) | Hugging Face |
| ARC: AI2 Reasoning Challenge | <https://arxiv.org/abs/1803.05457> | [arc.py](arc.py) | Hugging Face |
| GSM8K: Training Verifiers to Solve Math Word Problems | <https://arxiv.org/abs/2110.14168> | [gsm8k.py](gsm8k.py) | Hugging Face |
| HellaSwag: Can a Machine Really Finish Your Sentence? | <https://arxiv.org/abs/1905.07830> | [hellaswag.py](hellaswag.py) | Hugging Face |
| PIQA: Physical Interaction: Question Answering | <https://arxiv.org/abs/1911.11641> | [piqa.py](piqa.py) | Hugging Face |
| BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions | <https://arxiv.org/abs/1905.10044> | [boolq.py](boolq.py) | Hugging Face |

The datasets for ARC, GSM8K, and HellaSwag are read from Hugging Face, so require the installation of the **datasets** package:
### Git-LFS Datasets

``` bash
$ pip install datasets
```

The datasets for MMLU and MATH are stored using [Git-LFS](https://git-lfs.com/). Once you have downloaded and installed LFS, switch to the repo source directory and run the following commands to sync the data from LFS:
To use these datasets, first download and install [Git-LFS](https://git-lfs.com/). Once you have done this, switch to the repo source directory and run the following commands to sync the data from LFS:

``` bash
$ cd inspect_ai
$ git lfs fetch --all
$ git lfs pull
```
```

### Hugging Face Datasets

To use these datasets you need to install the **datasets** package as follows:

``` bash
$ pip install datasets
```
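
Once **datasets** is installed, benchmarks read from Hugging Face via Inspect's `hf_dataset()` function. As a rough sketch (mirroring the `boolq.py` benchmark below):

``` python
from inspect_ai.dataset import Sample, hf_dataset

# read the validation split of boolq, mapping each record to a Sample
dataset = hf_dataset(
    path="boolq",
    split="validation",
    sample_fields=lambda record: Sample(
        input=record["question"],
        target="Yes" if record["answer"] else "No",
    ),
)
```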

47 changes: 47 additions & 0 deletions benchmarks/boolq.py
@@ -0,0 +1,47 @@
"""
BoolQ
Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins,
Kristina Toutanova
https://arxiv.org/abs/1905.10044
# Run against the boolq validation dataset
inspect eval boolq.py
"""

from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import pattern
from inspect_ai.solver import generate, prompt_template

TEMPLATE = r"""
Answer the following question with either Yes or No. Include nothing else in your response.
Question: {prompt}
"""


def record_to_sample(record):
if record["answer"]:
target = "Yes"
else:
target = "No"

return Sample(input=record["question"], target=target)


@task
def boolq():
dataset = hf_dataset(
path="boolq",
sample_fields=record_to_sample,
split="validation",
shuffle=True,
)

return Task(
dataset=dataset,
plan=[prompt_template(template=TEMPLATE), generate()],
scorer=pattern(r"(Yes|No).?\Z"),
)
55 changes: 55 additions & 0 deletions benchmarks/piqa.py
@@ -0,0 +1,55 @@
"""
PIQA (Physical Interaction: Question Answering)
Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi
https://arxiv.org/abs/1911.11641
# eval piqa validation set
inspect eval piqa.py
"""

from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import answer
from inspect_ai.solver import multiple_choice


def record_to_sample(record):
return Sample(
input=record["goal"],
target="A" if record["label"] == 0 else "B",
choices=[record["sol1"], record["sol2"]],
)


TEMPLATE = r"""
The entire content of your response should be of the following format: 'ANSWER:
$LETTER' (without quotes) where LETTER is one of {letters}.
Given either a question or a statement followed by two possible solutions
labelled A and B, choose the most appropriate solution. If a question is given,
the solutions answer the question. If a statement is given, the solutions
explain how to achieve the statement.
{question}
{choices}
""".strip()


@task
def piqa():
dataset = hf_dataset(
path="piqa",
sample_fields=record_to_sample,
trust=True,
split="validation",
shuffle=True,
)

return Task(
dataset=dataset,
plan=[multiple_choice(template=TEMPLATE)],
scorer=answer("letter"),
)
10 changes: 5 additions & 5 deletions docs/datasets.qmd
@@ -16,7 +16,7 @@ Of course, many real-world datasets won't be so trivial to read. Below we'll dis

## Dataset Samples

The core data type underlying the use of datasets with Inspect is the `Sample`. A sample has an `input`, a `target`, an optional `id`, and an optional collection of `metadata`.
The core data type underlying the use of datasets with Inspect is the `Sample`, which consists of a required `input` field and several other optional fields:

**Class** `inspect_ai.dataset.Sample`
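
For example, a couple of samples constructed directly (a minimal sketch using the fields described above):

``` python
from inspect_ai.dataset import Sample

# only input is required
simple = Sample(input="What is the capital of France?", target="Paris")

# with the optional id and metadata fields
annotated = Sample(
    input="Say hello.",
    target="hello",
    id=1,
    metadata={"category": "greeting"},
)
```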

@@ -43,9 +43,9 @@ Can be read directly with:
dataset = csv_dataset("security_guide.csv")
```

Note that samples from datasets without and `id` field will automatically be assigned ids based on an auto-incrementing integer starting with 1.
Note that samples from datasets without an `id` field will automatically be assigned ids based on an auto-incrementing integer starting with 1.

If your samples include `choices`, then the label should be a numeric index into the available `choices` rather a letter (this is an implicit assumption of the `multiple_choice()` solver).
If your samples include `choices`, then the `target` should be a numeric index into the available `choices` rather than a letter (this is an implicit assumption of the `multiple_choice()` solver).

## Field Mapping

@@ -65,7 +65,7 @@ dataset = json_dataset(
)
```

If you need to do more than just map field names and actually do custom processing of the data, you can instead pass a function which takes an `index` and `record` (represented as a `dict`) from the underlying file and returns a `Sample`. For example:
If you need to do more than just map field names and actually do custom processing of the data, you can instead pass a function which takes a `record` (represented as a `dict`) from the underlying file and returns a `Sample`. For example:

``` python
from inspect_ai.dataset import Sample, json_dataset
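
# a sketch of such a function (the field names below are illustrative, and we
# assume json_dataset() accepts the same sample_fields argument as hf_dataset())
def record_to_sample(record):
    return Sample(
        input=record["question"],
        target=record["answer"],
    )

dataset = json_dataset("dataset.jsonl", sample_fields=record_to_sample)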
@@ -154,7 +154,7 @@ Using S3 is mostly a matter of substituting S3 URLs (e.g. `s3://my-bucket-name`)
json_dataset("s3://my-bucket/dataset.jsonl")
```

S3 buckets are normally access controlled so require authentication to read from. There are a wide variety of ways to configure your client for AWS authentication, all of which work with Inspect. See the article on [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) for additional details
S3 buckets are normally access controlled so require authentication to read from. There are a wide variety of ways to configure your client for AWS authentication, all of which work with Inspect. See the article on [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) for additional details.

## Chat Messages

31 changes: 25 additions & 6 deletions docs/models.qmd
@@ -356,14 +356,33 @@ The `__init__()` method *must* call the `super().__init__()` method, and typical

The `generate()` method handles interacting with the model. In addition, there are some optional methods you can override to specify various behaviours and constraints (default max tokens and connections, identifying rate limit errors, etc.)

Once you've created the class and decorated it with `@modelapi` as shown above, you can reference it as follows:
### Model Registration

If you are publishing a custom model within a Python package, you should register an `inspect_ai` [setuptools entry point](https://setuptools.pypa.io/en/latest/userguide/entry_point.html). This will ensure that inspect loads your extension before it attempts to resolve a model name that uses your provider.

For example, if your package were named `custom_models` and your model provider were exported from a source file named `inspect_ai.py` at the root of the package, your `pyproject.toml` would look like this:

``` toml
[project.entry-points.inspect_ai]
custom_models = "custom_models.inspect_ai"
```
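
The module referenced by the entry point simply needs to import (and thereby register) your provider. As a rough sketch (the import path and `generate()` signature are assumptions and may differ in your version of Inspect), `custom_models/inspect_ai.py` might look like:

``` python
from inspect_ai.model import ModelAPI, modelapi


@modelapi(name="custom")
class CustomModelAPI(ModelAPI):
    def __init__(self, model_name: str, **model_args) -> None:
        # __init__() must call super().__init__()
        super().__init__(model_name=model_name)

    async def generate(self, input, tools, tool_choice, config):
        # interact with the underlying model and return its output
        ...
```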

### Using the Model

Once you've created the class, decorated it with `@modelapi` as shown above, and registered it, then you can use it as follows:

``` bash
inspect eval ctf.py --model custom/my-model
```

Where `my-model` is the name of some model supported by your provider (this will be passed to `__init__()` in the `model_name` argument).

You can also reference it from within Python calls to `get_model()` or `eval()`:

``` python
# get a model instance
model = get_model("custom/name-of-model")
model = get_model("custom/my-model")

# run an eval with the model
eval(math, model = "custom/name-of-model")
```

In this example, the `model_name` argument passed to `__init__()` will be "name-of-model".
eval(math, model = "custom/my-model")
```
21 changes: 12 additions & 9 deletions docs/scorers.qmd
@@ -73,7 +73,7 @@ def model_graded_qa(
instructions: str | None = None,
grade_pattern: str | None = None,
partial_credit: bool = False,
model: str | Model | None = None,
model: list[str | Model] | str | Model | None = None,
) -> Scorer:
...
```
@@ -95,7 +95,7 @@ The built-in model graded scorers also support using multiple grader models (whe

```python
model_graded_qa(
models = [
model = [
"google/gemini-1.0-pro",
"anthropic/claude-3-opus-20240229"
"together/meta-llama/Llama-3-70b-chat-hf",
@@ -173,7 +173,7 @@ Next, we'll take a look at the source code for a couple of the built in scorers
Here is the source code for the built-in `includes()` scorer:

``` python
@scorer(metrics=[accuracy(), bootstrap_str()]) # <1>
@scorer(metrics=[accuracy(), bootstrap_std()]) # <1>
def includes(ignore_case: bool = True):

async def score(state: TaskState, target: Target): # <2>
@@ -284,13 +284,16 @@ def average_score(scores: list[Score]) -> Score:
)
```

Which might be used like this:

Further, you will need to wrap your use of `multi_scorer()` inside a `@scorer` decorated function (with the requisite `metrics` specified). For example:

```python
multi_scorer(
scorers = [model_graded_qa(model=model) for model in models],
reducer = average_score
)
@scorer(metrics=[mean()])
def multi_model_graded(models):
    return multi_scorer(
scorers = [model_graded_qa(model=model) for model in models],
reducer = average_score
)
```
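
This wrapper can then be passed to a task like any other scorer, for example (a sketch re-using the grader models listed earlier):

```python
multi_model_graded(models = [
    "google/gemini-1.0-pro",
    "anthropic/claude-3-opus-20240229",
    "together/meta-llama/Llama-3-70b-chat-hf",
])
```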


@@ -408,4 +411,4 @@ scoring_logs = [score(log, model_graded_qa(model=model))
for model in grader_models]

plot_results(scoring_logs)
```
```
2 changes: 1 addition & 1 deletion docs/vscode.qmd
@@ -70,7 +70,7 @@ The Inspect Activity Bar provides an interface for tuning both global configurat

The activity bar has three panels:

- **Configuration** edits global configuration by reading and writing values from the workspace `.env` config file (see the documentation on [Configuration](#0) for more details on `.env` files).
- **Configuration** edits global configuration by reading and writing values from the workspace `.env` config file (see the documentation on [Configuration](workflow.qmd#configuration) for more details on `.env` files).

- **Task** provides a way to tweak the CLI arguments passed to `inspect eval` when it is run from the user interface.
