release v0.3.10
aisi-inspect committed May 29, 2024
1 parent 671a239 commit 140c3d6
Showing 121 changed files with 3,514 additions and 1,811 deletions.
23 changes: 21 additions & 2 deletions .github/workflows/build.yml
@@ -20,7 +20,26 @@ jobs:
- name: Lint and format with Ruff
uses: chartboost/ruff-action@v1

build:
mypy:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10", "3.11"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install .[dev]
- name: Run mypy
run: |
mypy --exclude tests/test_package src tests
test:
runs-on: ubuntu-latest
strategy:
matrix:
@@ -38,7 +57,7 @@ jobs:
python -m pip install .[dev]
- name: Test with pytest
run: |
pytest -rA -x --doctest-modules --color=yes --cov=inspect_ai
pytest -rA --doctest-modules --color=yes --cov=inspect_ai
package:
name: Build & inspect the package.
8 changes: 7 additions & 1 deletion .github/workflows/vscode.yml
@@ -2,14 +2,20 @@ on:
push:
tags:
- "v[0-9]*"
paths:
- "tools/vscode/**"
- ".github/workflows/vscode.yml"
branches:
- "main"
pull_request:
branches:
- "main"
paths:
- "tools/vscode/**"
- ".github/workflows/vscode.yml"
workflow_dispatch:

name: Deploy Extension
name: Build VS Code Ext
jobs:
deploy:
runs-on: ubuntu-latest
21 changes: 20 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,24 @@
# Changelog

## v0.3.10 (29 May 2024)

- **BREAKING:** The `pattern` scorer has been modified to match against any (or all) regex match groups. This replaces the previous behaviour, which, when there was more than one group, matched only the second group.
- Improved performance for Inspect View on very large datasets (virtualized sample list).
- ToolChoice `any` option to indicate the model should use at least one tool (supported by Anthropic and Mistral, mapped to `auto` for OpenAI).
- Tool calls can now return a simple scalar or `list[ContentText | ContentImage]`
- Support for updated Anthropic tools beta (tool_choice and image tool results)
- Report tool_error back to the model if it provides invalid JSON for tool call arguments (formerly this halted the entire eval with an error).
- New `max_samples` option to control how many samples are run in parallel (still defaults to running all samples in parallel).
- Add `boolq.py` benchmark.
- Add `piqa.py` benchmark.
- View: Improved markdown rendering (properly escape reference links)
- Improved typing for example_dataset function.
- Setuptools entry point for loading custom model extensions.
- Break optional `tuple` return out of `ToolResult` type.
- Bugfix: always read original sample message(s) for `TaskState.input_text`.
- Bugfix: remove write counter from log (could have resulted in incomplete/invalid logs propagating to the viewer)
- Bugfix: handle task names that include spaces in log viewer.

## v0.3.9 (14 May 2024)

- Add `ollama` local model provider.
@@ -49,7 +68,7 @@
- `Score` now has an optional `answer` field, which denotes the answer text extracted from model output.
- Accuracy metrics now take an optional `ValueToFloat` function for customizing how textual values are mapped to float.
- Made `model_graded_qa` more flexible with separate `instruction` template and `grade_pattern`, as well as providing `partial_credit` as an option.
- Modify the default templates for `chain_of_thought()` and `self_critique()` to instruct the model to reply with `ANSWER: $ANSWER` at the end on its own line.
- Modify the default templates for `chain_of_thought()` and `self_critique()` to instruct the model to reply with `ANSWER: $ANSWER` at the end on its own line.
- Improved numeric extraction for `match(numeric=True)` (better currency and decimal handling).
- Improve `answer()` patterns so that they detect letter and word answers both within and at the end of model output.
- `Plan` now has an optional `cleanup` function which can be used to free per-sample resources (e.g. Docker containers) even in the case of an evaluation error.
39 changes: 23 additions & 16 deletions benchmarks/README.md
@@ -1,26 +1,33 @@
## Benchmarks

This directory contains evals for several benchmarks, including:
This directory contains evals for several benchmarks. Datasets for evals are not embedded in the repository but are rather stored using either Git-LFS or Hugging Face (see below for instructions on the additional requirements for reading datasets from these sources).

| Benchmark | Reference | Code |
|------------------------|------------------------|-----------------------:|
| MMLU: Measuring Massive Multitask Language Understanding | <https://arxiv.org/abs/2009.03300> | [mmlu.py](mmlu.py) |
| MATH: Measuring Mathematical Problem Solving With the MATH Dataset | <https://arxiv.org/abs/2103.03874> | [mathematics.py](mathematics.py) |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | <https://arxiv.org/abs/2311.12022> | [gpqa.py](gpqa.py) |
| ARC: AI2 Reasoning Challenge | <https://arxiv.org/abs/1803.05457> | [arc.py](arc.py) |
| GSM8K: Training Verifiers to Solve Math Word Problems | <https://arxiv.org/abs/2110.14168> | [gsm8k.py](gsm8k.py) |
| HellaSwag: Can a Machine Really Finish Your Sentence? | <https://arxiv.org/abs/1905.07830> | [hellaswag.py](hellaswag.py) |
| Benchmark | Reference | Code | Dataset |
|------------------------------------------------------------------------|------------------------------------|---------------------------------:|--------------|
| MMLU: Measuring Massive Multitask Language Understanding | <https://arxiv.org/abs/2009.03300> | [mmlu.py](mmlu.py) | Git-LFS |
| MATH: Measuring Mathematical Problem Solving With the MATH Dataset | <https://arxiv.org/abs/2103.03874> | [mathematics.py](mathematics.py) | Git-LFS |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | <https://arxiv.org/abs/2311.12022> | [gpqa.py](gpqa.py) | Hugging Face |
| ARC: AI2 Reasoning Challenge | <https://arxiv.org/abs/1803.05457> | [arc.py](arc.py) | Hugging Face |
| GSM8K: Training Verifiers to Solve Math Word Problems | <https://arxiv.org/abs/2110.14168> | [gsm8k.py](gsm8k.py) | Hugging Face |
| HellaSwag: Can a Machine Really Finish Your Sentence? | <https://arxiv.org/abs/1905.07830> | [hellaswag.py](hellaswag.py) | Hugging Face |
| PIQA: Physical Interaction: Question Answering | <https://arxiv.org/abs/1911.11641> | [piqa.py](piqa.py) | Hugging Face |
| BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions | <https://arxiv.org/abs/1905.10044> | [boolq.py](boolq.py) | Hugging Face |

The datasets for ARC, GSM8K, and HellaSwag are read from Hugging Face, so require the installation of the **datasets** package:
### Git-LFS Datasets

``` bash
$ pip install datasets
```

The datasets for MMLU and MATH are stored using [Git-LFS](https://git-lfs.com/). Once you have downloaded and installed LFS, switch to the repo source directory and run the following commands to sync the data from LFS:
To use these datasets, first download and install [Git-LFS](https://git-lfs.com/). Once you have done this, switch to the repo source directory and run the following commands to sync the data from LFS:

``` bash
$ cd inspect_ai
$ git lfs fetch --all
$ git lfs pull
```
```

### Hugging Face Datasets

To use these datasets you need to install the **datasets** package as follows:

``` bash
$ pip install datasets
```
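
Once **datasets** is installed, benchmarks read from Hugging Face via Inspect's `hf_dataset()` function. As a rough sketch (mirroring the `boolq.py` benchmark below):

``` python
from inspect_ai.dataset import Sample, hf_dataset

# read the validation split of boolq, mapping each record to a Sample
dataset = hf_dataset(
    path="boolq",
    split="validation",
    sample_fields=lambda record: Sample(
        input=record["question"],
        target="Yes" if record["answer"] else "No",
    ),
)
```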

47 changes: 47 additions & 0 deletions benchmarks/boolq.py
@@ -0,0 +1,47 @@
"""
BoolQ
Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins,
Kristina Toutanova
https://arxiv.org/abs/1905.10044
# Run against the boolq validation dataset
inspect eval boolq.py
"""

from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import pattern
from inspect_ai.solver import generate, prompt_template

TEMPLATE = r"""
Answer the following question with either Yes or No. Include nothing else in your response.
Question: {prompt}
"""


def record_to_sample(record):
if record["answer"]:
target = "Yes"
else:
target = "No"

return Sample(input=record["question"], target=target)


@task
def boolq():
dataset = hf_dataset(
path="boolq",
sample_fields=record_to_sample,
split="validation",
shuffle=True,
)

return Task(
dataset=dataset,
plan=[prompt_template(template=TEMPLATE), generate()],
scorer=pattern(r"(Yes|No).?\Z"),
)
55 changes: 55 additions & 0 deletions benchmarks/piqa.py
@@ -0,0 +1,55 @@
"""
PIQA (Physical Interaction: Question Answering)
Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi
https://arxiv.org/abs/1911.11641
# eval piqa validation set
inspect eval piqa.py
"""

from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import answer
from inspect_ai.solver import multiple_choice


def record_to_sample(record):
return Sample(
input=record["goal"],
target="A" if record["label"] == 0 else "B",
choices=[record["sol1"], record["sol2"]],
)


TEMPLATE = r"""
The entire content of your response should be of the following format: 'ANSWER:
$LETTER' (without quotes) where LETTER is one of {letters}.
Given either a question or a statement followed by two possible solutions
labelled A and B, choose the most appropriate solution. If a question is given,
the solutions answer the question. If a statement is given, the solutions
explain how to achieve the statement.
{question}
{choices}
""".strip()


@task
def piqa():
dataset = hf_dataset(
path="piqa",
sample_fields=record_to_sample,
trust=True,
split="validation",
shuffle=True,
)

return Task(
dataset=dataset,
plan=[multiple_choice(template=TEMPLATE)],
scorer=answer("letter"),
)
10 changes: 5 additions & 5 deletions docs/datasets.qmd
@@ -16,7 +16,7 @@ Of course, many real-world datasets won't be so trivial to read. Below we'll dis

## Dataset Samples

The core data type underlying the use of datasets with Inspect is the `Sample`. A sample has an `input`, a `target`, an optional `id`, and an optional collection of `metadata`.
The core data type underlying the use of datasets with Inspect is the `Sample`, which consists of a required `input` field and several other optional fields:

**Class** `inspect_ai.dataset.Sample`
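
For example, a couple of samples constructed directly (a minimal sketch using the fields described above):

``` python
from inspect_ai.dataset import Sample

# only input is required
simple = Sample(input="What is the capital of France?", target="Paris")

# with the optional id and metadata fields
annotated = Sample(
    input="Say hello.",
    target="hello",
    id=1,
    metadata={"category": "greeting"},
)
```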

@@ -43,9 +43,9 @@ Can be read directly with:
dataset = csv_dataset("security_guide.csv")
```

Note that samples from datasets without and `id` field will automatically be assigned ids based on an auto-incrementing integer starting with 1.
Note that samples from datasets without an `id` field will automatically be assigned ids based on an auto-incrementing integer starting with 1.

If your samples include `choices`, then the label should be a numeric index into the available `choices` rather a letter (this is an implicit assumption of the `multiple_choice()` solver).
If your samples include `choices`, then the `target` should be a numeric index into the available `choices` rather than a letter (this is an implicit assumption of the `multiple_choice()` solver).

## Field Mapping

@@ -65,7 +65,7 @@ dataset = json_dataset(
)
```

If you need to do more than just map field names and actually do custom processing of the data, you can instead pass a function which takes an `index` and `record` (represented as a `dict`) from the underlying file and returns a `Sample`. For example:
If you need to do more than just map field names and actually do custom processing of the data, you can instead pass a function which takes a `record` (represented as a `dict`) from the underlying file and returns a `Sample`. For example:

``` python
from inspect_ai.dataset import Sample, json_dataset
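
# a sketch of such a function (the field names below are illustrative, and we
# assume json_dataset() accepts the same sample_fields argument as hf_dataset())
def record_to_sample(record):
    return Sample(
        input=record["question"],
        target=record["answer"],
    )

dataset = json_dataset("dataset.jsonl", sample_fields=record_to_sample)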
@@ -154,7 +154,7 @@ Using S3 is mostly a matter of substituting S3 URLs (e.g. `s3://my-bucket-name`)
json_dataset("s3://my-bucket/dataset.jsonl")
```

S3 buckets are normally access controlled so require authentication to read from. There are a wide variety of ways to configure your client for AWS authentication, all of which work with Inspect. See the article on [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) for additional details
S3 buckets are normally access controlled so require authentication to read from. There are a wide variety of ways to configure your client for AWS authentication, all of which work with Inspect. See the article on [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) for additional details.

## Chat Messages

31 changes: 25 additions & 6 deletions docs/models.qmd
@@ -356,14 +356,33 @@ The `__init__()` method *must* call the `super().__init__()` method, and typical

The `generate()` method handles interacting with the model. In addition, there are some optional methods you can override to specify various behaviours and constraints (default max tokens and connections, identifying rate limit errors, etc.)

Once you've created the class and decorated it with `@modelapi` as shown above, you can reference it as follows:
### Model Registration

If you are publishing a custom model within a Python package, you should register an `inspect_ai` [setuptools entry point](https://setuptools.pypa.io/en/latest/userguide/entry_point.html). This will ensure that inspect loads your extension before it attempts to resolve a model name that uses your provider.

For example, if your package were named `custom_models` and your model provider were exported from a source file named `inspect_ai.py` at the root of the package, your `pyproject.toml` would look like this:

``` toml
[project.entry-points.inspect_ai]
custom_models = "custom_models.inspect_ai"
```
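
The module referenced by the entry point simply needs to import (and thereby register) your provider. As a rough sketch (the import path and `generate()` signature are assumptions and may differ in your version of Inspect), `custom_models/inspect_ai.py` might look like:

``` python
from inspect_ai.model import ModelAPI, modelapi


@modelapi(name="custom")
class CustomModelAPI(ModelAPI):
    def __init__(self, model_name: str, **model_args) -> None:
        # __init__() must call super().__init__()
        super().__init__(model_name=model_name)

    async def generate(self, input, tools, tool_choice, config):
        # interact with the underlying model and return its output
        ...
```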

### Using the Model

Once you've created the class, decorated it with `@modelapi` as shown above, and registered it, then you can use it as follows:

``` bash
inspect eval ctf.py --model custom/my-model
```

Where `my-model` is the name of some model supported by your provider (this will be passed to `__init__()` in the `model_name` argument).

You can also reference it from within Python calls to `get_model()` or `eval()`:

``` python
# get a model instance
model = get_model("custom/name-of-model")
model = get_model("custom/my-model")

# run an eval with the model
eval(math, model = "custom/name-of-model")
```

In this example, the `model_name` argument passed to `__init__()` will be "name-of-model".
eval(math, model = "custom/my-model")
```
21 changes: 12 additions & 9 deletions docs/scorers.qmd
@@ -73,7 +73,7 @@ def model_graded_qa(
instructions: str | None = None,
grade_pattern: str | None = None,
partial_credit: bool = False,
model: str | Model | None = None,
model: list[str | Model] | str | Model | None = None,
) -> Scorer:
...
```
@@ -95,7 +95,7 @@ The built-in model graded scorers also support using multiple grader models (whe

```python
model_graded_qa(
models = [
model = [
"google/gemini-1.0-pro",
"anthropic/claude-3-opus-20240229"
"together/meta-llama/Llama-3-70b-chat-hf",
@@ -173,7 +173,7 @@ Next, we'll take a look at the source code for a couple of the built in scorers
Here is the source code for the built-in `includes()` scorer:

``` python
@scorer(metrics=[accuracy(), bootstrap_str()]) # <1>
@scorer(metrics=[accuracy(), bootstrap_std()]) # <1>
def includes(ignore_case: bool = True):

async def score(state: TaskState, target: Target): # <2>
@@ -284,13 +284,16 @@ def average_score(scores: list[Score]) -> Score:
)
```

Which might be used like this:

Further, you will need to wrap your use of `multi_scorer()` inside a `@scorer` decorated function (with the requisite `metrics` specified). For example:

```python
multi_scorer(
scorers = [model_graded_qa(model=model) for model in models],
reducer = average_score
)
@scorer(metrics=[mean()])
def multi_model_graded(models):
    return multi_scorer(
scorers = [model_graded_qa(model=model) for model in models],
reducer = average_score
)
```
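
This wrapper can then be passed to a task like any other scorer, for example (a sketch re-using the grader models listed earlier):

```python
multi_model_graded(models = [
    "google/gemini-1.0-pro",
    "anthropic/claude-3-opus-20240229",
    "together/meta-llama/Llama-3-70b-chat-hf",
])
```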


@@ -408,4 +411,4 @@ scoring_logs = [score(log, model_graded_qa(model=model))
for model in grader_models]

plot_results(scoring_logs)
```
```
2 changes: 1 addition & 1 deletion docs/vscode.qmd
@@ -70,7 +70,7 @@ The Inspect Activity Bar provides an interface for tuning both global configurat

The activity bar has three panels:

- **Configuration** edits global configuration by reading and writing values from the workspace `.env` config file (see the documentation on [Configuration](#0) for more details on `.env` files).
- **Configuration** edits global configuration by reading and writing values from the workspace `.env` config file (see the documentation on [Configuration](workflow.qmd#configuration) for more details on `.env` files).

- **Task** provides a way to tweak the CLI arguments passed to `inspect eval` when it is run from the user interface.
