
Commit

Merge branch 'main' into feature/headless-browser-tool
jjallaire-aisi authored Sep 26, 2024
2 parents 5b92689 + e0b25b7 commit 0df08df
Showing 229 changed files with 3,049 additions and 2,087 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,7 @@ default_language_version:
python: python3.11
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.6.5
rev: v0.6.7
hooks:
# Run the linter.
- id: ruff
30 changes: 30 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,35 @@
# Changelog

## Unreleased

- Capture solver input params for subtasks created by `fork()`.
- Allow Docker sandboxes configured with `x-default` to be referred to by their declared service name.
- Require a `max_messages` for use of `basic_agent()` (as without it, the agent could end up in an infinite loop).
- Track sample task state in solver decorator rather than solver transcript.
- Display solver input parameters for forked subtasks.
- Improvements to docker compose down cleanup: timeout, survive missing compose files.

## v0.3.32 (25 September 2024)

- Fix issue w/ subtasks not getting a fresh store() (regression from introduction of `fork()` in v0.3.30)
- Fix issue w/ subtasks that return None invalidating the log file.
- Make subtasks collapsible in Inspect View.
- Improved error reporting for missing `web_search()` provider environment variables.

## v0.3.31 (24 September 2024)

- Deprecated `Plan` in favor of `Solver` (with `chain()` function to compose multiple solvers).
- Added `max_tool_output` generation option (defaults to 16KB).
- Improve performance of `header_only` log reading (switch from json-stream to ijson).
- Add support for 0 retries to `eval-set` (run a single `eval` then stop).
- Tool calling fixes for the update to the Mistral v1.1 client.
- Always show `epochs` in task status (formerly it wasn't included in the multiple task display).
- Render transcript `info()` strings as markdown.
- Eliminate log spam from spurious grpc fork message.
- Fix issue with hf_dataset shuffle=True not actually shuffling.
- Improved error handling when loading invalid setuptools entrypoints.
- Don't catch TypeError when calling tools (we now handle this in other ways).

## v0.3.30 (18 September 2024)

- Added [fork()](agents-api.qmd#sec-forking) function to fork a `TaskState` and evaluate it against multiple solvers in parallel.
13 changes: 13 additions & 0 deletions CITATION.cff
@@ -0,0 +1,13 @@
cff-version: 1.2.0
title: 'Inspect AI: Framework for Large Language Model Evaluations'
message: >-
If you cite this software, please do so using the
metadata from this file.
type: software
authors:
- name: UK AI Safety Institute
website: 'https://www.aisi.gov.uk/'
repository-code: 'https://github.com/UKGovernmentBEIS/inspect_ai'
url: 'https://inspect.ai-safety-institute.org.uk/'
license: MIT
date-released: "2024-05-10"
37 changes: 22 additions & 15 deletions docs/_tools-scaffold.md
@@ -20,23 +20,30 @@ state.messages.append(output.message)
state.messages.extend(call_tools(output.message, state.tools))
```

This does everything that default `generate()` does, save for an outer loop to continue calling the model as long as it continues calling tools. You could implement the outer loop as follows:
This does everything that default `generate()` does, save for an outer loop to continue calling the model as long as it continues calling tools. This is a complete solver agent that implements the outer loop:

``` python
model = get_model()
while True:
    # call model
    output = await model.generate(state.messages, state.tools)

    # update state
    state.output = output
    state.messages.append(output.message)

    # make tool calls or terminate if there are none
    if output.message.tool_calls:
        state.messages.extend(call_tools(output.message, state.tools))
    else:
        break
@solver
def agent_loop():
    async def solve(state: TaskState, generate: Generate):
        model = get_model()
        while True:
            # call model
            output = await model.generate(state.messages, state.tools)

            # update state
            state.output = output
            state.messages.append(output.message)

            # make tool calls or terminate if there are none
            if output.message.tool_calls:
                state.messages.extend(call_tools(output.message, state.tools))
            else:
                break

        return state

    return solve
```
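
Once defined, a solver like this can be passed as the `solver` for a task in the usual way. Here is a minimal sketch (the dataset file, tools, and scorer below are placeholders rather than part of the scaffold itself):

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes
from inspect_ai.solver import use_tools
from inspect_ai.tool import bash

@task
def my_agent_task():
    return Task(
        dataset=json_dataset("dataset.jsonl"),   # placeholder dataset
        solver=[
            use_tools([bash(timeout=180)]),      # make tools available to the loop
            agent_loop()
        ],
        scorer=includes(),                       # placeholder scorer
    )
```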

You can imagine several ways you might want to customise this loop:
12 changes: 6 additions & 6 deletions docs/agents-api.qmd
@@ -11,13 +11,13 @@ This article describes advanced Inspect APIs available for creating evaluations
5. Delegating work to sub-tasks
6. Sandboxing arbitrary code execution

We'll assume that you already understand Inspect [Solvers](solvers.qmd) and [Tools](tools.qmd) (please review those articles as required before proceeding).
We'll assume that you have already covered the basics of [Solvers](solvers.qmd), [Tools](tools.qmd), and [Agents](agents.qmd) (please review those articles as required before proceeding).

## Use of `metadata`

Before proceeding, it's important to point out that some of the features described below were previously approximated by using the `metadata` field of `TaskState`; specifically, `metadata` was often used as a catch-all storage location for:

- Carrying state between solvers and sometimes tools.
- Sharing state between solvers.
- Providing a place to log additional structured data.
- Recording calls to "helper" models used for elicitation or scoring.

@@ -138,7 +138,7 @@ from inspect_ai.log import transcript
transcript().info("here is some custom info")
```

You can pass arbitrary JSON serialisable objects to `info()`.
Strings passed to `info()` will be rendered as markdown. In addition to strings, you can also pass arbitrary JSON serialisable objects to `info()`.
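
For example (a small sketch; the strings and fields below are purely illustrative):

``` python
from inspect_ai.log import transcript

# markdown strings are rendered in the transcript view
transcript().info("**Search phase complete**: proceeding to synthesis")

# arbitrary JSON serialisable objects can also be logged
transcript().info({"query": "capital of France", "results_found": 3})
```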

### Grouping with Steps

@@ -216,11 +216,11 @@ Note that we don't `await` the subtasks when building up our list of `searches`.

### Forking {#sec-forking}

Inspect's `fork()` function provides a convenient wrapper around a very common use of subtasks: running a `TaskState` against a set of solvers in parallel to explore different trajectories.

For example, let's say you have a solver named `explore()` that takes `temperature` as a parameter. You might want to try the solver out with multiple temperature values and then continue on with the best result:

``` python
from inspect_ai.solver import fork

results = await fork(state, [
@@ -241,7 +241,7 @@ Many agents provide models with the ability to execute arbitrary code. It's impo
def file_probe():
    return Task(
        dataset=dataset,
        plan=[
        solver=[
            use_tools([list_files()]),
            generate()
        ],
39 changes: 19 additions & 20 deletions docs/agents.qmd
@@ -4,6 +4,8 @@

Agents combine planning, memory, and tool usage to pursue more complex, longer horizon tasks (e.g. a [Capture the Flag](https://en.wikipedia.org/wiki/Capture_the_flag_(cybersecurity)) challenge). Agents are an area of active research, and many schemes for implementing them have been developed, including [AutoGPT](https://arxiv.org/abs/2306.02224), [ReAct](https://arxiv.org/pdf/2210.03629.pdf), and [Reflexion](https://arxiv.org/pdf/2303.11366.pdf).

An agent isn't a special construct within Inspect; it's merely a solver that includes tool use and calls `generate()` internally to interact with the model.

Inspect supports a variety of approaches to agent evaluations, including:

1. Using Inspect's built-in `basic_agent()`.
@@ -12,8 +14,6 @@ Inspect supports a variety of approaches to agent evaluations, including:

3. Adapting an agent provided by a research paper or open source library (for example, using a 3rd party agent library like [LangChain](https://python.langchain.com/docs/modules/agents/) or [Langroid](https://langroid.github.io/langroid/)).

We'll cover the basics of all of these approaches below.

An important additional consideration for agent evaluations is sandboxing (providing a secure environment for models to execute code within). The [Sandbox Environments](#sec-sandbox-environments) section goes into more depth on this.

## Basic Agent {#sec-basic-agent}
@@ -22,17 +22,17 @@ The `basic_agent()` provides a ReAct tool loop with support for retries and encou

1. When developing tasks and datasets it's convenient to have a ready made agent that you know that will competently navigate your task.

2. When developing custom agents, it's a good idea to start out with an idea of how the model performs using its native planning and eliciatation capabilities. The basic agent is a good way to establish this baseline.
2. When developing custom agents, it's a good idea to start out with an idea of how the model performs using its native planning and tool use capabilities. The basic agent is a good way to establish this baseline.

3. It provides a sound basis for comparison of the native agentic capabilities of models both over time and across providers.

The basic agent incorporates best practices for giving models some additional resilience and persistence, both through the optional `max_attempts` parameter, as well as by continuing the task even when the model stops making tool calls. The basic agent can frequently match or exeed custom scaffolds, so you should always try it as a baseline for your tasks!
The basic agent incorporates best practices for giving models some additional resilience and persistence, both through the optional `max_attempts` parameter, as well as by continuing the task even when the model stops making tool calls. The basic agent can frequently match or exceed custom scaffolds, so you should always try it as a baseline for your tasks!

Note that when using the basic agent you should *always* set a `max_messages` so that there is some termination point if the model gets off track or stuck in a loop.

### Example

Here is an example use of `basic_agent()` as the `plan` for a CTF evaluation:
Here is an example use of `basic_agent()` as the `solver` for a CTF evaluation:

``` python
from inspect_ai import Task, task
@@ -54,7 +54,7 @@ you are going to use and how they fit into your plan. # <1>
def ctf():
return Task(
dataset=json_dataset("ctf.json"),
plan=basic_agent(
solver=basic_agent(
init=system_message(SYSTEM_MESSAGE),
tools=[bash(timeout=180), python(timeout=180)], # <2>
max_attempts=3, # <3>
@@ -92,22 +92,21 @@ There are several options available for customising the behaviour of the basic a

For multiple attempts, submissions are evaluated using the task's main scorer, with a value of 1.0 indicating a correct answer. Scorer values are converted to float (e.g. "C" becomes 1.0) using the standard `value_to_float()` function. Provide an alternate conversion scheme as required via `score_value`.
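
For example, here is a minimal sketch of passing a custom conversion function (the `strict_value_to_float()` helper and the system prompt are hypothetical, shown only for illustration):

``` python
from inspect_ai.solver import basic_agent, system_message
from inspect_ai.tool import bash

def strict_value_to_float(value):
    # count only a fully correct answer ("C" / 1.0); treat anything else,
    # including partial credit, as incorrect
    return 1.0 if value in ("C", 1, 1.0) else 0.0

solver = basic_agent(
    init=system_message("Solve the challenge, then submit the flag."),
    tools=[bash(timeout=180)],
    max_attempts=3,
    score_value=strict_value_to_float,
)
```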


## Custom Scaffold {#sec-custom-scaffolding}

The basic agent demonstrated above will work well for some tasks, but in other cases you may need to provide more custom logic. For example, you might want to:
The basic agent demonstrated above will work well for some tasks, but in other cases you may want to provide more custom logic. For example, you might want to:

{{< include _tools-scaffold.md >}}

### Tool Filtering

While it's possible to make tools globally available to the model via `use_tools()`, you may also want to filter the available tools either based on task stages or dynamically based on some other criteria.

Here's an example of a `Solver` that filters the available tools between calls to `generate()`:
Here's an example of a solver agent that filters the available tools between calls to `generate()`:

``` python
@solver
def generate_ctf():
def ctf_agent():
async def solve(state: TaskState, generate: Generate):

# first pass w/ core tools
@@ -128,8 +127,6 @@ def generate_ctf():
return solve
```

In this example we rely on the default `generate()` tool calling behaviour (`"loop"`). However, you can also imagine combining tool filtering with the more tailored tool calling logic described in [Tool Calls](#sec-tool-calls).

### Agents API

For more sophisticated agents, Inspect offers several additional advanced APIs for state management, sub-agents, and fine grained logging. See the [Agents API](agents-api.qmd) article for additional details.
@@ -259,7 +256,7 @@ Finally, here's a task that uses the `wikipedia_search()` solver:
def wikipedia() -> Task:
    return Task(
        dataset=json_dataset("wikipedia.jsonl"),
        plan=wikipedia_search(),
        solver=wikipedia_search(),
        scorer=model_graded_fact(),
    )
```
@@ -327,7 +324,7 @@ dataset = [
def file_probe():
    return Task(
        dataset=dataset,
        plan=[
        solver=[
            use_tools([list_files()]),
            generate()
        ],
@@ -360,7 +357,7 @@ There are two sandbox environments built in to Inspect:

Sandbox environment definitions can be bound at the `Sample`, `Task`, or `eval()` level. Binding precedence goes from `eval()` to `Task` to `Sample`; however, sandbox config files defined on the `Sample` always take precedence when the sandbox type for the `Sample` is the same as the enclosing `Task` or `eval()`.

Here is a `Task` that defines a `sandbox` and corresponding sandbox config file:
Here is a `Task` that defines a `sandbox`:

``` python
Task(
@@ -370,16 +367,18 @@ Task(
generate()
]),
scorer=match(),
sandbox=("docker", "compose.yaml")
sandbox="docker"
)
```

In this example there is a `compose.yaml` file in the task directory that will be used to provision Docker services (if there is no config file specified then the Docker default Python 3.12 image will be used). For example:
By default, any `Dockerfile` and/or `compose.yaml` file within the task directory will be automatically discovered and used. If your compose file has a different name then you can provide an override specification as follows:

``` python
sandbox="docker"
sandbox=("docker", "attacker-compose.yaml")
```

The configuration file added to the `sandbox` spec should always be a compose file (rather than a `Dockerfile`, which is always discovered automatically).

### Per Sample Setup

The `Sample` class includes `sandbox`, `files` and `setup` fields that are used to specify per-sample sandbox config, file assets, and setup logic.
Expand Down Expand Up @@ -500,7 +499,7 @@ sandbox("victim") # named sandbox environment
```

::: {.callout-note appearance="simple"}
If you define multiple sandbox environments you are *required* to name one of them "default" so that Inspect knows which environment to copy samples files to and resolve for calls to `sandbox()` without an argument.
If you define multiple sandbox environments you are *required* to name one of them "default" so that Inspect knows which environment to resolve for calls to `sandbox()` without an argument. Alternatively, you can add the `x-default` key to a service not named "default" to designate it as the default sandbox.
:::

#### Infrastructure
2 changes: 1 addition & 1 deletion docs/caching.qmd
@@ -25,7 +25,7 @@ For example, here we are iterating on our self critique template, so we cache th
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[
        solver=[
            chain_of_thought(),
            generate(cache = True),
            self_critique(CRITIQUE_TEMPLATE)
16 changes: 14 additions & 2 deletions docs/datasets.qmd
@@ -50,7 +50,19 @@ Note that samples from datasets without an `id` field will automatically be assi

If your samples include `choices`, then the `target` should be a numeric index into the available `choices` rather than a letter (this is an implicit assumption of the `multiple_choice()` solver).

If your samples include `files`, they will be copied into the default sandbox environment unless their name contains a prefix mapping them into another environment (e.g. "`victim:flag.txt": "flag.txt"`).
### Files

The `files` field maps container target file paths to file contents (where contents can be either a filesystem path, a URL, or a string with inline content). For example, to copy a local file named `flag.txt` into the container path `/shared/flag.txt` you would use this:

```python
"/shared/flag.txt": "flag.txt"
```

Files are copied into the default sandbox environment unless their name contains a prefix mapping them into another environment. For example, to copy into the `victim` container:

```python
"victim:/shared/flag.txt": "flag.txt"
```
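
Putting the two together, a sample that stages the same local file into both environments might look like this (a sketch; the input, target, and file names are illustrative):

```python
from inspect_ai.dataset import Sample

sample = Sample(
    input="Find the flag hidden on the shared volume.",
    target="my-example-flag",
    files={
        "/shared/flag.txt": "flag.txt",          # default sandbox
        "victim:/shared/flag.txt": "flag.txt",   # "victim" sandbox
    },
)
```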

## Field Mapping

@@ -243,7 +255,7 @@ dataset=MemoryDataset([
def security_guide():
    return Task(
        dataset=dataset,
        plan=[system_message(SYSTEM_MESSAGE), generate()],
        solver=[system_message(SYSTEM_MESSAGE), generate()],
        scorer=model_graded_fact(),
    )
```
4 changes: 2 additions & 2 deletions docs/errors-and-limits.qmd
@@ -24,7 +24,7 @@ In some cases you might wish to tolerate some number of errors without failing t
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
@@ -65,7 +65,7 @@ In open-ended model conversations (for example, an agent evaluation with tool usa
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
2 changes: 1 addition & 1 deletion docs/eval-sets.qmd
@@ -70,7 +70,7 @@ There are two fundamental requirements for dynamic tasks used with `eval_set()`:
1) They are created using an `@task` function as described above.
2) Their parameters use ordinary Python types (like `str`, `int`, `list`, etc.) as opposed to custom objects (which are hard to serialise consistently).

Note that you can pass a `plan` to an `@task` function, so long as it was created by a function decorated with `@plan`.
Note that you can pass a `solver` to an `@task` function, so long as it was created by a function decorated with `@solver`.

### Retry Options

1 change: 1 addition & 0 deletions docs/examples/index.qmd
@@ -14,6 +14,7 @@ aliases:
<nav id="TOC" role="doc-toc">
<ul>
<li><a href="#coding" id="toc-coding">Coding</a></li>
<li><a href="#assistants" id="toc-coding">Assistants</a></li>
<li><a href="#cybersecurity" id="toc-coding">Cybersecurity</a></li>
<li><a href="#mathematics" id="toc-mathematics">Mathematics</a></li>
<li><a href="#reasoning" id="toc-reasoning">Reasoning</a></li>