
Commit

Merge branch 'main' into feature/headless-browser-tool
jjallaire-aisi authored Sep 26, 2024
2 parents 5b92689 + e0b25b7 commit 0df08df
Showing 229 changed files with 3,049 additions and 2,087 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -5,7 +5,7 @@ default_language_version:
python: python3.11
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.6.5
rev: v0.6.7
hooks:
# Run the linter.
- id: ruff
30 changes: 30 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,35 @@
# Changelog

## Unreleased

- Capture solver input params for subtasks created by `fork()`.
- Allow Docker sandboxes configured with `x-default` to be referred to by their declared service name.
- Require a `max_messages` for use of `basic_agent()` (as without it, the agent could end up in an infinite loop).
- Track sample task state in solver decorator rather than solver transcript.
- Display solver input parameters for forked subtasks.
- Improvements to docker compose down cleanup: timeout, survive missing compose files.

## v0.3.32 (25 September 2024)

- Fix issue w/ subtasks not getting a fresh store() (regression from introduction of `fork()` in v0.3.30)
- Fix issue w/ subtasks that return None invalidating the log file.
- Make subtasks collapsible in Inspect View.
- Improved error reporting for missing `web_search()` provider environment variables.

## v0.3.31 (24 September 2024)

- Deprecated `Plan` in favor of `Solver` (with `chain()` function to compose multiple solvers).
- Added `max_tool_output` generation option (defaults to 16KB).
- Improve performance of `header_only` log reading (switch from json-stream to ijson).
- Add support for 0 retries to `eval-set` (run a single `eval` then stop).
- Tool calling fixes for the update to the Mistral v1.1 client.
- Always show `epochs` in task status (formerly it wasn't included in the multiple task display).
- Render transcript `info()` strings as markdown.
- Eliminate log spam from spurious grpc fork message.
- Fix issue with hf_dataset shuffle=True not actually shuffling.
- Improved error handling when loading invalid setuptools entrypoints.
- Don't catch TypeError when calling tools (we now handle this in other ways).

## v0.3.30 (18 September 2024)

- Added [fork()](agents-api.qmd#sec-forking) function to fork a `TaskState` and evaluate it against multiple solvers in parallel.
13 changes: 13 additions & 0 deletions CITATION.cff
@@ -0,0 +1,13 @@
cff-version: 1.2.0
title: 'Inspect AI: Framework for Large Language Model Evaluations'
message: >-
If you cite this software, please do so using the
metadata from this file.
type: software
authors:
- name: UK AI Safety Institute
website: 'https://www.aisi.gov.uk/'
repository-code: 'https://github.com/UKGovernmentBEIS/inspect_ai'
url: 'https://inspect.ai-safety-institute.org.uk/'
license: MIT
date-released: "2024-05-10"
37 changes: 22 additions & 15 deletions docs/_tools-scaffold.md
@@ -20,23 +20,30 @@ state.messages.append(output.message)
state.messages.extend(call_tools(output.message, state.tools))
```

This does everything that default `generate()` does, save for an outer loop to continue calling the model as long as it continues calling tools. You could implement the outer loop as follows:
This does everything that default `generate()` does, save for an outer loop to continue calling the model as long as it continues calling tools. This is a complete solver agent that implements the outer loop:

``` python
model = get_model()
while True:
    # call model
    output = await model.generate(state.messages, state.tools)

    # update state
    state.output = output
    state.messages.append(output.message)

    # make tool calls or terminate if there are none
    if output.message.tool_calls:
        state.messages.extend(call_tools(output.message, state.tools))
    else:
        break
@solver
def agent_loop():
    async def solve(state: TaskState, generate: Generate):
        model = get_model()
        while True:
            # call model
            output = await model.generate(state.messages, state.tools)

            # update state
            state.output = output
            state.messages.append(output.message)

            # make tool calls or terminate if there are none
            if output.message.tool_calls:
                state.messages.extend(call_tools(output.message, state.tools))
            else:
                break

        return state

    return solve
```
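
Once defined, a solver like this can be passed as the `solver` for a task in the usual way. Here is a minimal sketch (the dataset file, tools, and scorer below are placeholders rather than part of the scaffold itself):

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes
from inspect_ai.solver import use_tools
from inspect_ai.tool import bash

@task
def my_agent_task():
    return Task(
        dataset=json_dataset("dataset.jsonl"),   # placeholder dataset
        solver=[
            use_tools([bash(timeout=180)]),      # make tools available to the loop
            agent_loop()
        ],
        scorer=includes(),                       # placeholder scorer
    )
```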

You can imagine several ways you might want to customise this loop:
12 changes: 6 additions & 6 deletions docs/agents-api.qmd
@@ -11,13 +11,13 @@ This article describes advanced Inspect APIs available for creating evaluations
5. Delegating work to sub-tasks
6. Sandboxing arbitrary code execution

We'll assume that you already understand Inspect [Solvers](solvers.qmd) and [Tools](tools.qmd) (please review those articles as required before proceeding).
We'll assume that you have already covered the basics of [Solvers](solvers.qmd), [Tools](tools.qmd), and [Agents](agents.qmd) (please review those articles as required before proceeding).

## Use of `metadata`

Before proceeding, it's important to point out that some of the features described below were previously approximated by using the `metadata` field of `TaskState`; specifically, `metadata` was often used as a catch-all storage location for:

- Carrying state between solvers and sometimes tools.
- Sharing state between solvers.
- Providing a place to log additional structured data.
- Recording calls to "helper" models used for elicitation or scoring.

@@ -138,7 +138,7 @@ from inspect_ai.log import transcript
transcript().info("here is some custom info")
```

You can pass arbitrary JSON serialisable objects to `info()`.
Strings passed to `info()` will be rendered as markdown. In addition to strings, you can also pass arbitrary JSON serialisable objects to `info()`.
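
For example (a small sketch; the strings and fields below are purely illustrative):

``` python
from inspect_ai.log import transcript

# markdown strings are rendered in the transcript view
transcript().info("**Search phase complete**: proceeding to synthesis")

# arbitrary JSON serialisable objects can also be logged
transcript().info({"query": "capital of France", "results_found": 3})
```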

### Grouping with Steps

@@ -216,11 +216,11 @@ Note that we don't `await` the subtasks when building up our list of `searches`.

### Forking {#sec-forking}

Inspect's `fork()` function provides a convenient wrapper around a very common use of subtasks: running a `TaskState` against a set of solvers in parallel to explore different trajectories.

For example, let's say you have a solver named `explore()` that takes `temperature` as a parameter. You might want to try the solver out with multiple temperature values and then continue on with the best result:

``` python
from inspect_ai.solver import fork

results = await fork(state, [
@@ -241,7 +241,7 @@ Many agents provide models with the ability to execute arbitrary code. It's impo
def file_probe():
    return Task(
        dataset=dataset,
        plan=[
        solver=[
            use_tools([list_files()]),
            generate()
        ],
39 changes: 19 additions & 20 deletions docs/agents.qmd
@@ -4,6 +4,8 @@

Agents combine planning, memory, and tool usage to pursue more complex, longer horizon tasks (e.g. a [Capture the Flag](https://en.wikipedia.org/wiki/Capture_the_flag_(cybersecurity)) challenge). Agents are an area of active research, and many schemes for implementing them have been developed, including [AutoGPT](https://arxiv.org/abs/2306.02224), [ReAct](https://arxiv.org/pdf/2210.03629.pdf), and [Reflexion](https://arxiv.org/pdf/2303.11366.pdf).

An agent isn't a special construct within Inspect; it's merely a solver that includes tool use and calls `generate()` internally to interact with the model.

Inspect supports a variety of approaches to agent evaluations, including:

1. Using Inspect's built-in `basic_agent()`.
@@ -12,8 +14,6 @@ Inspect supports a variety of approaches to agent evaluations, including:

3. Adapting an agent provided by a research paper or open source library (for example, using a 3rd party agent library like [LangChain](https://python.langchain.com/docs/modules/agents/) or [Langroid](https://langroid.github.io/langroid/)).

We'll cover the basics of all of these approaches below.

An important additional consideration for agent evaluations is sandboxing (providing a secure environment for models to execute code within). The [Sandbox Environments](#sec-sandbox-environments) section goes into more depth on this.

## Basic Agent {#sec-basic-agent}
@@ -22,17 +22,17 @@ The `basic_agent()` provides a ReAct tool loop with support for retries and encou

1. When developing tasks and datasets it's convenient to have a ready made agent that you know that will competently navigate your task.

2. When developing custom agents, it's a good idea to start out with an idea of how the model performs using its native planning and eliciatation capabilities. The basic agent is a good way to establish this baseline.
2. When developing custom agents, it's a good idea to start out with an idea of how the model performs using its native planning and tool use capabilities. The basic agent is a good way to establish this baseline.

3. It provides a sound basis for comparison of the native agentic capabilities of models both over time and across providers.

The basic agent incorporates best practices for giving models some additional resilience and persistence, both through the optional `max_attempts` parameter, as well as by continuing the task even when the model stops making tool calls. The basic agent can frequently match or exeed custom scaffolds, so you should always try it as a baseline for your tasks!
The basic agent incorporates best practices for giving models some additional resilience and persistence, both through the optional `max_attempts` parameter, as well as by continuing the task even when the model stops making tool calls. The basic agent can frequently match or exceed custom scaffolds, so you should always try it as a baseline for your tasks!

Note that when using the basic agent you should *always* set a `max_messages` so that there is some termination point if the model gets off track or stuck in a loop.

### Example

Here is an example use of `basic_agent()` as the `plan` for a CTF evaluation:
Here is an example use of `basic_agent()` as the `solver` for a CTF evaluation:

``` python
from inspect_ai import Task, task
@@ -54,7 +54,7 @@ you are going to use and how they fit into your plan. # <1>
def ctf():
return Task(
dataset=json_dataset("ctf.json"),
plan=basic_agent(
solver=basic_agent(
init=system_message(SYSTEM_MESSAGE),
tools=[bash(timeout=180), python(timeout=180)], # <2>
max_attempts=3, # <3>
@@ -92,22 +92,21 @@ There are several options available for customising the behaviour of the basic a

For multiple attempts, submissions are evaluated using the task's main scorer, with a value of 1.0 indicating a correct answer. Scorer values are converted to float (e.g. "C" becomes 1.0) using the standard `value_to_float()` function. Provide an alternate conversion scheme as required via `score_value`.
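
For example, here is a minimal sketch of passing a custom conversion function (the `strict_value_to_float()` helper and the system prompt are hypothetical, shown only for illustration):

``` python
from inspect_ai.solver import basic_agent, system_message
from inspect_ai.tool import bash

def strict_value_to_float(value):
    # count only a fully correct answer ("C" / 1.0); treat anything else,
    # including partial credit, as incorrect
    return 1.0 if value in ("C", 1, 1.0) else 0.0

solver = basic_agent(
    init=system_message("Solve the challenge, then submit the flag."),
    tools=[bash(timeout=180)],
    max_attempts=3,
    score_value=strict_value_to_float,
)
```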


## Custom Scaffold {#sec-custom-scaffolding}

The basic agent demonstrated above will work well for some tasks, but in other cases you may need to provide more custom logic. For example, you might want to:
The basic agent demonstrated above will work well for some tasks, but in other cases you may want to provide more custom logic. For example, you might want to:

{{< include _tools-scaffold.md >}}

### Tool Filtering

While it's possible to make tools globally available to the model via `use_tools()`, you may also want to filter the available tools either based on task stages or dynamically based on some other criteria.

Here's an example of a `Solver` that filters the available tools between calls to `generate()`:
Here's an example of a solver agent that filters the available tools between calls to `generate()`:

``` python
@solver
def generate_ctf():
def ctf_agent():
async def solve(state: TaskState, generate: Generate):

# first pass w/ core tools
@@ -128,8 +127,6 @@ def generate_ctf():
return solve
```

In this example we rely on the default `generate()` tool calling behaviour (`"loop"`). However, you can also imagine combining tool filtering with the more tailored tool calling logic described in [Tool Calls](#sec-tool-calls).

### Agents API

For more sophisticated agents, Inspect offers several additional advanced APIs for state management, sub-agents, and fine grained logging. See the [Agents API](agents-api.qmd) article for additional details.
@@ -259,7 +256,7 @@ Finally, here's a task that uses the `wikipedia_search()` solver:
def wikipedia() -> Task:
    return Task(
        dataset=json_dataset("wikipedia.jsonl"),
        plan=wikipedia_search(),
        solver=wikipedia_search(),
        scorer=model_graded_fact(),
    )
```
@@ -327,7 +324,7 @@ dataset = [
def file_probe():
    return Task(
        dataset=dataset,
        plan=[
        solver=[
            use_tools([list_files()]),
            generate()
        ],
@@ -360,7 +357,7 @@ There are two sandbox environments built in to Inspect:

Sandbox environment definitions can be bound at the `Sample`, `Task`, or `eval()` level. Binding precedence goes from `eval()` to `Task` to `Sample`; however, sandbox config files defined on the `Sample` always take precedence when the sandbox type for the `Sample` is the same as the enclosing `Task` or `eval()`.

Here is a `Task` that defines a `sandbox` and corresponding sandbox config file:
Here is a `Task` that defines a `sandbox`:

``` python
Task(
@@ -370,16 +367,18 @@ Task(
generate()
]),
scorer=match(),
sandbox=("docker", "compose.yaml")
sandbox="docker"
)
```

In this example there is a `compose.yaml` file in the task directory that will be used to provision Docker services (if there is no config file specified then the Docker default Python 3.12 image will be used). For example:
By default, any `Dockerfile` and/or `compose.yaml` file within the task directory will be automatically discovered and used. If your compose file has a different name then you can provide an override specification as follows:

``` python
sandbox="docker"
sandbox=("docker", "attacker-compose.yaml")
```

The configuration file added to the `sandbox` spec should always be a compose file (rather than a `Dockerfile`, which is always discovered automatically).

### Per Sample Setup

The `Sample` class includes `sandbox`, `files` and `setup` fields that are used to specify per-sample sandbox config, file assets, and setup logic.
Expand Down Expand Up @@ -500,7 +499,7 @@ sandbox("victim") # named sandbox environment
```

::: {.callout-note appearance="simple"}
If you define multiple sandbox environments you are *required* to name one of them "default" so that Inspect knows which environment to copy samples files to and resolve for calls to `sandbox()` without an argument.
If you define multiple sandbox environments you are *required* to name one of them "default" so that Inspect knows which environment to resolve for calls to `sandbox()` without an argument. Alternatively, you can add the `x-default` key to a service not named "default" to designate it as the default sandbox.
:::

#### Infrastructure
2 changes: 1 addition & 1 deletion docs/caching.qmd
@@ -25,7 +25,7 @@ For example, here we are iterating on our self critique template, so we cache th
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[
        solver=[
            chain_of_thought(),
            generate(cache = True),
            self_critique(CRITIQUE_TEMPLATE)
16 changes: 14 additions & 2 deletions docs/datasets.qmd
@@ -50,7 +50,19 @@ Note that samples from datasets without an `id` field will automatically be assi

If your samples include `choices`, then the `target` should be a numeric index into the available `choices` rather than a letter (this is an implicit assumption of the `multiple_choice()` solver).

If your samples include `files`, they will be copied into the default sandbox environment unless their name contains a prefix mapping them into another environment (e.g. "`victim:flag.txt": "flag.txt"`).
### Files

The `files` field maps container target file paths to file contents (where contents can be either a filesystem path, a URL, or a string with inline content). For example, to copy a local file named `flag.txt` into the container path `/shared/flag.txt` you would use this:

```python
"/shared/flag.txt": "flag.txt"
```

Files are copied into the default sandbox environment unless their name contains a prefix mapping them into another environment. For example, to copy into the `victim` container:

```python
"victim:/shared/flag.txt": "flag.txt"
```
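
Putting the two together, a sample that stages the same local file into both environments might look like this (a sketch; the input, target, and file names are illustrative):

```python
from inspect_ai.dataset import Sample

sample = Sample(
    input="Find the flag hidden on the shared volume.",
    target="my-example-flag",
    files={
        "/shared/flag.txt": "flag.txt",          # default sandbox
        "victim:/shared/flag.txt": "flag.txt",   # "victim" sandbox
    },
)
```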

## Field Mapping

@@ -243,7 +255,7 @@ dataset=MemoryDataset([
def security_guide():
    return Task(
        dataset=dataset,
        plan=[system_message(SYSTEM_MESSAGE), generate()],
        solver=[system_message(SYSTEM_MESSAGE), generate()],
        scorer=model_graded_fact(),
    )
```
4 changes: 2 additions & 2 deletions docs/errors-and-limits.qmd
@@ -24,7 +24,7 @@ In some cases you might wish to tolerate some number of errors without failing t
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
@@ -65,7 +65,7 @@ In open-ended model conversations (for example, an agent evaluation with tool usa
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
2 changes: 1 addition & 1 deletion docs/eval-sets.qmd
@@ -70,7 +70,7 @@ There are two fundamental requirements for dynamic tasks used with `eval_set()`:
1) They are created using an `@task` function as described above.
2) Their parameters use ordinary Python types (like `str`, `int`, `list`, etc.) as opposed to custom objects (which are hard to serialise consistently).

Note that you can pass a `plan` to an `@task` function, so long as it was created by a function decorated with `@plan`.
Note that you can pass a `solver` to an `@task` function, so long as it was created by a function decorated with `@solver`.

### Retry Options

1 change: 1 addition & 0 deletions docs/examples/index.qmd
@@ -14,6 +14,7 @@ aliases:
<nav id="TOC" role="doc-toc">
<ul>
<li><a href="#coding" id="toc-coding">Coding</a></li>
<li><a href="#assistants" id="toc-coding">Assistants</a></li>
<li><a href="#cybersecurity" id="toc-coding">Cybersecurity</a></li>
<li><a href="#mathematics" id="toc-mathematics">Mathematics</a></li>
<li><a href="#reasoning" id="toc-reasoning">Reasoning</a></li>