updated docs for new --solver option(s) (#487)
* plan -> solver in docs, examples, and tests

* improve solver docs

* improve workflow

---------

Co-authored-by: J.J. Allaire <[email protected]>
Co-authored-by: aisi-inspect <[email protected]>
3 people authored Sep 24, 2024
1 parent 99ff4f1 commit 2289bd9
Showing 79 changed files with 466 additions and 265 deletions.
37 changes: 22 additions & 15 deletions docs/_tools-scaffold.md
@@ -20,23 +20,30 @@ state.messages.append(output.message)
state.messages.extend(await call_tools(output.message, state.tools))
```

This does everything that default `generate()` does, save for an outer loop to continue calling the model as long as it continues calling tools. You could implement the outer loop as follows:
This does everything that default `generate()` does, save for an outer loop to continue calling the model as long as it continues calling tools. This is a complete solver agent that implements the outer loop:

``` python
model = get_model()
while True:
    # call model
    output = await model.generate(state.messages, state.tools)

    # update state
    state.output = output
    state.messages.append(output.message)

    # make tool calls or terminate if there are none
    if output.message.tool_calls:
        state.messages.extend(await call_tools(output.message, state.tools))
    else:
        break
@solver
def agent_loop():
    async def solve(state: TaskState, generate: Generate):
        model = get_model()
        while True:
            # call model
            output = await model.generate(state.messages, state.tools)

            # update state
            state.output = output
            state.messages.append(output.message)

            # make tool calls or terminate if there are none
            if output.message.tool_calls:
                state.messages.extend(await call_tools(output.message, state.tools))
            else:
                break

        return state

    return solve
```
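
For context, a solver like this plugs directly into a task's `solver` parameter. A minimal sketch (the task name, dataset file, and scorer here are illustrative assumptions, not from the docs):

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes

@task
def my_ctf():
    return Task(
        dataset=json_dataset("ctf.json"),  # hypothetical dataset file
        solver=agent_loop(),
        scorer=includes(),
    )
```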

You can imagine several ways you might want to customise this loop:
12 changes: 6 additions & 6 deletions docs/agents-api.qmd
Original file line number Diff line number Diff line change
@@ -11,13 +11,13 @@ This article describes advanced Inspect APIs available for creating evaluations
5. Delegating work to sub-tasks
6. Sandboxing arbitrary code execution

We'll assume that you already understand Inspect [Solvers](solvers.qmd) and [Tools](tools.qmd) (please review those articles as required before proceeding).
We'll assume that you have already covered the basics of [Solvers](solvers.qmd), [Tools](tools.qmd), and [Agents](agents.qmd) (please review those articles as required before proceeding).

## Use of `metadata`

Before proceeding, it's important to point out that some of the features described below were previously approximated by using the `metadata` field of `TaskState`; specifically, `metadata` was often used as a catch-all storage location for:

- Carrying state between solvers and sometimes tools.
- Sharing state between solvers.
- Providing a place to log additional structured data.
- Recording calls to "helper" models used for elicitation or scoring.
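
Inspect's `Store` (one of the APIs this article covers) addresses the first of these uses directly. A minimal sketch, assuming the `store()` accessor from `inspect_ai.util`:

``` python
from inspect_ai.util import store

# read and write state shared across solvers for the current sample
attempts = store().get("attempts", 0)
store().set("attempts", attempts + 1)
```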

@@ -138,7 +138,7 @@ from inspect_ai.log import transcript
transcript().info("here is some custom info")
```

You can pass arbitrary JSON serialisable objects to `info()`.
Strings passed to `info()` will be rendered as markdown. In addition to strings, you can also pass arbitrary JSON serialisable objects to `info()`.
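
For example, a dict is recorded as structured data in the transcript (the keys here are purely illustrative):

``` python
from inspect_ai.log import transcript

# log structured data alongside the sample transcript
transcript().info({"query": "capital of France", "num_results": 3})
```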

### Grouping with Steps

@@ -216,11 +216,11 @@ Note that we don't `await` the subtasks when building up our list of `searches`.
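
A minimal sketch of that pattern, assuming a `search()` subtask and a `queries` list defined elsewhere (both hypothetical here):

``` python
import asyncio

# build up the coroutines without awaiting them...
searches = [search(query) for query in queries]

# ...then await them together so the subtasks run in parallel
results = await asyncio.gather(*searches)
```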

### Forking {#sec-forking}

Inspect's `fork()` function provides a convenient wrapper around a very common use of subtasks: running a `TaskState` against a set of solvers in parallel to explore different trajectories.

For example, let's say you have a solver named `explore()` that takes `temperature` as a parameter. You might want to try the solver out with multiple temperature values and then continue on with the best result:

``` python
from inspect_ai.solver import fork

results = await fork(state, [
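    # NOTE: the diff view truncates this example here; a plausible
    # completion, with illustrative temperature values, might be:
    explore(temperature=0.50),
    explore(temperature=0.75),
    explore(temperature=1.00)
])
```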
@@ -241,7 +241,7 @@ Many agents provide models with the ability to execute arbitrary code. It's impo
def file_probe():
    return Task(
        dataset=dataset,
        plan=[
        solver=[
            use_tools([list_files()]),
            generate()
        ],
25 changes: 11 additions & 14 deletions docs/agents.qmd
@@ -4,6 +4,8 @@

Agents combine planning, memory, and tool usage to pursue more complex, longer horizon tasks (e.g. a [Capture the Flag](https://en.wikipedia.org/wiki/Capture_the_flag_(cybersecurity)) challenge). Agents are an area of active research, and many schemes for implementing them have been developed, including [AutoGPT](https://arxiv.org/abs/2306.02224), [ReAct](https://arxiv.org/abs/2210.03629), and [Reflexion](https://arxiv.org/pdf/2303.11366.pdf).

An agent isn't a special construct within Inspect; it's merely a solver that includes tool use and calls `generate()` internally to interact with the model.

Inspect supports a variety of approaches to agent evaluations, including:

1. Using Inspect's built-in `basic_agent()`.
@@ -12,8 +14,6 @@ Inspect supports a variety of approaches to agent evaluations, including:

3. Adapting an agent provided by a research paper or open source library (for example, using a 3rd party agent library like [LangChain](https://python.langchain.com/docs/modules/agents/) or [Langroid](https://langroid.github.io/langroid/)).

We'll cover the basics of all of these approaches below.

An important additional consideration for agent evaluations is sandboxing (providing a secure environment for models to execute code within). The [Sandbox Environments](#sec-sandbox-environments) section goes into more depth on this.

## Basic Agent {#sec-basic-agent}
@@ -22,17 +22,17 @@ The `basic_agent()` provides a ReAct tool loop with support for retries and encou

1. When developing tasks and datasets it's convenient to have a ready-made agent that you know will competently navigate your task.

2. When developing custom agents, it's a good idea to start out with an idea of how the model performs using its native planning and eliciatation capabilities. The basic agent is a good way to establish this baseline.
2. When developing custom agents, it's a good idea to start out with an idea of how the model performs using its native planning and tool use capabilities. The basic agent is a good way to establish this baseline.

3. It provides a sound basis for comparison of the native agentic capabilities of models both over time and across providers.

The basic agent incorporates best practices for giving models some additional resilience and persistence, both through the optional `max_attempts` parameter, as well as by continuing the task even when the model stops making tool calls. The basic agent can frequently match or exeed custom scaffolds, so you should always try it as a baseline for your tasks!
The basic agent incorporates best practices for giving models some additional resilience and persistence, both through the optional `max_attempts` parameter, as well as by continuing the task even when the model stops making tool calls. The basic agent can frequently match or exceed custom scaffolds, so you should always try it as a baseline for your tasks!

Note that when using the basic agent you should *always* set a `max_messages` so that there is some termination point if the model gets off track or stuck in a loop.

### Example

Here is an example use of `basic_agent()` as the `plan` for a CTF evaluation:
Here is an example use of `basic_agent()` as the `solver` for a CTF evaluation:

``` python
from inspect_ai import Task, task
Expand All @@ -54,7 +54,7 @@ you are going to use and how they fit into your plan. # <1>
def ctf():
    return Task(
        dataset=json_dataset("ctf.json"),
        plan=basic_agent(
        solver=basic_agent(
            init=system_message(SYSTEM_MESSAGE),
            tools=[bash(timeout=180), python(timeout=180)], # <2>
            max_attempts=3, # <3>
@@ -92,22 +92,21 @@ There are several options available for customising the behaviour of the basic a

For multiple attempts, submissions are evaluated using the task's main scorer, with a value of 1.0 indicating a correct answer. Scorer values are converted to float (e.g. "C" becomes 1.0) using the standard `value_to_float()` function. Provide an alternate conversion scheme as required via `score_value`.
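
For reference, the default conversion can be exercised directly (a sketch, assuming `value_to_float` is exported from `inspect_ai.scorer`):

``` python
from inspect_ai.scorer import value_to_float

to_float = value_to_float()
to_float("C")  # 1.0 (correct)
to_float("P")  # 0.5 (partial credit)
to_float("I")  # 0.0 (incorrect)
```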


## Custom Scaffold {#sec-custom-scaffolding}

The basic agent demonstrated above will work well for some tasks, but in other cases you may need to provide more custom logic. For example, you might want to:
The basic agent demonstrated above will work well for some tasks, but in other cases you may want to provide more custom logic. For example, you might want to:

{{< include _tools-scaffold.md >}}

### Tool Filtering

While it's possible to make tools globally available to the model via `use_tools()`, you may also want to filter the available tools either based on task stages or dynamically based on some other criteria.

Here's an example of a `Solver` that filters the available tools between calls to `generate()`:
Here's an example of a solver agent that filters the available tools between calls to `generate()`:

``` python
@solver
def generate_ctf():
def ctf_agent():
    async def solve(state: TaskState, generate: Generate):

        # first pass w/ core tools
@@ -128,8 +127,6 @@ def generate_ctf():
    return solve
```
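
The middle of that block is elided by the diff view. For reference, here is a minimal self-contained sketch of the same pattern (the specific tools are assumptions: `bash()` for the first pass, with `python()` added for the second):

``` python
from inspect_ai.solver import Generate, TaskState, solver
from inspect_ai.tool import bash, python

@solver
def ctf_agent():
    async def solve(state: TaskState, generate: Generate):
        # first pass w/ core tools
        state.tools = [bash(timeout=180)]
        state = await generate(state)

        # second pass with an additional tool available
        state.tools.append(python(timeout=180))
        state = await generate(state)

        return state

    return solve
```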

In this example we rely on the default `generate()` tool calling behaviour (`"loop"`). However, you can also imagine combining tool filtering with the more tailored tool calling logic described in [Tool Calls](#sec-tool-calls).

### Agents API

For more sophisticated agents, Inspect offers several additional advanced APIs for state management, sub-agents, and fine grained logging. See the [Agents API](agents-api.qmd) article for additional details.
@@ -259,7 +256,7 @@ Finally, here's a task that uses the `wikipedia_search()` solver:
def wikipedia() -> Task:
    return Task(
        dataset=json_dataset("wikipedia.jsonl"),
        plan=wikipedia_search(),
        solver=wikipedia_search(),
        scorer=model_graded_fact(),
    )
```
@@ -327,7 +324,7 @@ dataset = [
def file_probe():
    return Task(
        dataset=dataset,
        plan=[
        solver=[
            use_tools([list_files()]),
            generate()
        ],
2 changes: 1 addition & 1 deletion docs/caching.qmd
@@ -25,7 +25,7 @@ For example, here we are iterating on our self critique template, so we cache th
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[
        solver=[
            chain_of_thought(),
            generate(cache=True),
            self_critique(CRITIQUE_TEMPLATE)
2 changes: 1 addition & 1 deletion docs/datasets.qmd
@@ -243,7 +243,7 @@ dataset=MemoryDataset([
def security_guide():
    return Task(
        dataset=dataset,
        plan=[system_message(SYSTEM_MESSAGE), generate()],
        solver=[system_message(SYSTEM_MESSAGE), generate()],
        scorer=model_graded_fact(),
    )
```
Expand Down
4 changes: 2 additions & 2 deletions docs/errors-and-limits.qmd
@@ -24,7 +24,7 @@ In some cases you might wish to tolerate some number of errors without failing t
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
@@ -65,7 +65,7 @@ In open-ended model conversations (for example, an agent evaluation with tool usa
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
1 change: 1 addition & 0 deletions docs/examples/index.qmd
@@ -14,6 +14,7 @@ aliases:
<nav id="TOC" role="doc-toc">
<ul>
<li><a href="#coding" id="toc-coding">Coding</a></li>
<li><a href="#assistants" id="toc-coding">Assistants</a></li>
<li><a href="#cybersecurity" id="toc-coding">Cybersecurity</a></li>
<li><a href="#mathematics" id="toc-mathematics">Mathematics</a></li>
<li><a href="#reasoning" id="toc-reasoning">Reasoning</a></li>
