updated docs for new --solver option(s) (#487)
* plan -> solver in docs, examples, and tests

* improve solver docs

* improve workflow

---------

Co-authored-by: J.J. Allaire <[email protected]>
Co-authored-by: aisi-inspect <[email protected]>
3 people authored Sep 24, 2024
1 parent 99ff4f1 commit 2289bd9
Showing 79 changed files with 466 additions and 265 deletions.
37 changes: 22 additions & 15 deletions docs/_tools-scaffold.md
@@ -20,23 +20,30 @@ state.messages.append(output.message)
state.messages.extend(await call_tools(output.message, state.tools))
```

This does everything that default `generate()` does, save for an outer loop to continue calling the model as long as it continues calling tools. You could implement the outer loop as follows:
This does everything that default `generate()` does, save for an outer loop to continue calling the model as long as it continues calling tools. This is a complete solver agent that implements the outer loop:

``` python
model = get_model()
while True:
    # call model
    output = await model.generate(state.messages, state.tools)

    # update state
    state.output = output
    state.messages.append(output.message)

    # make tool calls or terminate if there are none
    if output.message.tool_calls:
        state.messages.extend(await call_tools(output.message, state.tools))
    else:
        break
@solver
def agent_loop():
    async def solve(state: TaskState, generate: Generate):
        model = get_model()
        while True:
            # call model
            output = await model.generate(state.messages, state.tools)

            # update state
            state.output = output
            state.messages.append(output.message)

            # make tool calls or terminate if there are none
            if output.message.tool_calls:
                state.messages.extend(await call_tools(output.message, state.tools))
            else:
                break

        return state

    return solve
```
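
For context, a solver like this plugs directly into a task's `solver` parameter. A minimal sketch (the task name, dataset file, and scorer here are illustrative assumptions, not from the docs):

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes

@task
def my_ctf():
    return Task(
        dataset=json_dataset("ctf.json"),  # hypothetical dataset file
        solver=agent_loop(),
        scorer=includes(),
    )
```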

You can imagine several ways you might want to customise this loop:
12 changes: 6 additions & 6 deletions docs/agents-api.qmd
Original file line number Diff line number Diff line change
@@ -11,13 +11,13 @@ This article describes advanced Inspect APIs available for creating evaluations
5. Delegating work to sub-tasks
6. Sandboxing arbitrary code execution

We'll assume that you already understand Inspect [Solvers](solvers.qmd) and [Tools](tools.qmd) (please review those articles as required before proceeding).
We'll assume that you have already covered the basics of [Solvers](solvers.qmd), [Tools](tools.qmd), and [Agents](agents.qmd) (please review those articles as required before proceeding).

## Use of `metadata`

Before proceeding, it's important to point out that some of the features described below were previously approximated by using the `metadata` field of `TaskState`; specifically, `metadata` was often used as a catch-all storage location for:

- Carrying state between solvers and sometimes tools.
- Sharing state between solvers.
- Providing a place to log additional structured data.
- Recording calls to "helper" models used for elicitation or scoring.
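
Inspect's `Store` (one of the APIs this article covers) addresses the first of these uses directly. A minimal sketch, assuming the `store()` accessor from `inspect_ai.util`:

``` python
from inspect_ai.util import store

# read and write state shared across solvers for the current sample
attempts = store().get("attempts", 0)
store().set("attempts", attempts + 1)
```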

@@ -138,7 +138,7 @@ from inspect_ai.log import transcript
transcript().info("here is some custom info")
```

You can pass arbitrary JSON serialisable objects to `info()`.
Strings passed to `info()` will be rendered as markdown. In addition to strings, you can also pass arbitrary JSON serialisable objects to `info()`.
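
For example, a dict is recorded as structured data in the transcript (the keys here are purely illustrative):

``` python
from inspect_ai.log import transcript

# log structured data alongside the sample transcript
transcript().info({"query": "capital of France", "num_results": 3})
```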

### Grouping with Steps

@@ -216,11 +216,11 @@ Note that we don't `await` the subtasks when building up our list of `searches`.
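
A minimal sketch of that pattern, assuming a `search()` subtask and a `queries` list defined elsewhere (both hypothetical here):

``` python
import asyncio

# build up the coroutines without awaiting them...
searches = [search(query) for query in queries]

# ...then await them together so the subtasks run in parallel
results = await asyncio.gather(*searches)
```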

### Forking {#sec-forking}

Inspect's `fork()` function provides a convenient wrapper around a very common use of subtasks: running a `TaskState` against a set of solvers in parallel to explore different trajectories.

For example, let's say you have a solver named `explore()` that takes `temperature` as a parameter. You might want to try the solver out with multiple temperature values and then continue on with the best result:

``` python
from inspect_ai.solver import fork

results = await fork(state, [
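    # NOTE: the diff view truncates this example here; a plausible
    # completion, with illustrative temperature values, might be:
    explore(temperature=0.50),
    explore(temperature=0.75),
    explore(temperature=1.00)
])
```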
@@ -241,7 +241,7 @@ Many agents provide models with the ability to execute arbitrary code. It's impo
def file_probe():
    return Task(
        dataset=dataset,
        plan=[
        solver=[
            use_tools([list_files()]),
            generate()
        ],
25 changes: 11 additions & 14 deletions docs/agents.qmd
@@ -4,6 +4,8 @@

Agents combine planning, memory, and tool usage to pursue more complex, longer horizon tasks (e.g. a [Capture the Flag](https://en.wikipedia.org/wiki/Capture_the_flag_(cybersecurity)) challenge). Agents are an area of active research, and many schemes for implementing them have been developed, including [AutoGPT](https://arxiv.org/abs/2306.02224), [ReAct](https://arxiv.org/abs/2210.03629), and [Reflexion](https://arxiv.org/pdf/2303.11366.pdf).

An agent isn't a special construct within Inspect; it's merely a solver that includes tool use and calls `generate()` internally to interact with the model.

Inspect supports a variety of approaches to agent evaluations, including:

1. Using Inspect's built-in `basic_agent()`.
@@ -12,8 +14,6 @@ Inspect supports a variety of approaches to agent evaluations, including:

3. Adapting an agent provided by a research paper or open source library (for example, using a 3rd party agent library like [LangChain](https://python.langchain.com/docs/modules/agents/) or [Langroid](https://langroid.github.io/langroid/)).

We'll cover the basics of all of these approaches below.

An important additional consideration for agent evaluations is sandboxing (providing a secure environment for models to execute code within). The [Sandbox Environments](#sec-sandbox-environments) section goes into more depth on this.

## Basic Agent {#sec-basic-agent}
@@ -22,17 +22,17 @@ The `basic_agent()` provides a ReAct tool loop with support for retries and encou

1. When developing tasks and datasets it's convenient to have a ready-made agent that you know will competently navigate your task.

2. When developing custom agents, it's a good idea to start out with an idea of how the model performs using its native planning and eliciatation capabilities. The basic agent is a good way to establish this baseline.
2. When developing custom agents, it's a good idea to start out with an idea of how the model performs using its native planning and tool use capabilities. The basic agent is a good way to establish this baseline.

3. It provides a sound basis for comparison of the native agentic capabilities of models both over time and across providers.

The basic agent incorporates best practices for giving models some additional resilience and persistence, both through the optional `max_attempts` parameter, as well as by continuing the task even when the model stops making tool calls. The basic agent can frequently match or exeed custom scaffolds, so you should always try it as a baseline for your tasks!
The basic agent incorporates best practices for giving models some additional resilience and persistence, both through the optional `max_attempts` parameter, as well as by continuing the task even when the model stops making tool calls. The basic agent can frequently match or exceed custom scaffolds, so you should always try it as a baseline for your tasks!

Note that when using the basic agent you should *always* set a `max_messages` so that there is some termination point if the model gets off track or stuck in a loop.

### Example

Here is an example use of `basic_agent()` as the `plan` for a CTF evaluation:
Here is an example use of `basic_agent()` as the `solver` for a CTF evaluation:

``` python
from inspect_ai import Task, task
Expand All @@ -54,7 +54,7 @@ you are going to use and how they fit into your plan. # <1>
def ctf():
    return Task(
        dataset=json_dataset("ctf.json"),
        plan=basic_agent(
        solver=basic_agent(
            init=system_message(SYSTEM_MESSAGE),
            tools=[bash(timeout=180), python(timeout=180)], # <2>
            max_attempts=3, # <3>
@@ -92,22 +92,21 @@ There are several options available for customising the behaviour of the basic a

For multiple attempts, submissions are evaluated using the task's main scorer, with a value of 1.0 indicating a correct answer. Scorer values are converted to float (e.g. "C" becomes 1.0) using the standard `value_to_float()` function. Provide an alternate conversion scheme as required via `score_value`.
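
For reference, the default conversion can be exercised directly (a sketch, assuming `value_to_float` is exported from `inspect_ai.scorer`):

``` python
from inspect_ai.scorer import value_to_float

to_float = value_to_float()
to_float("C")  # 1.0 (correct)
to_float("P")  # 0.5 (partial credit)
to_float("I")  # 0.0 (incorrect)
```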


## Custom Scaffold {#sec-custom-scaffolding}

The basic agent demonstrated above will work well for some tasks, but in other cases you may need to provide more custom logic. For example, you might want to:
The basic agent demonstrated above will work well for some tasks, but in other cases you may want to provide more custom logic. For example, you might want to:

{{< include _tools-scaffold.md >}}

### Tool Filtering

While it's possible to make tools globally available to the model via `use_tools()`, you may also want to filter the available tools either based on task stages or dynamically based on some other criteria.

Here's an example of a `Solver` that filters the available tools between calls to `generate()`:
Here's an example of a solver agent that filters the available tools between calls to `generate()`:

``` python
@solver
def generate_ctf():
def ctf_agent():
    async def solve(state: TaskState, generate: Generate):

        # first pass w/ core tools
@@ -128,8 +127,6 @@ def generate_ctf():
    return solve
```
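
The middle of that block is elided by the diff view. For reference, here is a minimal self-contained sketch of the same pattern (the specific tools are assumptions: `bash()` for the first pass, with `python()` added for the second):

``` python
from inspect_ai.solver import Generate, TaskState, solver
from inspect_ai.tool import bash, python

@solver
def ctf_agent():
    async def solve(state: TaskState, generate: Generate):
        # first pass w/ core tools
        state.tools = [bash(timeout=180)]
        state = await generate(state)

        # second pass with an additional tool available
        state.tools.append(python(timeout=180))
        state = await generate(state)

        return state

    return solve
```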

In this example we rely on the default `generate()` tool calling behaviour (`"loop"`). However, you can also imagine combining tool filtering with the more tailored tool calling logic described in [Tool Calls](#sec-tool-calls).

### Agents API

For more sophisticated agents, Inspect offers several additional advanced APIs for state management, sub-agents, and fine grained logging. See the [Agents API](agents-api.qmd) article for additional details.
@@ -259,7 +256,7 @@ Finally, here's a task that uses the `wikipedia_search()` solver:
def wikipedia() -> Task:
    return Task(
        dataset=json_dataset("wikipedia.jsonl"),
        plan=wikipedia_search(),
        solver=wikipedia_search(),
        scorer=model_graded_fact(),
    )
```
@@ -327,7 +324,7 @@ dataset = [
def file_probe():
    return Task(
        dataset=dataset,
        plan=[
        solver=[
            use_tools([list_files()]),
            generate()
        ],
2 changes: 1 addition & 1 deletion docs/caching.qmd
@@ -25,7 +25,7 @@ For example, here we are iterating on our self critique template, so we cache th
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[
        solver=[
            chain_of_thought(),
            generate(cache=True),
            self_critique(CRITIQUE_TEMPLATE)
2 changes: 1 addition & 1 deletion docs/datasets.qmd
@@ -243,7 +243,7 @@ dataset=MemoryDataset([
def security_guide():
    return Task(
        dataset=dataset,
        plan=[system_message(SYSTEM_MESSAGE), generate()],
        solver=[system_message(SYSTEM_MESSAGE), generate()],
        scorer=model_graded_fact(),
    )
```
Expand Down
4 changes: 2 additions & 2 deletions docs/errors-and-limits.qmd
@@ -24,7 +24,7 @@ In some cases you might wish to tolerate some number of errors without failing t
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
@@ -65,7 +65,7 @@ In open-ended model conversations (for example, an agent evaluation with tool usa
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
1 change: 1 addition & 0 deletions docs/examples/index.qmd
@@ -14,6 +14,7 @@ aliases:
<nav id="TOC" role="doc-toc">
<ul>
<li><a href="#coding" id="toc-coding">Coding</a></li>
<li><a href="#assistants" id="toc-coding">Assistants</a></li>
<li><a href="#cybersecurity" id="toc-coding">Cybersecurity</a></li>
<li><a href="#mathematics" id="toc-mathematics">Mathematics</a></li>
<li><a href="#reasoning" id="toc-reasoning">Reasoning</a></li>
