Low Level Solver API #82

jjallaire · 2024-07-16T00:08:56Z

Overview

This PR implements several related new features aimed at improving the development and observability of complex agent evals. Taken together, these capabilities constitute a "low-level" Solver API that affords full control over execution and model interactions. This is intended to complement the "high-level" Solver API, which assumes a more linear model interaction history and relies on the default generate() function and tool use loop.

Our goal is to identify the lower-level capabilities common to agent evaluations and provide an Inspect-native implementation of them. The hope is that this will both enable "bare metal" agent development (using just Inspect with no extra libraries) and free higher-level agent abstractions built on Inspect from many lower-level concerns (e.g. state tracking, observability, managing sub-agents, etc.). Ideally, core agents could be implemented using Inspect primitives only and then be re-used both in bare-metal evals and in evals that use additional abstractions from a dedicated agent library.

This PR is currently a draft pending implementation of log viewer features to visualise transcripts / state changes / events / etc.

Tool Use

Previously, tool use in Inspect required the use of the higher-level generate() function passed to solvers. We've now refactored things so that it's easy to code the equivalent of the generate() function even when using lower-level direct interactions with LLMs. For example:

model = get_model()
output = await model.generate(messages, tools)
messages.append(output.message)
messages.extend(call_tools(output.message, tools))

This does everything that default generate() does (save for the outer loop). You could implement the outer loop as follows:

model = get_model()
while True:
    output = await model.generate(messages, tools)
    messages.append(output.message)
    if output.message.tool_calls:
        messages.extend(call_tools(output.message, tools))
    else:
        break
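
To try out the control flow of this loop without a model API, here is a minimal sketch that uses stand-ins. StubModel, StubOutput, StubMessage, stub_call_tools and the max_turns bound are all hypothetical names introduced for illustration; none of them are part of Inspect. The only change relative to the loop above is an upper bound on the number of model calls, which may be worth having if a model keeps requesting tools indefinitely.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StubMessage:
    # Hypothetical stand-in for an assistant message.
    content: str
    tool_calls: list[str] = field(default_factory=list)


@dataclass
class StubOutput:
    # Hypothetical stand-in for model output (mirrors output.message above).
    message: StubMessage


class StubModel:
    """Hypothetical model that requests a tool twice, then answers."""

    def __init__(self) -> None:
        self.calls = 0

    async def generate(self, messages: list, tools: list) -> StubOutput:
        self.calls += 1
        if self.calls <= 2:
            return StubOutput(StubMessage("calling tool", tool_calls=["lookup"]))
        return StubOutput(StubMessage("final answer"))


def stub_call_tools(message: StubMessage) -> list[str]:
    # Stand-in for call_tools(): one result entry per requested call.
    return [f"result of {name}" for name in message.tool_calls]


async def generate_loop(model: StubModel, messages: list, max_turns: int = 10) -> list:
    # Same shape as the loop above, but bounded by max_turns model calls.
    for _ in range(max_turns):
        output = await model.generate(messages, tools=[])
        messages.append(output.message)
        if not output.message.tool_calls:
            break
        messages.extend(stub_call_tools(output.message))
    return messages


history = asyncio.run(generate_loop(StubModel(), ["user: question"]))
```

With the stub above, the history ends up holding the user message, two tool-requesting messages with their results, and the final answer.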

One additional change made to facilitate implementing the equivalent of generate() is that when calling get_model() with no parameters to get the "active" model, the model instance you get back will use the evaluation's default GenerateConfig (e.g. temperature, top_p, etc.). Note that previously you needed to call the higher level generate() passed to Solvers to have this configuration applied.

Store

The Store is intended as a replacement for reading/writing to state.metadata. It includes better facilities for default values and has a clean mechanism for use within tools. The Store is also designed to play well with Transcripts and Subtasks (both described below). The core of the Store interface is:

from inspect_ai.solver import Store

class Store:
    def get(self, key: str, default: VT) -> VT: ...
    def set(self, key: str, value: Any) -> None: ...
    def delete(self, key: str) -> None: ...

Basic views on the store's collection (e.g. items(), keys(), values()) are also provided. Note that the get() method will automatically add the default to the store if the key doesn't already exist.
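
The insert-on-get behaviour matters most for mutable defaults. The following is a rough sketch of those semantics using a hypothetical SketchStore class backed by a plain dict. It is not the implementation in store.py, and the choice to ignore deletes of missing keys is an assumption made here for brevity.

```python
from typing import Any, TypeVar

VT = TypeVar("VT")


class SketchStore:
    """Hypothetical stand-in mirroring the get/set/delete interface above."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def get(self, key: str, default: VT) -> VT:
        # Insert the default on first access so later reads see the same object.
        if key not in self._data:
            self._data[key] = default
        return self._data[key]

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value

    def delete(self, key: str) -> None:
        # Assumption: deleting a missing key is silently ignored.
        self._data.pop(key, None)

    def keys(self):
        return self._data.keys()


sketch = SketchStore()
history = sketch.get("history", [])     # "history" is now present in the store
history.append("https://example.com")   # mutates the stored list in place
```

Because the default list is inserted on first access, the append is visible to any later get("history", []) call without an explicit set().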

The Store can be accessed via TaskState as follows:

history = state.store.get("history", [])

It is also possible to access the Store for the current sample using the store() function. This is the mechanism for tools to read and write the Store. For example:

from inspect_ai.solver import store
from inspect_ai.tool import tool

@tool
def web_browser_back():
    async def execute() -> str:
        # read the navigation history from the current sample's Store
        history = store().get("web_browser:history", [])
        return history.pop()

    return execute

While there is no formal namespacing mechanism for the Store, this can be informally achieved using key prefixes as demonstrated above.

See store.py for the full implementation of Store.

Transcripts

The current Inspect logs record only the final state of the message history, which, for complex evals that rewrite the message history, tells you very little about "what happened" during the execution of the sample. Transcripts are intended to provide a rich per-sample view of everything that occurred. A transcript consists of a list of events. The core events recorded are:

Event	Description
`StateEvent`	Records changes (in JSON Patch format) to `TaskState` that occur within a `Solver`.
`ModelEvent`	Records LLM calls (including input message, tools, config, and model output).
`LoggerEvent`	Records calls to the Python logger (e.g. `logger.warning("warn")`).
`StepEvent`	Delineates "steps" in an eval. By default a step is created for each `Solver`, but you can also add your own steps.

The plan is to replace the default "Messages" view in the Inspect log viewer with a "Transcript" view that provides much better visibility into what actually occurred during execution. In addition to the core events above, there are InfoEvent, StoreEvent, and SubtaskEvent, which are described below along with more detail on creating your own steps.
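
For readers unfamiliar with JSON Patch (RFC 6902), the format used to record state changes, here is a small sketch of what such a change record looks like. flat_json_patch is a hypothetical helper that only compares top-level keys of two dicts. It is not how Inspect computes its patches, which presumably also cover nested changes.

```python
from typing import Any


def _pointer(key: str) -> str:
    # JSON Pointer escaping: "~" becomes "~0" and "/" becomes "~1".
    return "/" + key.replace("~", "~0").replace("/", "~1")


def flat_json_patch(before: dict[str, Any], after: dict[str, Any]) -> list[dict[str, Any]]:
    """Build RFC 6902-style operations for top-level keys only."""
    ops: list[dict[str, Any]] = []
    for key in before:
        if key not in after:
            ops.append({"op": "remove", "path": _pointer(key)})
        elif before[key] != after[key]:
            ops.append({"op": "replace", "path": _pointer(key), "value": after[key]})
    for key in after:
        if key not in before:
            ops.append({"op": "add", "path": _pointer(key), "value": after[key]})
    return ops


patch = flat_json_patch(
    {"completed": False, "scratch": "tmp"},
    {"completed": True, "next-action": "search"},
)
```

Here patch holds three operations: a replace for /completed, a remove for /scratch, and an add for /next-action.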

Custom Info

You can insert custom entries into the transcript via the Transcript info() method (which creates an InfoEvent). Access the transcript for the current sample using the transcript() function, for example:

from inspect_ai.solver import transcript

transcript().info("here is some custom info")

You can pass arbitrary JSON serialisable objects to info() (we will likely want to create a mechanism for rich display of this data within the log viewer).

Grouping with Steps

You can create arbitrary groupings of transcript activity using the Transcript step() context manager. For example:

with transcript().step("reasoning"):
    ...
    state.store.set("next-action", next_action)

There are two reasons that you might want to create steps:

Any changes to the store which occur during a step will be collected into a StoreEvent that records the changes (in JSON Patch format) that occurred.
The Inspect log viewer will create a visual delineation for the step, which will make it easier to see the flow of activity within the transcript.
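
To make the first point concrete, a step can be thought of as a context manager that snapshots the store on entry and compares it on exit. The sketch below is only an illustration of that idea: sketch_step, events and sample_store are hypothetical names, the event dictionaries are invented for this example, and only the names of changed keys are recorded rather than a JSON Patch.

```python
from contextlib import contextmanager
from copy import deepcopy
from typing import Any, Iterator

events: list[dict[str, Any]] = []    # hypothetical transcript event list
sample_store: dict[str, Any] = {}    # hypothetical store contents
_MISSING = object()                  # marks keys absent on one side


@contextmanager
def sketch_step(name: str) -> Iterator[None]:
    # Snapshot the store so changes made inside the step can be detected.
    before = deepcopy(sample_store)
    events.append({"event": "step", "action": "begin", "name": name})
    try:
        yield
    finally:
        changed = sorted(
            key
            for key in sample_store.keys() | before.keys()
            if sample_store.get(key, _MISSING) != before.get(key, _MISSING)
        )
        if changed:
            events.append({"event": "store", "changed_keys": changed})
        events.append({"event": "step", "action": "end", "name": name})


with sketch_step("reasoning"):
    sample_store["next-action"] = "search"
```

After the with block, events contains a step begin entry, one store entry listing "next-action", and a step end entry.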

See transcript.py for the full implementation of Transcript.

Subtasks

Subtasks provide a mechanism for creating isolated, re-usable units of execution. You might implement a complex tool using a subtask or might use them in a multi-agent evaluation. The main characteristics of subtasks are:

They run in their own async coroutine.
They have their own isolated Transcript.
They have their own isolated Store (no access to the sample Store).

To create a subtask, declare an async function with the @subtask decorator. The function can take any arguments and return a value of any type. For example:

from inspect_ai.solver import store, subtask

@subtask
async def web_search(keywords: str) -> str:
    # get links for these keywords
    links = await search_links(keywords)

    # add links to the store so they end up in the transcript
    store().set("links", links)

    # summarise the links
    return await fetch_and_summarise(links)

Note that we add links to the store not because we strictly need to for our implementation, but because we want the links to be recorded as part of the transcript.

Call the subtask as you would any async function:

summary = await web_search(keywords="solar power")

A few things will occur automatically when you run a subtask:

New isolated Store and Transcript objects will be created for the subtask (accessible via the store() and transcript() functions). Changes to the Store that occur during execution will be recorded in a StoreEvent.
A SubtaskEvent will be added to the current transcript. The event will include the name of the subtask, its input and results, and a transcript of all events that occur within the subtask.
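
One way this kind of per-subtask isolation could be achieved in asyncio code is with contextvars, since each task runs in a copy of its creator's context. The sketch below illustrates that general technique only. It is an assumption for illustration rather than a description of subtask.py, and sketch_store, run_isolated and fake_search are hypothetical names.

```python
import asyncio
from contextvars import ContextVar
from typing import Any

_current_store: ContextVar[dict[str, Any]] = ContextVar("current_store")


def sketch_store() -> dict[str, Any]:
    # Return whichever store is active in the calling context.
    return _current_store.get()


async def run_isolated(coro_fn, *args):
    """Run coro_fn in its own task with a fresh store; return (result, store)."""

    async def runner():
        fresh: dict[str, Any] = {}
        _current_store.set(fresh)  # only visible inside this task's context
        result = await coro_fn(*args)
        return result, fresh

    return await asyncio.create_task(runner())


async def fake_search(keywords: str) -> str:
    # Writes go to the isolated store, not the caller's store.
    sketch_store()["links"] = [f"https://example.com/?q={keywords}"]
    return f"summary for {keywords}"


async def main():
    _current_store.set({"owner": "sample"})
    result, sub_store = await run_isolated(fake_search, "solar")
    return result, sub_store, sketch_store()


result, sub_store, outer_store = asyncio.run(main())
```

The caller's store is untouched by the write inside fake_search, while the fresh store is handed back so it could be attached to an event.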

You can also include one or more steps within a subtask.

Tool Subtasks

If a tool is entirely independent of other tools (i.e. it is not coordinating with other tools via changes to the store) you can make the tool itself a subtask. This would be beneficial if the tool had interactions (e.g. model calls, store changes, etc.) you want to record in a distinct transcript.

To enclose a tool in a subtask, just add the @subtask decorator to the tool's execute function. For example, let's adapt the web_search() function above into a tool that also defines a subtask (note we also add a model parameter to customise what model is used for summarisation):

from inspect_ai.model import Model, get_model
from inspect_ai.solver import store, subtask
from inspect_ai.tool import tool

@tool(prompt="Use the web search tool to find information online")
def web_search(model: str | Model | None = None):

    # resolve model used for summarisation
    model = get_model(model)

    # execute must be async because its body awaits other coroutines
    @subtask(name="web_search")
    async def execute(keywords: str) -> str:
        # get links for these keywords
        links = await search_links(keywords)

        # add links to the store so they end up in the transcript
        store().set("links", links)

        # summarise the links
        return await fetch_and_summarise(links, model)

    return execute

Note that we've added the name="web_search" argument to @subtask. This is because we've declared our function using the generic name execute() and want to provide an alternative that is more readily understandable.

See subtask.py for the full implementation of the @subtask decorator.

dragonstyle

This is an epic PR

* initial work on subtasks * update readme * remove readme (it's now in the PR writeup) * skip tool store test if no openai * fix typo * correct message reading * correct type * Proof of concept for JSDoc Styles Baseline configuration with baseline implementation in a few places * Use yarn to manage preact / htm This allows the types to flow from package.json * Fully type util as proof of concept * Cleanup in utils * Proof of concept using types from log.d.ts * Rough in of transcript dump * Try to reign in type checking for prism here * update to latest prettier * Conditionally show transcript tab * Another solver rendering * Move transcript views * Including TZ information in timestamp * Tweaked step implementation * Add tools to model dump * Revise model event view (still WIP) * More structured transcript * A little more air * A little more tweaking * fix prettier complaint * trying updating yarn.lock * Attempt to force resolution * Remove `json-schema-to-typescript` This is causing a failure in resolving dependencies which prevents yarn install from working. 
* Improved state change view * Further fine tuning of appearance * Ensure store support too * More standard appearance * Fix content size * Improve appearance * Properly render objects and arrays in state changes * Improve grid appearance * Remove unused imports * Correct subtask inline display * Simplify state change event rendering * Fix prettier * Share event layout * Improve logger event view * add ScoreEvent * remove unused var * track state changes in transcript steps * remove subtask stuff from web_search for now * Improve state changes rendering * Remove logger event title for more compactness (also includes improvements to the transcript event view itself) * Add a scorer event handler * Improve subtask rendering * fix heading cursor * Improve Score Event View * merge from main * turn event into a base class * don't export event types * regen schema * fixup imports * revert event type changes * write schema to types dir * transcript events: no export and standard 'data' field * regen schema * fix transcript test * use pydantic 2.0 field_serialiser * Revert "transcript events: no export and standard 'data' field" This reverts commit 5f2b654. * use pydantic 2.0 field_serialiser * don't export events * remove unused import * log the full logging message * rename log method * drive transcript events into recorder * write through to trsc file * cleaner interface for transcript event forwarding * initial write to sqlite * Standardize font-size on rem * decorate the html tag for the logview so it can detect vscode immediately * Improve column allocation * Create Shared Fonts Object Move all things to a shared notion of fonts and styles that can be re-used easily. Use font scaling in vscode to achieve the correct appearance (now that we’re rem based we can just change the base font size). 
* Move summary stats into navbar * Restructure navbar into workspace * Improve progress bar appearance * Improve column sizing * Refactor tab appearance into navbar * Adjust correct/incorrect appearance * Baseline pill improvements * fix heading height in vscode * correct sidebar * Improve sidebar appearance (+prettier) * widen sidebar slightly * Sample Display Tweaks * Tweaks to config * initial work on content db * more comprehensive event walking * de-duplicate content in transcript events * Remove timestamps, correct prop name * Baseline implementation of evalevents resolving plus some prettier formatting * Correct content resolution * remove logging section of evallog (now covered in sample transcript) * Improve density when hosted in vscode at narrow short sizes * Revised appearance to grouped cards * formatting * A little more tweakage * generate_loop and tool_call steps * Fix lint issues * no srsly fix lint * resolve circular import * run prettier on event panel * Fix error w/specific logs * update test * Improve find band appearance * sample init event * Proof of concept state rendering * Relocate state code since it will grow * correct resolution of objects * lint and formatting * sample_init event * Add collapsible state to event panel, collapse certain panels * Subtask rough in * ensure we have vite * Correct merge + regen dist * add a watch command * correct formatting * correct build (investigating why my local build wasn’t flagging this) * include source maps * Add Sample Init, track state across transcript * fix lint * update dist * ensure nav-pills align bottom * correct lint * Add chat view to model * prettier * ran commands in wrong order * Improve sample init event (still mostly a dump of data) * Add all messages view * Simplify transcript view * Improvements to display * Chatview should show tool call even if no message * Improve state event display * Display choices in sampleinit event * Improve Score Event appearance * Tweak title results * 
More appearance tweakage * Improve tab appearance * Fix tab selection issue in subtask transcripts * Improved spacing * Fix scoring metadata layout * toolcall event * initial work on raw request/response for model event * Add placeholder tool event * initial work on raw model calls * log raw tool argument not converted * log raw model call for anthropic * format attribs notebook * raw model request for mistral * Add depth to cards (with basic impl) * remove map * ignore source map * Add baseline model raw view * Improve state appearance * Improve log display * fix formatting * properly default to messages if no transcript * add one last debug message * Disable build checking with note * Appearance refinement - only start indenting at second level step - create section component * raw capture for google * Don’t capture args when logging This is doing a lot of work which shouldn’t be happening in a log handler (and the value of the args is suspect anyhow). Causing an exception in certain environments. 
* Remove disused imports * record raw api calls for groq * Improve root solver display - break up root cards - add sample init step (synthetic) * raw api call recording * raw model access for cloudflare * raw model output for azureai * Improve subtask display * raw model capture for vertex * eliminate qualifying note on tool descriptions * improve setup display * Add ToolView * improved agents api docs * Tweaks * eliminate tool steps * hide agents api for now * agents api docs * Resolve the model call contents * Special case handling for sample init event title (no dupe title) * Improve logging event appearance * more tool docs * rename to agents api * remove bash prompt * Correct transcript icons * improve tab selection behavior * Improved model display * Correct font size in metadatagrid * initial work on tool transcript * more tool event work * schema updates * Refactor content resolution to the very top level (move it out of transcript view - it expects to receive events with resolve content) * Resolve the whole state object for events * remove generate_loop step type * Fix ruff errors * don’t force wrap in weird ways * Correct tool rendering for state events * Baseline visual diff implementation * Move tools to single transcript tab * Improve tool event output rendering * Don’t intend tool input / output * enable docs + add parallel execution * Fix prism coloring for js, python, json * show no output message if there is not tool output * allow event titles to wrap * Improve wrapping at small sizes (model event) * crossref to agents api article --------- Co-authored-by: aisi-inspect <[email protected]> Co-authored-by: Charles Teague <[email protected]> Co-authored-by: jjallaire-aisi <[email protected]>

aisi-inspect and others added 30 commits July 15, 2024 23:59

initial work on subtasks

f179d91

update readme

f60f41b

remove readme (it's now in the PR writeup)

941fcfd

skip tool store test if no openai

fe25ddd

fix typo

43dc97b

correct message reading

3b5abad

correct type

2070839

Proof of concept for JSDoc Styles

3fb1e9e

Baseline configuration with baseline implementation in a few places

Use yarn to manage preact / htm

715cb81

This allows the types to flow from package.json

Fully type util as proof of concept

74722bd

Cleanup in utils

6dc6efd

Proof of concept using types from log.d.ts

3564d0e

Rough in of transcript dump

50518c7

Try to reign in type checking for prism here

1b53e7d

update to latest prettier

1cc9d99

Conditionally show transcript tab

a0d0ab7

Another solver rendering

35010e7

Move transcript views

a2ab8f2

Including TZ information in timestamp

e58c418

Tweaked step implementation

96feb1e

Add tools to model dump

f501d6b

Revise model event view (still WIP)

3a37bb7

More structured transcript

d535f31

A little more air

cbe7bec

A little more tweaking

940b39b

fix prettier complaint

2ad71de

trying updating yarn.lock

7d46cdc

Merge branch 'main' into feature/subtasks

f316d3c

Attempt to force resolution

7d14184

Remove json-schema-to-typescript

a9d86dd

This is causing a failure in resolving dependencies which prevents yarn install from working.

dragonstyle and others added 20 commits August 16, 2024 21:03

Correct font size in metadatagrid

5766e3d

initial work on tool transcript

ddd7236

more tool event work

1365fcc

schema updates

5e1558a

Merge remote-tracking branch 'origin/main' into feature/subtasks

0dd796b

Refactor content resolution to the very top level

d07bf87

(move it out of transcript view - it expects to receive events with resolve content)

Resolve the whole state object for events

d30f170

remove generate_loop step type

f5d5166

Merge pull request #255 from UKGovernmentBEIS/feature/tool-call-trans…

847ac78

…cript track changes/transcript for tool calls

Fix ruff errors

34688b1

don’t force wrap in weird ways

1f37ce6

Correct tool rendering for state events

f5124e5

Baseline visual diff implementation

3fcaa9f

Move tools to single transcript tab

effe5cb

Improve tool event output rendering

d5f2f41

Don’t intend tool input / output

8f4af91

enable docs + add parallel execution

35fb5de

Merge branch 'feature/subtasks' of github.com:UKGovernmentBEIS/inspec…

401047f

…t_ai into feature/subtasks

Fix prism coloring for js, python, json

f4e69e3

Merge branch 'main' into feature/subtasks

3db949f

jjallaire-aisi requested a review from dragonstyle August 20, 2024 12:40

jjallaire-aisi marked this pull request as ready for review August 20, 2024 12:40

dragonstyle added 3 commits August 20, 2024 08:49

show no output message if there is not tool output

203b2c0

allow event titles to wrap

c3baf35

Improve wrapping at small sizes (model event)

99987de

dragonstyle approved these changes Aug 20, 2024

crossref to agents api article

b9741bd

jjallaire-aisi merged commit f7e1249 into main Aug 20, 2024
9 checks passed

jjallaire-aisi deleted the feature/subtasks branch August 20, 2024 13:36
