Low Level Solver API (#82)
* initial work on subtasks

* update readme

* remove readme (it's now in the PR writeup)

* skip tool store test if no openai

* fix typo

* correct message reading

* correct type

* Proof of concept for JSDoc Styles

Baseline configuration with baseline implementation in a few places

* Use yarn to manage preact / htm

This allows the types to flow from package.json

* Fully type util as proof of concept

* Cleanup in utils

* Proof of concept using types from log.d.ts

* Rough in of transcript dump

* Try to rein in type checking for prism here

* update to latest prettier

* Conditionally show transcript tab

* Another solver rendering

* Move transcript views

* Including TZ information in timestamp

* Tweaked step implementation

* Add tools to model dump

* Revise model event view (still WIP)

* More structured transcript

* A little more air

* A little more tweaking

* fix prettier complaint

* trying updating yarn.lock

* Attempt to force resolution

* Remove `json-schema-to-typescript`

This is causing a failure in resolving dependencies which prevents yarn install from working.

* Improved state change view

* Further fine tuning of appearance

* Ensure store support too

* More standard appearance

* Fix content size

* Improve appearance

* Properly render objects and arrays in state changes

* Improve grid appearance

* Remove unused imports

* Correct subtask inline display

* Simplify state change event rendering

* Fix prettier

* Share event layout

* Improve logger event view

* add ScoreEvent

* remove unused var

* track state changes in transcript steps

* remove subtask stuff from web_search for now

* Improve state changes rendering

* Remove logger event title for more compactness

(also includes improvements to the transcript event view itself)

* Add a scorer event handler

* Improve subtask rendering

* fix heading cursor

* Improve Score Event View

* merge from main

* turn event into a base class

* don't export event types

* regen schema

* fixup imports

* revert event type changes

* write schema to types dir

* transcript events: no export and standard 'data' field

* regen schema

* fix transcript test

* use pydantic 2.0 field_serialiser

* Revert "transcript events: no export and standard 'data' field"

This reverts commit 5f2b654.

* use pydantic 2.0 field_serialiser

* don't export events

* remove unused import

* log the full logging message

* rename log method

* drive transcript events into recorder

* write through to trsc file

* cleaner interface for transcript event forwarding

* initial write to sqlite

* Standardize font-size on rem

* decorate the html tag for the logview so it can detect vscode immediately

* Improve column allocation

* Create Shared Fonts Object

Move all things to a shared notion of fonts and styles that can be re-used easily.

Use font scaling in vscode to achieve the correct appearance (now that we’re rem based we can just change the base font size).

* Move summary stats into navbar

* Restructure navbar into workspace

* Improve progress bar appearance

* Improve column sizing

* Refactor tab appearance into navbar

* Adjust correct/incorrect appearance

* Baseline pill improvements

* fix heading height in vscode

* correct sidebar

* Improve sidebar appearance (+prettier)

* widen sidebar slightly

* Sample Display Tweaks

* Tweaks to config

* initial work on content db

* more comprehensive event walking

* de-duplicate content in transcript events

* Remove timestamps, correct prop name

* Baseline implementation of evalevents resolving

plus some prettier formatting

* Correct content resolution

* remove logging section of evallog (now covered in sample transcript)

* Improve density when hosted in vscode at narrow short sizes

* Revised appearance to grouped cards

* formatting

* A little more tweakage

* generate_loop and tool_call steps

* Fix lint issues

* no srsly fix lint

* resolve circular import

* run prettier on event panel

* Fix error w/specific logs

* update test

* Improve find band appearance

* sample init event

* Proof of concept state rendering

* Relocate state code since it will grow

* correct resolution of objects

* lint and formatting

* sample_init event

* Add collapsible state to event panel, collapse certain panels

* Subtask rough in

* ensure we have vite

* Correct merge + regen dist

* add a watch command

* correct formatting

* correct build

(investigating why my local build wasn’t flagging this)

* include source maps

* Add Sample Init, track state across transcript

* fix lint

* update dist

* ensure nav-pills align bottom

* correct lint

* Add chat view to model

* prettier

* ran commands in wrong order

* Improve sample init event

(still mostly a dump of data)

* Add all messages view

* Simplify transcript view

* Improvements to display

* Chatview should show tool call even if no message

* Improve state event display

* Display choices in sampleinit event

* Improve Score Event appearance

* Tweak title results

* More appearance tweakage

* Improve tab appearance

* Fix tab selection issue in subtask transcripts

* Improved spacing

* Fix scoring metadata layout

* toolcall event

* initial work on raw request/response for model event

* Add placeholder tool event

* initial work on raw model calls

* log raw tool argument not converted

* log raw model call for anthropic

* format attribs notebook

* raw model request for mistral

* Add depth to cards (with basic impl)

* remove map

* ignore source map

* Add baseline model raw view

* Improve state appearance

* Improve log display

* fix formatting

* properly default to messages if no transcript

* add one last debug message

* Disable build checking with note

* Appearance refinement

- only start indenting at second level step
- create section component

* raw capture for google

* Don’t capture args when logging

This is doing a lot of work which shouldn’t be happening in a log handler (and the value of the args is suspect anyhow). Causing an exception in certain environments.

* Remove disused imports

* record raw api calls for groq

* Improve root solver display

- break up root cards
- add sample init step (synthetic)

* raw api call recording

* raw model access for cloudflare

* raw model output for azureai

* Improve subtask display

* raw model capture for vertex

* eliminate qualifying note on tool descriptions

* improve setup display

* Add ToolView

* improved agents api docs

* Tweaks

* eliminate tool steps

* hide agents api for now

* agents api docs

* Resolve the model call contents

* Special case handling for sample init event title (no dupe title)

* Improve logging event appearance

* more tool docs

* rename to agents api

* remove bash prompt

* Correct transcript icons

* improve tab selection behavior

* Improved model display

* Correct font size in metadatagrid

* initial work on tool transcript

* more tool event work

* schema updates

* Refactor content resolution to the very top level

(move it out of transcript view -  it expects to receive events with resolve content)

* Resolve the whole state object for events

* remove generate_loop step type

* Fix ruff errors

* don’t force wrap in weird ways

* Correct tool rendering for state events

* Baseline visual diff implementation

* Move tools to single transcript tab

* Improve tool event output rendering

* Don’t indent tool input / output

* enable docs + add parallel execution

* Fix prism coloring for js, python, json

* show no output message if there is not tool output

* allow event titles to wrap

* Improve wrapping at small sizes (model event)

* crossref to agents api article

---------

Co-authored-by: aisi-inspect <[email protected]>
Co-authored-by: Charles Teague <[email protected]>
Co-authored-by: jjallaire-aisi <[email protected]>
4 people authored Aug 20, 2024
1 parent 8a0ede5 commit f7e1249
Showing 133 changed files with 19,099 additions and 10,041 deletions.
49 changes: 26 additions & 23 deletions .github/workflows/log_viewer.yml
@@ -43,28 +43,31 @@ jobs:
       - name: Run eslint
         run: yarn eslint
 
-  build:
-    runs-on: ubuntu-latest
-    defaults:
-      run:
-        working-directory: src/inspect_ai/_view/www
-    steps:
-      - uses: actions/checkout@v4
-      - name: Set up Node.js
-        uses: actions/setup-node@v4
-        with:
-          node-version: "22.x"
-      - name: Install dependencies
-        run: yarn install
+  # TODO: This is failing even with a freshly generated build.js file
+  # Need to debug or better understand the cause
+  # build:
+  #   runs-on: ubuntu-latest
+  #   defaults:
+  #     run:
+  #       working-directory: src/inspect_ai/_view/www
+  #   steps:
+  #     - uses: actions/checkout@v4
+  #     - name: Set up Node.js
+  #       uses: actions/setup-node@v4
+  #       with:
+  #         node-version: "22.x"
+  #     - name: Install dependencies
+  #       run: yarn install
 
-      - name: Build log viewer
-        run: yarn build
+  #     - name: Build log viewer
+  #       run: yarn build
 
-      - name: Ensure dist changes are checked in
-        run: |
-          if [[ $(git status --porcelain) != "" ]]
-          then
-            echo "Log viewer dist files have not been updated, please run yarn build and check in the changes."
-            git status
-            exit 1
-          fi
+  #     - name: Ensure dist changes are checked in
+  #       run: |
+  #         if [[ $(git status --porcelain) != "" ]]
+  #         then
+  #           echo "Log viewer dist files have not been updated, please run yarn build and check in the changes."
+  #           git status
+  #           git diff dist/assets/index.js
+  #           exit 1
+  #         fi
2 changes: 1 addition & 1 deletion benchmarks/README.md
@@ -1,6 +1,6 @@
 ## Benchmarks
 
-This directory contains evals for several benchmarks. Datasets for evals are not embedded in the repository but are rather ether downloaded either directly from their source URL or via Hugging Face datasets. To use Hugging Face datasets please install the datasets package with `pip install datasets`.
+This directory contains evals for several benchmarks. Datasets for evals are not embedded in the repository but are rather either downloaded either directly from their source URL or via Hugging Face datasets. To use Hugging Face datasets please install the datasets package with `pip install datasets`.
 
 | Benchmark | Reference | Code | Dataset |
 |-------------------------|----------------|---------------:|----------------|
2 changes: 2 additions & 0 deletions docs/_quarto.yml
@@ -71,10 +71,12 @@ book:
chapters:
- caching.qmd
- parallelism.qmd
- agents-api.qmd
- eval-logs.qmd
- eval-suites.qmd
- extensions.qmd


toc-depth: 2
number-sections: true
number-depth: 2
307 changes: 307 additions & 0 deletions docs/agents-api.qmd
@@ -0,0 +1,307 @@
---
title: "Agents API"
format: html
---

::: callout-note
The Agents API described in this article is currently available only in the development version of Inspect. You can install the development version with:

``` {.bash .code-overflow-wrap}
pip install git+https://github.com/ukgovernmentbeis/inspect_ai
```
:::

## Overview

This article describes advanced Inspect APIs available for creating evaluations with agents. You can also build agent evals using Inspect's default ReAct tool use loop or by bridging to an external agent library (see the main [Agents](agents.qmd) article for further details). Topics covered in this article include:

1. Sharing state across solvers and tools
2. Creating a custom tool use loop
3. Dynamically customising tool descriptions
4. Observability with sample transcripts
5. Delegating work to sub-tasks
6. Sandboxing arbitrary code execution

We'll assume that you already understand Inspect [Solvers](solvers.qmd) and [Tools](tools.qmd) (please review those articles as required before proceeding).

## Use of `metadata`

Before proceeding, it's important to point out that some of the features described below were previously approximated by using the `metadata` field of `TaskState`. Specifically, `metadata` was often used as a catch-all storage location for:

- Carrying state between solvers and sometimes tools.
- Providing a place to log additional structured data.
- Recording calls to "helper" models used for elicitation or scoring.

The `metadata` field no longer needs to be used for these scenarios (and in fact should now be treated as a read-only part of the `TaskState`). Below we'll describe how the `Store` can be used for state, how structured data can be logged to the sample `Transcript`, and how all model calls are now automatically recorded and included in the transcript.

## Sharing State

Sequences of solvers often need to store and manipulate shared state. Further, tools often want their own persistent state (or groups of tools may want to share state). This can be accomplished in Inspect using the `Store`, which provides a scoped scratchpad for arbitrary values.

The core of the `Store` interface is:

``` python
from inspect_ai.solver import Store

class Store:
    def get(self, key: str, default: VT) -> VT
    def set(self, key: str, value: Any) -> None
    def delete(self, key: str) -> None
```

Basic views on the store's collection (e.g. `items()`, `keys()`, `values()`) are also provided. Note that the `get()` method will automatically add the `default` to the store if it doesn't exist.

The `Store` can be accessed via `TaskState` as follows:

``` python
history = state.store.get("history", [])
```
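
For example, a solver might accumulate a running history across turns. The following is a minimal sketch (the `"history"` key and the `record_history()` solver are illustrative, not part of the API):

``` python
from inspect_ai.solver import Generate, TaskState, solver

@solver
def record_history():
    async def solve(state: TaskState, generate: Generate):
        # read the list (the default is added to the store if not yet present)
        history = state.store.get("history", [])

        # append this turn's user prompt and write the list back
        history.append(state.user_prompt.text)
        state.store.set("history", history)

        return state

    return solve
```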

It is also possible to access the `Store` *for the current sample* using the `store()` function. This is the mechanism tools use to read and write the `Store`. For example:

``` python
from inspect_ai.solver import store
from inspect_ai.tool import tool

@tool
def web_browser_back():
    def execute() -> str:
        history = store().get("web_browser:history", [])
        return history.pop()

    return execute
```

While there is no formal namespacing mechanism for the `Store`, this can be informally achieved using key prefixes as demonstrated above.

You should generally try to use JSON serialisable Python types in the `Store` (e.g. objects should be dataclasses or Pydantic BaseModel) so that they can be recorded in the [Transcript](#sec-transcripts).
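
For instance, a small Pydantic model stores cleanly (a sketch that assumes a solver context with access to `state`; `SearchResult` and the key name are hypothetical):

``` python
from pydantic import BaseModel

class SearchResult(BaseModel):
    url: str
    summary: str

# JSON serialisable values like this are recorded cleanly in the transcript
results = state.store.get("web_search:results", [])
results.append(SearchResult(url="https://example.com", summary="..."))
state.store.set("web_search:results", results)
```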

While the default `Store` for a sample is shared globally between solvers and tools, a more narrowly scoped `Store` is created automatically for [Subtasks](#sec-subtasks).

## Tool Use

### Custom Loop

The higher level `generate()` function passed to solvers includes a built-in tool use loop—when the model calls a tool, Inspect calls the underlying Python function and reports the result to the model, proceeding until the model stops calling tools. However, for more advanced agents you may want to intervene in the tool use loop in a variety of ways:

1. Urge the model to continue (or take a different path) if it gives up.
2. Exercise more fine grained control over which, when, and how many tool calls are made.
3. Redirect the model to another trajectory if it's not on a productive course.
4. Have multiple `generate()` passes each with a distinct set of tools.

To do this, create a solver that emulates the default tool use loop and provides additional customisation as required. Here is the code at the core of Inspect tool use in `generate()`:

``` python
model = get_model()
state.output = await model.generate(
    state.messages, state.tools
)
state.messages.append(state.output.message)
state.messages.extend(
    call_tools(state.output.message, state.tools)
)
```

This does everything that the default `generate()` does, save for an outer loop that continues calling the model as long as it continues calling tools. You could implement the outer loop as follows:

``` python
model = get_model()
while True:
    state.output = await model.generate(
        state.messages, state.tools
    )
    state.messages.append(state.output.message)
    if state.output.message.tool_calls:
        state.messages.extend(
            call_tools(state.output.message, state.tools)
        )
    else:
        break
```
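
As an illustration of finer-grained control, the same loop can be capped at a fixed number of tool-calling turns (a sketch; the limit of 10 is arbitrary):

``` python
model = get_model()
for _ in range(10):  # arbitrary cap on tool-calling turns
    state.output = await model.generate(
        state.messages, state.tools
    )
    state.messages.append(state.output.message)

    # stop as soon as the model stops calling tools
    if not state.output.message.tool_calls:
        break

    state.messages.extend(
        call_tools(state.output.message, state.tools)
    )
```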

Note that you don't necessarily even need to structure the agent using a loop. For example, you might have an inner function implementing the loop, while an outer function dynamically swaps out which tools are available. Imagining that the loop above were implemented in a function named `tool_use_loop()`, you might have an outer function like this:

``` python
# first pass w/ core tools
state.tools = [decompile(), disassemble(), bash()]
state = await tool_use_loop(state)

# second pass w/ prompt and python tool only
state.tools = [python()]
state = await tool_use_loop(state)
```

Taken together, these APIs enable you to build a custom version of `generate()` with whatever structure and logic you need.
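
For instance, wrapped up as a solver, the loop above might look like the following (a minimal sketch that just mirrors the snippets in this section; it assumes `get_model()` and `call_tools()` are imported from `inspect_ai.model` and omits any custom intervention logic):

``` python
from inspect_ai.model import call_tools, get_model
from inspect_ai.solver import Generate, TaskState, solver

@solver
def custom_tool_loop():
    async def solve(state: TaskState, generate: Generate):
        model = get_model()
        while True:
            # generate, recording the assistant message
            state.output = await model.generate(
                state.messages, state.tools
            )
            state.messages.append(state.output.message)

            # execute tool calls until the model stops making them
            if state.output.message.tool_calls:
                state.messages.extend(
                    call_tools(state.output.message, state.tools)
                )
            else:
                break
        return state

    return solve
```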

### Tool Descriptions

In some cases you may want to change the default descriptions created by a tool author. For example, you might want to provide better disambiguation between multiple similar tools that are used together. You might also need to do this during tool development (to explore which descriptions are most useful to models).

The `tool_with()` function enables you to take any tool and adapt its name and/or descriptions. For example:

``` python
from inspect_ai.tool import tool_with

my_add = tool_with(
    tool=add(),
    name="my_add",
    description="a tool to add numbers",
    parameters={
        "x": "the x argument",
        "y": "the y argument"
    })
```

You need not provide all of the parameters shown above. For example, here we modify just the main tool description or only a single parameter:

``` python
my_add = tool_with(add(), description="a tool to add numbers")
my_add = tool_with(add(), parameters={"x": "the x argument"})
```

Note that the `tool_with()` function returns a copy of the passed tool with modified descriptions (the passed tool retains its original descriptions).

## Transcripts

Transcripts provide a rich per-sample sequential view of everything that occurs during plan execution and scoring, including:

- Model interactions (including the raw API call made to the provider).
- Tool calls (including a sub-transcript of activity within the tool).
- Changes (in [JSON Patch](https://jsonpatch.com/) format) to the `TaskState` for the `Sample`.
- Scoring (including a sub-transcript of interactions within the scorer).
- Custom `info()` messages inserted explicitly into the transcript.
- Python logger calls (`info` level or designated custom `log-level`).

This information is provided within the Inspect log viewer in the **Transcript** tab (which sits alongside the Messages, Scoring, and Metadata tabs in the per-sample display).

### Custom Info

You can insert custom entries into the transcript via the Transcript `info()` method (which creates an `InfoEvent`). Access the transcript for the current sample using the `transcript()` function, for example:

``` python
from inspect_ai.solver import transcript

transcript().info("here is some custom info")
```

You can pass arbitrary JSON serialisable objects to `info()`.
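
For instance, a structured payload is rendered alongside plain strings in the transcript view (the dictionary below is purely illustrative):

``` python
transcript().info({
    "action": "web_search",
    "keywords": "solar power",
    "num_results": 10,
})
```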

### Grouping with Steps

You can create arbitrary groupings of transcript activity using the Transcript `step()` context manager. For example:

``` python
with transcript().step("reasoning"):
    ...
    state.store.set("next-action", next_action)
```

There are two reasons that you might want to create steps:

1. Any changes to the store which occur during a step will be collected into a `StoreEvent` that records them (in [JSON Patch](https://jsonpatch.com/) format).
2. The Inspect log viewer will create a visual delineation for the step, which will make it easier to see the flow of activity within the transcript.

## Subtasks {#sec-subtasks}

Subtasks provide a mechanism for creating isolated, re-usable units of execution. You might implement a complex tool using a subtask, or use subtasks in a multi-agent evaluation. The main characteristics of subtasks are:

1. They run in their own async coroutine.
2. They have their own isolated `Store` (no access to the sample `Store`).
3. They have their own isolated `Transcript`.

To create a subtask, declare an async function with the `@subtask` decorator. The function can take any arguments and return a value of any type. For example:

``` python
from inspect_ai.solver import store, subtask

@subtask
async def web_search(keywords: str) -> str:
    # get links for these keywords
    links = await search_links(keywords)

    # add links to the store so they end up in the transcript
    store().set("links", links)

    # summarise the links
    return await fetch_and_summarise(links)
```

Note that we add `links` to the `store` not because we strictly need to for our implementation, but because we want the links to be recorded as part of the transcript.

Call the subtask as you would any async function:

``` python
summary = await web_search(keywords="solar power")
```

A few things will occur automatically when you run a subtask:

- New isolated `Store` and `Transcript` objects will be created for the subtask (accessible via the `store()` and `transcript()` functions). Changes to the `Store` that occur during execution will be recorded in a `StoreEvent`.

- A `SubtaskEvent` will be added to the current transcript. The event will include the name of the subtask, its input and results, and a transcript of all events that occur within the subtask.

You can also include one or more steps within a subtask.
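
For example, a subtask might delineate its phases as steps (a sketch reusing the hypothetical helpers from above):

``` python
@subtask
async def web_search(keywords: str) -> str:
    with transcript().step("search"):
        links = await search_links(keywords)
        store().set("links", links)

    with transcript().step("summarise"):
        return await fetch_and_summarise(links)
```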

### Parallel Execution

You can execute subtasks in parallel using `asyncio.gather()`. For example, to run 3 `web_search()` subtasks in parallel:


``` python
import asyncio

searches = [
web_search(keywords="solar power"),
web_search(keywords="wind power"),
web_search(keywords="hydro power"),
]

results = await asyncio.gather(*searches)
```

Note that we don't `await` the subtasks when building up our list of `searches`. Rather, we let `asyncio.gather()` await all of them, returning only when all of the results are available.

## Sandboxing

Many agents provide models with the ability to execute arbitrary code. It's important that this code be sandboxed so that it executes in an isolated context. Inspect supports this through the `SandboxEnvironment` (which in turn may be implemented using Docker or various other schemes). Enable sandboxing for a task with the `sandbox` parameter. For example:

``` python
@task
def file_probe():
    return Task(
        dataset=dataset,
        plan=[
            use_tools([list_files()]),
            generate()
        ],
        sandbox="docker",
        scorer=includes(),
    )
```

Use the `SandboxEnvironment` within a tool via the `sandbox()` function. For example, here's an implementation of the `list_files()` tool referenced above:

``` python
from inspect_ai.tool import ToolError, tool
from inspect_ai.util import sandbox

@tool
def list_files():
    async def execute(dir: str):
        """List the files in a directory.

        Args:
            dir (str): Directory

        Returns:
            File listing of the directory
        """
        result = await sandbox().exec(["ls", dir])
        if result.success:
            return result.stdout
        else:
            raise ToolError(result.stderr)

    return execute
```

See the section on [Sandbox Environments](agents.qmd#sec-sandbox-environments) for further details on using sandboxes with Inspect.