From a2119ada97eb1b6353663be49fc494c6b8752cb2 Mon Sep 17 00:00:00 2001
From: aisi-inspect <166920645+aisi-inspect@users.noreply.github.com>
Date: Fri, 27 Sep 2024 16:53:00 +0000
Subject: [PATCH] Built site for gh-pages

---
 .nojekyll           |   2 +-
 agents-api.html     | 241 ++++++++++------
 agents.html         | 671 ++++++++++++++++++++++++--------------------
 eval-logs.html      |   2 +-
 examples/index.html |  10 +-
 index.html          |   2 +-
 log-viewer.html     |   2 +-
 search.json         |   4 +-
 sitemap.xml         |   2 +-
 tutorial.html       |  26 +-
 vscode.html         |   2 +-
 workflow.html       |   2 +-
 12 files changed, 548 insertions(+), 418 deletions(-)

diff --git a/.nojekyll b/.nojekyll
index c4e29f190..78c5d04ad 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-ad327f9f
\ No newline at end of file
+94928440
\ No newline at end of file

diff --git a/agents-api.html b/agents-api.html
index 2cb4ce9e7..ad1b15ccf 100644
--- a/agents-api.html
+++ b/agents-api.html
@@ -306,6 +306,8 @@
Note that by default expected errors (e.g. file not found, insufficient permissions, timeouts, etc.) are forwarded to the model for possible recovery. If you would like to intervene in the default error handling then rather than immediately appending the list of assistant messages returned from call_tools() to state.messages (as shown above), check the error property of these messages (which will be None in the case of no error) and proceed accordingly.
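For instance, a minimal sketch of that pattern (assuming the output and call_tools() variables from the loop referenced above; your own handling may differ):

tool_messages = await call_tools(output.message, state.tools)
for message in tool_messages:
    if message.error is not None:
        # intervene in the default error handling here (e.g. retry the
        # tool call, rewrite the error message, or give up on the tool)
        ...
state.messages.extend(tool_messages)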
One thing that a custom scaffold may do is try to recover from various conditions that cause the model to stop generating. You can find the reason that generation stopped in the stop_reason
field of ModelOutput
. For example:
output = await model.generate(state.messages, state.tools)
if output.stop_reason == "model_length":
    # do something to recover from context window overflow
Here are the possible values for StopReason:

| Stop Reason | Description |
|---|---|
| stop | The model hit a natural stop point or a provided stop sequence |
| max_tokens | The maximum number of tokens specified in the request was reached. |
| model_length | The model’s context length was exceeded. |
| tool_calls | The model called a tool |
| content_filter | Content was omitted due to a content filter. |
| unknown | Unknown (e.g. unexpected runtime error) |
Note that the model_length
and max_tokens
stop reasons are currently only available in the development version of Inspect. You can install the development version with:
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
Note that you don’t necessarily even need to structure the agent using a loop. You might instead have an inner function implementing the loop, while an outer function dynamically swaps out what tools are available. For example, imagine the loop above was implemented in a function named tool_use_loop(); you might have an outer function like this:
# first pass w/ core tools
state.tools = [decompile(), dissasemble(), bash()]
state = await tool_use_loop(state)

# second pass w/ prompt and python tool only
state.tools = [python()]
state = await tool_use_loop(state)
Taken together these APIs enable you to build a custom version of generate()
with whatever structure and logic you need.
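Since the loop itself isn’t shown in this excerpt, here is a minimal sketch of what such a tool_use_loop() might look like, assuming the call_tools() and TaskState APIs described above (the details of a real loop will differ):

from inspect_ai.model import call_tools, get_model
from inspect_ai.solver import TaskState

async def tool_use_loop(state: TaskState) -> TaskState:
    model = get_model()
    while True:
        # generate a response using the currently available tools
        output = await model.generate(state.messages, state.tools)
        state.output = output
        state.messages.append(output.message)

        # execute any tool calls and append their results; otherwise finish
        if output.message.tool_calls:
            state.messages.extend(
                await call_tools(output.message, state.tools)
            )
        else:
            return state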
In some cases you may want to change the default descriptions created by a tool author (for example, to provide better disambiguation between multiple similar tools that are used together). You might also need to do this during development of tools (to explore what descriptions are most useful to models).
The tool_with()
function enables you to take any tool and adapt its name and/or descriptions. For example:
from inspect_ai.tool import tool_with

my_add = tool_with(
    tool=add(),
    name="my_add",
    description="a tool to add numbers",
    parameters={
        "x": "the x argument",
        "y": "the y argument"
    })
You need not provide all of the parameters shown above. For example, here we modify just the main tool description or only a single parameter:
my_add = tool_with(add(), description="a tool to add numbers")
my_add = tool_with(add(), parameters={"x": "the x argument"})
Note that the tool_with()
function returns a copy of the passed tool with modified descriptions (the passed tool retains its original descriptions).
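For example, because a copy is returned, the same underlying tool can be offered twice with different descriptions for disambiguation (a purely hypothetical sketch):

add_ints = tool_with(add(), name="add_ints",
                     description="a tool to add two integers")
add_floats = tool_with(add(), name="add_floats",
                       description="a tool to add two floating point numbers")
# add() itself still carries its original name and descriptions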
You can insert custom entries into the transcript via the Transcript info()
method (which creates an InfoEvent
). Access the transcript for the current sample using the transcript()
function, for example:
from inspect_ai.log import transcript

transcript().info("here is some custom info")
Strings passed to info()
will be rendered as markdown. In addition to strings you can also pass arbitrary JSON serialisable objects to info()
.
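For example, a small sketch passing a JSON serialisable object (the keys here are purely illustrative):

transcript().info({"phase": "search", "links_found": 12})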
You can create arbitrary groupings of transcript activity using the Transcript step()
context manager. For example:
with transcript().step("reasoning"):
    ...
    state.store.set("next-action", next_action)
There are two reasons that you might want to create steps: any changes to the store which occur during the step will be collected into a StoreEvent that records the changes (in JSON Patch format) that occurred, and the step will appear as a distinct grouping of activity within the transcript.
To create a subtask, declare an async function with the @subtask
decorator. The function can take any arguments and return a value of any type. For example:
from inspect_ai.util import Store, subtask

@subtask
async def web_search(keywords: str) -> str:
    # get links for these keywords
    links = await search_links(keywords)

    # add links to the store so they end up in the transcript
    store().set("links", links)

    # summarise the links
    return await fetch_and_summarise(links)
Note that we add links
to the store
not because we strictly need to for our implementation, but because we want the links to be recorded as part of the transcript.
Call the subtask as you would any async function:
summary = await web_search(keywords="solar power")
A few things will occur automatically when you run a subtask:
New isolated Store
and Transcript
objects will be created for the subtask (accessible via the store()
and transcript()
functions). Changes to the Store
that occur during execution will be recorded in a StoreEvent
.
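As a sketch of that isolation (assuming Store.get() accepts a default value), a value set by the parent solver is not visible inside the subtask:

from inspect_ai.util import store, subtask

@subtask
async def check_isolation() -> str:
    # the subtask sees a fresh, empty Store, so the parent's value is absent
    return store().get("parent-key", "not set in subtask")

# in the parent solver:
# store().set("parent-key", "parent value")
# value = await check_isolation()   # -> "not set in subtask"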
You can execute subtasks in parallel using asyncio.gather()
. For example, to run 3 web_search()
subtasks in parallel:
import asyncio

searches = [
    web_search(keywords="solar power"),
    web_search(keywords="wind power"),
    web_search(keywords="hydro power"),
]
results = await asyncio.gather(*searches)
Note that we don’t await
the subtasks when building up our list of searches
. Rather, we let asyncio.gather()
await all of them, returning only when all of the results are available.
Inspect’s fork()
function provides a convenient wrapper around a very common use of subtasks: running a TaskState
against a set of solvers in parallel to explore different trajectories.
For example, let’s say you have a solver named explore()
that takes temperature
as a parameter. You might want to try the solver out with multiple temperature values and then continue on with the best result:
from inspect_ai.solver import fork

results = await fork(state, [
    explore(temperature = 0.5),
    explore(temperature = 0.75),
    explore(temperature = 1.0)
])
The state
will be deep copied so that each explore()
solver instance gets its own copy of the state
to work on. The results
contain a list of TaskState
with the value returned from each of the solvers.
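You might then continue with whichever trajectory looks best. Here is a hedged sketch that uses completion length as a stand-in for whatever selection criterion your agent actually applies:

# pick one of the forked trajectories to carry forward
state = max(results, key=lambda s: len(s.output.completion))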
Many agents provide models with the ability to execute arbitrary code. It’s important that this code be sandboxed so that it executes in an isolated context. Inspect supports this through the SandboxEnvironment
(which in turn may be implemented using Docker or various other schemes). Enable sandboxing for a task with the sandbox
parameter. For example:
@task
def file_probe():
    return Task(
        dataset=dataset,
        solver=[
            use_tools([list_files()]),
            generate()
        ],
        sandbox="docker",
        scorer=includes(),
    )
Use the SandboxEnvironment
within a tool via the sandbox()
function. For example, here’s an implementation of the list_files()
tool referenced above:
from inspect_ai.tool import ToolError, tool
from inspect_ai.util import sandbox

@tool
def list_files():
    async def execute(dir: str):
        """List the files in a directory.

        Args:
            dir (str): Directory

        Returns:
            File listing of the directory
        """
        result = await sandbox().exec(["ls", dir])
        if result.success:
            return result.stdout
        else:
            raise ToolError(result.stderr)

    return execute
See the section on Sandbox Environments for further details on using sandboxes with Inspect.
diff --git a/agents.html b/agents.html
index 20b364216..8bd958c2f 100644
--- a/agents.html
+++ b/agents.html
@@ -308,6 +308,8 @@
While it’s possible to make tools globally available to the model via use_tools()
, you may also want to filter the available tools either based on task stages or dynamically based on some other criteria.
Here’s an example of a solver agent that filters the available tools between calls to generate()
:
@solver
def ctf_agent():
    async def solve(state: TaskState, generate: Generate):

        # first pass w/ core tools
        state.tools = [decompile(), dissasemble(), bash()]
        state = await generate(state)

        # second pass w/ prompt and python tool only
        state.tools = [python()]
        state.messages.append(ChatMessageUser(
            content = "Use Python to extract the flag."
        ))
        state = await generate(state)

        # clear tools and return
        state.tools = []
        return state

    return solve
The core of the integration is bridging from the Inspect solver interface to the standard input and output types of the agent library. In this example that bridge is provided by the langchain_solver()
function, which takes a LangChain agent function and converts it to an Inspect solver.
Here’s the implementation of langchain_solver()
(imports excluded for brevity):
# Interface for LangChain agent function
class LangChainAgent(Protocol):
    async def __call__(self, llm: BaseChatModel, input: dict[str, Any]): ...

# Convert a LangChain agent function into a Solver
def langchain_solver(agent: LangChainAgent) -> Solver:

    async def solve(state: TaskState, generate: Generate) -> TaskState:

        # create the inspect model api bridge
        llm = InspectChatModel()

        # call the agent
        output = await agent(
            llm = llm,
            input = dict(
                input=state.user_prompt.text,
                chat_history=as_langchain_chat_history(
                    state.messages[1:]
                ),
            )
        )

        # collect output from llm interface
        state.messages = llm.messages
        state.output = llm.output
        state.output.completion = output

        # return state
        return state

    return solve

# LangChain BaseChatModel for Inspect Model API
class InspectChatModel(BaseChatModel):
    async def _agenerate(
        self,
        messages: list[BaseMessage],
        stop: list[str] | None = None,
        run_manager: AsyncCallbackManagerForLLMRun | None = None,
        **kwargs: dict[str, Any],
    ) -> ChatResult:
        ...
Now here’s the wikipedia_search()
solver (imports again excluded for brevity):
@solver
def wikipedia_search(
    max_iterations: int | None = 15,
    max_execution_time: float | None = None
) -> Solver:
    # standard prompt for tools agent
    prompt = hub.pull("hwchase17/openai-tools-agent")

    # tavily and wikipedia tools
    tavily_api = TavilySearchAPIWrapper()  # type: ignore
    tools = (
        [TavilySearchResults(api_wrapper=tavily_api)] +
        load_tools(["wikipedia"])
    )

    # agent function
    async def agent(
        llm: BaseChatModel, input: dict[str, Any]
    ) -> str | list[str | dict[str,Any]]:
        # create agent
        tools_agent = create_openai_tools_agent(
            llm, tools, prompt
        )
        executor = AgentExecutor.from_agent_and_tools(
            agent=cast(BaseMultiActionAgent, tools_agent),
            tools=tools,
            name="wikipedia_search",
            max_iterations=max_iterations,
            max_execution_time=max_execution_time
        )

        # execute the agent and return output
        result = await executor.ainvoke(input)
        return result["output"]

    # return agent function as inspect solver
    return langchain_solver(agent)
Note that we register native LangChain tools. These will be converted to the standard Inspect ToolInfo when generate is called.
This is the standard interface to LangChain agents. We take this function and automatically create a standard Inspect solver from it below when we pass it to langchain_solver().
Invoke the agent using the chat history passed in input. We call the async executor API to play well with Inspect’s concurrency.
The langchain_solver() function maps the simpler agent function semantics into the standard Inspect solver API.
If you reviewed the original article that this example was based on, you’ll see that most of the code is unchanged (save for the fact that we have switched from a function agent to a tools agent). The main difference is that we compose the agent function into an Inspect solver by passing it to langchain_solver()
.
Finally, here’s a task that uses the wikipedia_search()
solver:
@task
def wikipedia() -> Task:
    return Task(
        dataset=json_dataset("wikipedia.jsonl"),
        solver=wikipedia_search(),
        scorer=model_graded_fact(),
    )
The full source code for this example can be found in the Inspect GitHub repo at examples/langchain.
@@ -716,108 +781,108 @@
Let’s take a look at a simple example to illustrate. First, we’ll define a list_files()
tool. This tool needs to access the ls
command—it does so by calling the sandbox()
function to get access to the SandboxEnvironment
instance for the currently executing Sample
:
from inspect_ai.tool import ToolError, tool
from inspect_ai.util import sandbox

@tool
def list_files():
    async def execute(dir: str):
        """List the files in a directory.

        Args:
            dir (str): Directory

        Returns:
            File listing of the directory
        """
        result = await sandbox().exec(["ls", dir])
        if result.success:
            return result.stdout
        else:
            raise ToolError(result.stderr)

    return execute
The exec()
function is used to list the directory contents. Note that it’s not immediately clear where or how exec()
is implemented (that will be described shortly!).
Here’s an evaluation that makes use of this tool:
from inspect_ai import task, Task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, use_tools

dataset = [
    Sample(
        input='Is there a file named "bar.txt" '
              + 'in the current directory?',
        target="Yes",
        files={"bar.txt": "hello"},
    )
]

@task
def file_probe():
    return Task(
        dataset=dataset,
        solver=[
            use_tools([list_files()]),
            generate()
        ],
        sandbox="docker",
        scorer=includes(),
    )
We’ve included sandbox="docker"
to indicate that sandbox environment operations should be executed in a Docker container. Specifying a sandbox environment (either at the task or evaluation level) is required if your tools call the sandbox()
function.
Note that files
are specified as part of the Sample
. Files can be specified inline using plain text (as depicted above), inline using a base64-encoded data URI, or as a path to a file or remote resource (e.g. S3 bucket). Relative file paths are resolved according to the location of the underlying dataset file.
The following instance methods are available to tools that need to interact with a SandboxEnvironment
:
class SandboxEnvironment:
-
- async def exec(
- self,
- list[str],
- cmd: input: str | bytes | None = None,
- str | None = None,
- cwd: dict[str, str] = {},
- env: str | None = None,
- user: int | None = None,
- timeout: -> ExecResult[str]:
- ) """
- Raises:
- TimeoutError: If the specified `timeout` expires.
- UnicodeDecodeError: If an error occurs while
- decoding the command output.
- PermissionError: If the user does not have
- permission to execute the command.
- """
-
- ...
-async def write_file(
- self, file: str, contents: str | bytes
- -> None:
- ) """
- Raises:
- PermissionError: If the user does not have
- permission to write to the specified path.
- IsADirectoryError: If the file exists already and
- is a directory.
- """
-
- ...
-async def read_file(
- self, file: str, text: bool = True
- -> Union[str | bytes]:
- ) """
- Raises:
- FileNotFoundError: If the file does not exist.
- UnicodeDecodeError: If an encoding error occurs
- while reading the file.
- (only applicable when `text = True`)
- PermissionError: If the user does not have
- permission to read from the specified path.
- IsADirectoryError: If the file is a directory.
- """
- ...
class SandboxEnvironment:
+
+ async def exec(
+ self,
+ list[str],
+ cmd: input: str | bytes | None = None,
+ str | None = None,
+ cwd: dict[str, str] = {},
+ env: str | None = None,
+ user: int | None = None,
+ timeout: -> ExecResult[str]:
+ ) """
+ Raises:
+ TimeoutError: If the specified `timeout` expires.
+ UnicodeDecodeError: If an error occurs while
+ decoding the command output.
+ PermissionError: If the user does not have
+ permission to execute the command.
+ """
+
+ ...
+async def write_file(
+ self, file: str, contents: str | bytes
+ -> None:
+ ) """
+ Raises:
+ PermissionError: If the user does not have
+ permission to write to the specified path.
+ IsADirectoryError: If the file exists already and
+ is a directory.
+ """
+
+ ...
+async def read_file(
+ self, file: str, text: bool = True
+ -> Union[str | bytes]:
+ ) """
+ Raises:
+ FileNotFoundError: If the file does not exist.
+ UnicodeDecodeError: If an encoding error occurs
+ while reading the file.
+ (only applicable when `text = True`)
+ PermissionError: If the user does not have
+ permission to read from the specified path.
+ IsADirectoryError: If the file is a directory.
+ """
+ ...
Note that write_file()
automatically creates parent directories as required if they don’t exist.
For each method there is a documented set of errors that are raised: these are expected errors and can either be caught by tools or allowed to propagate in which case they will be reported to the model for potential recovery. In addition, unexpected errors may occur (e.g. a networking error connecting to a remote container): these errors are not reported to the model and fail the Sample
with an error state.
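For example, here is a hedged sketch of a tool that catches one of the documented expected errors itself and converts it into a ToolError for the model (rather than letting it propagate):

from inspect_ai.tool import ToolError, tool
from inspect_ai.util import sandbox

@tool
def show_file():
    async def execute(file: str):
        """Show the contents of a file.

        Args:
            file (str): File to read

        Returns:
            Contents of the file
        """
        try:
            return await sandbox().read_file(file)
        except FileNotFoundError:
            # expected error: report it to the model for possible recovery
            raise ToolError(f"file not found: {file}")

    return execute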
The sandbox is also available to custom scorers.
@@ -845,17 +910,17 @@
Sandbox environment definitions can be bound at the Sample
, Task
, or eval()
level. Binding precedence goes from eval()
, to Task
to Sample
, however sandbox config files defined on the Sample
always take precedence when the sandbox type for the Sample
is the same as the enclosing Task
or eval()
.
Here is a Task
that defines a sandbox
:
Task(
    dataset=dataset,
    plan([
        use_tools([read_file(), list_files()]),
        generate()
    ]),
    scorer=match(),
    sandbox="docker"
)
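The same binding could instead be made when the evaluation is run. A hedged sketch, assuming eval() accepts the sandbox argument implied by the precedence rules above (ctf.py is just an illustrative task file):

from inspect_ai import eval

eval("ctf.py", sandbox="docker")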
By default, any Dockerfile
and/or compose.yaml
file within the task directory will be automatically discovered and used. If your compose file has a different name then you can provide an override specification as follows:
=("docker", "attacker-compose.yaml") sandbox
=("docker", "attacker-compose.yaml") sandbox
The configuration file added to the sandbox
spec should always be a compose file (rather than a Dockerfile
, which is always discovered automatically).
If there is a Sample setup
script it will be executed within the default sandbox environment after any Sample files
are copied into the environment. The setup
field can be either the script contents, a file path containing the script, or a base64 encoded Data URL.
The setup
script is by default interpreted as a bash script, however you can have it executed by another interpreter using a shebang comment. For example, this will be executed as a Python script:
#!/usr/bin/env python3

print('hello from python')
compose.yaml
services:
  default:
    build: .
    init: true
    command: tail -f /dev/null
    cpus: 1.0
    mem_limit: 0.5gb
    network_mode: none
The init: true
entry enables the container to respond to shutdown requests. The command
is provided to prevent the container from exiting after it starts.
Here is what a simple compose.yaml
would look like for a local pre-built image named ctf-agent-environment
(resource and network limits excluded for brevity):
compose.yaml
services:
  default:
    image: ctf-agent-environment
    x-local: true
    init: true
    command: tail -f /dev/null
The ctf-agent-environment
is not an image that exists on a remote registry, so we add the x-local: true
to indicate that it should not be pulled. If local images are tagged, they also will not be pulled by default (so x-local: true
is not required). For example:
compose.yaml
services:
  default:
    image: ctf-agent-environment:1.0.0
    init: true
    command: tail -f /dev/null
If we are using an image from a remote registry we similarly don’t need to include x-local
:
compose.yaml
services:
  default:
    image: python:3.12-bookworm
    init: true
    command: tail -f /dev/null
See the Docker Compose documentation for information on all available container options.
compose.yaml
services:
  default:
    image: ctf-agent-environment
    x-local: true
    init: true
    cpus: 1.0
    mem_limit: 0.5gb
  victim:
    image: ctf-victim-environment
    x-local: true
    init: true
    cpus: 1.0
    mem_limit: 1gb
The first environment listed is the “default” environment, and can be accessed from within a tool with a normal call to sandbox()
. Other environments would be accessed by name, for example:
sandbox()          # default sandbox environment
sandbox("victim")  # named sandbox environment
You can view more detailed logging around the creation and use of sandbox environments by using the sandbox
log level. For example:
$ inspect eval ctf.py --log-level sandbox
The sandbox log level is just above warning
(so it will not show http
or debug
level messages).