From a2119ada97eb1b6353663be49fc494c6b8752cb2 Mon Sep 17 00:00:00 2001
From: aisi-inspect <166920645+aisi-inspect@users.noreply.github.com>
Date: Fri, 27 Sep 2024 16:53:00 +0000
Subject: [PATCH] Built site for gh-pages

---
 .nojekyll           |   2 +-
 agents-api.html     | 241 ++++++++++------
 agents.html         | 671 ++++++++++++++++++++++++--------------------
 eval-logs.html      |   2 +-
 examples/index.html |  10 +-
 index.html          |   2 +-
 log-viewer.html     |   2 +-
 search.json         |   4 +-
 sitemap.xml         |   2 +-
 tutorial.html       |  26 +-
 vscode.html         |   2 +-
 workflow.html       |   2 +-
 12 files changed, 548 insertions(+), 418 deletions(-)

diff --git a/.nojekyll b/.nojekyll
index c4e29f190..78c5d04ad 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-ad327f9f
\ No newline at end of file
+94928440
\ No newline at end of file

diff --git a/agents-api.html b/agents-api.html
index 2cb4ce9e7..ad1b15ccf 100644
--- a/agents-api.html
+++ b/agents-api.html
@@ -306,6 +306,8 @@

Table of contents

  • Tool Use
  • Transcripts

@@ -444,34 +446,97 @@

    Custom Loop

  • Adding a critique / reflection step between tool calling and generate.
  • Deep copying the TaskState and exploring several trajectories.

    Note that by default expected errors (e.g. file not found, insufficient permissions, timeouts, etc.) are forwarded to the model for possible recovery. If you would like to intervene in the default error handling, then rather than immediately appending the list of assistant messages returned from call_tools() to state.messages (as shown above), check the error property of these messages (which will be None if no error occurred) and proceed accordingly.


    Stop Reasons


    One thing that a custom scaffold may do is try to recover from various conditions that cause the model to stop generating. You can find the reason that generation stopped in the stop_reason field of ModelOutput. For example:

    output = await model.generate(state.messages, state.tools)
    +if output.stop_reason == "model_length":
    +    # do something to recover from context window overflow
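    For example, one purely illustrative recovery strategy is to drop older conversation messages and retry the generation (the message slicing below is a placeholder, not a recommended policy):

    if output.stop_reason == "model_length":
        # keep the original prompt messages plus the most recent exchanges (illustrative)
        state.messages = state.messages[:2] + state.messages[-10:]
        output = await model.generate(state.messages, state.tools)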

    Here are the possible values for StopReason:

    Stop Reason      Description
    stop             The model hit a natural stop point or a provided stop sequence.
    max_tokens       The maximum number of tokens specified in the request was reached.
    model_length     The model’s context length was exceeded.
    tool_calls       The model called a tool.
    content_filter   Content was omitted due to a content filter.
    unknown          Unknown (e.g. unexpected runtime error).

    Note that the model_length and max_tokens stop reasons are currently only available in the development version of Inspect. You can install the development version with:

    pip install git+https://github.com/UKGovernmentBEIS/inspect_ai

    Error Handling


    By default, expected errors (e.g. file not found, insufficient permissions, timeouts, etc.) are forwarded to the model for possible recovery. If you would like to intervene in the default error handling, then rather than immediately appending the list of assistant messages returned from call_tools() to state.messages (as shown above), check the error property of these messages (which will be None if no error occurred) and proceed accordingly.
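    Here is a minimal sketch of that pattern, assuming the tool-use loop shown above (where output is the result of generate() and the error field on each returned tool message is None unless a tool error occurred):

    # append tool messages one at a time, inspecting errors as we go
    tool_messages = await call_tools(output.message, state.tools)
    for message in tool_messages:
        if message.error is not None:
            # intervene here (e.g. amend the error message or end the loop early)
            ...
        state.messages.append(message)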

    Note that you don’t necessarily even need to structure the agent as a single loop. For example, you might have an inner function implementing the loop, while an outer function dynamically swaps out which tools are available. If the loop above were implemented in a function named tool_use_loop(), the outer function might look like this:

    -
    # first pass w/ core tools
    -state.tools = [decompile(), disassemble(), bash()]
    -state = await tool_use_loop(state)
    -
    -# second pass w/ prompt and python tool only
    -state.tools = [python()]
    -state = await tool_use_loop(state)
    +
    # first pass w/ core tools
    +state.tools = [decompile(), disassemble(), bash()]
    +state = await tool_use_loop(state)
    +
    +# second pass w/ prompt and python tool only
    +state.tools = [python()]
    +state = await tool_use_loop(state)

    Taken together these APIs enable you to build a custom version of generate() with whatever structure and logic you need.

    Tool Descriptions

    In some cases you may want to change the default descriptions created by a tool author, for example to provide better disambiguation between multiple similar tools that are used together. You might also need to do this during development of tools (to explore which descriptions are most useful to models).

    The tool_with() function enables you to take any tool and adapt its name and/or descriptions. For example:

    -
    from inspect_ai.tool import tool_with
    -
    -my_add = tool_with(
    -  tool=add(), 
    -  name="my_add",
    -  description="a tool to add numbers", 
    -  parameters={
    -    "x": "the x argument",
    -    "y": "the y argument"
    -  })
    +
    from inspect_ai.tool import tool_with
    +
    +my_add = tool_with(
    +  tool=add(), 
    +  name="my_add",
    +  description="a tool to add numbers", 
    +  parameters={
    +    "x": "the x argument",
    +    "y": "the y argument"
    +  })

    You need not provide all of the parameters shown above. For example, here we modify just the main tool description, or only a single parameter:

    -
    my_add = tool_with(add(), description="a tool to add numbers")
    -my_add = tool_with(add(), parameters={"x": "the x argument"})
    +
    my_add = tool_with(add(), description="a tool to add numbers")
    +my_add = tool_with(add(), parameters={"x": "the x argument"})

    Note that the tool_with() function returns a copy of the passed tool with modified descriptions (the passed tool retains its original descriptions).

    @@ -490,17 +555,17 @@

    Transcripts

    Custom Info

    You can insert custom entries into the transcript via the Transcript info() method (which creates an InfoEvent). Access the transcript for the current sample using the transcript() function, for example:

    -
    from inspect_ai.log import transcript
    -
    -transcript().info("here is some custom info")
    +
    from inspect_ai.log import transcript
    +
    +transcript().info("here is some custom info")

    Strings passed to info() will be rendered as markdown. In addition to strings, you can also pass arbitrary JSON-serialisable objects to info().
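    For example, a dict works just as well as a string (the keys here are purely illustrative):

    transcript().info({"phase": "planning", "keywords": "solar power"})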

    Grouping with Steps

    You can create arbitrary groupings of transcript activity using the Transcript step() context manager. For example:

    -
    with transcript().step("reasoning"):
    -    ...
    -    state.store.set("next-action", next_action)
    +
    with transcript().step("reasoning"):
    +    ...
    +    state.store.set("next-action", next_action)

    There are two reasons that you might want to create steps:

    1. Any changes to the store which occur during a step will be collected into a StoreEvent that records the changes (in JSON Patch format) that occurred.
    2.
@@ -517,21 +582,21 @@

      Subtasks

    3. They have their own isolated Transcript

    To create a subtask, declare an async function with the @subtask decorator. The function can take any arguments and return a value of any type. For example:

    -
    from inspect_ai.util import Store, subtask
    -
    -@subtask
    -async def web_search(keywords: str) -> str:
    -    # get links for these keywords
    -    links = await search_links(keywords)
    -
    -    # add links to the store so they end up in the transcript
    -    store().set("links", links)
    -
    -    # summarise the links
    -    return await fetch_and_summarise(links)
    +
    from inspect_ai.util import Store, subtask
    +
    +@subtask
    +async def web_search(keywords: str) -> str:
    +    # get links for these keywords
    +    links = await search_links(keywords)
    +
    +    # add links to the store so they end up in the transcript
    +    store().set("links", links)
    +
    +    # summarise the links
    +    return await fetch_and_summarise(links)

    Note that we add links to the store not because we strictly need to for our implementation, but because we want the links to be recorded as part of the transcript.

    Call the subtask as you would any async function:

    -
    summary = await web_search(keywords="solar power")
    +
    summary = await web_search(keywords="solar power")

    A few things will occur automatically when you run a subtask:

    Sandboxing

    Many agents provide models with the ability to execute arbitrary code. It’s important that this code be sandboxed so that it executes in an isolated context. Inspect supports this through the SandboxEnvironment (which in turn may be implemented using Docker or various other schemes). Enable sandboxing for a task with the sandbox parameter. For example:

    -
    @task
    -def file_probe():
    -    return Task(
    -        dataset=dataset,
    -        solver=[
    -            use_tools([list_files()]), 
    -            generate()
    -        ],
    -        sandbox="docker",
    -        scorer=includes(),
    -    )
    +
    @task
    +def file_probe():
    +    return Task(
    +        dataset=dataset,
    +        solver=[
    +            use_tools([list_files()]), 
    +            generate()
    +        ],
    +        sandbox="docker",
    +        scorer=includes(),
    +    )

    Use the SandboxEnvironment within a tool via the sandbox() function. For example, here’s an implementation of the list_files() tool referenced above:

    -
    from inspect_ai.tool import ToolError, tool
    -from inspect_ai.util import sandbox
    -
    -@tool
    -def list_files():
    -    async def execute(dir: str):
    -        """List the files in a directory.
    -
    -        Args:
    -            dir (str): Directory
    -
    -        Returns:
    -            File listing of the directory
    -        """
    -        result = await sandbox().exec(["ls", dir])
    -        if result.success:
    -            return result.stdout
    -        else:
    -            raise ToolError(result.stderr)
    -
    -    return execute
    +
    from inspect_ai.tool import ToolError, tool
    +from inspect_ai.util import sandbox
    +
    +@tool
    +def list_files():
    +    async def execute(dir: str):
    +        """List the files in a directory.
    +
    +        Args:
    +            dir (str): Directory
    +
    +        Returns:
    +            File listing of the directory
    +        """
    +        result = await sandbox().exec(["ls", dir])
    +        if result.success:
    +            return result.stdout
    +        else:
    +            raise ToolError(result.stderr)
    +
    +    return execute

    See the section on Sandbox Environments for further details on using sandboxes with Inspect.

diff --git a/agents.html b/agents.html
index 20b364216..8bd958c2f 100644
--- a/agents.html
+++ b/agents.html
@@ -308,6 +308,8 @@

    Table of contents

  • Custom Scaffold
@@ -539,31 +541,94 @@

    Custom Scaffold

    Adding a critique / reflection step between tool calling and generate.
  • Deep copying the TaskState and exploring several trajectories.

    Note that by default expected errors (e.g. file not found, insufficient permissions, timeouts, etc.) are forwarded to the model for possible recovery. If you would like to intervene in the default error handling, then rather than immediately appending the list of assistant messages returned from call_tools() to state.messages (as shown above), check the error property of these messages (which will be None if no error occurred) and proceed accordingly.


    Stop Reasons


    One thing that a custom scaffold may do is try to recover from various conditions that cause the model to stop generating. You can find the reason that generation stopped in the stop_reason field of ModelOutput. For example:

    output = await model.generate(state.messages, state.tools)
    +if output.stop_reason == "model_length":
    +    # do something to recover from context window overflow

    Here are the possible values for StopReason:

    Stop Reason      Description
    stop             The model hit a natural stop point or a provided stop sequence.
    max_tokens       The maximum number of tokens specified in the request was reached.
    model_length     The model’s context length was exceeded.
    tool_calls       The model called a tool.
    content_filter   Content was omitted due to a content filter.
    unknown          Unknown (e.g. unexpected runtime error).

    Note that the model_length and max_tokens stop reasons are currently only available in the development version of Inspect. You can install the development version with:

    pip install git+https://github.com/UKGovernmentBEIS/inspect_ai

    Error Handling


    By default, expected errors (e.g. file not found, insufficient permissions, timeouts, etc.) are forwarded to the model for possible recovery. If you would like to intervene in the default error handling, then rather than immediately appending the list of assistant messages returned from call_tools() to state.messages (as shown above), check the error property of these messages (which will be None if no error occurred) and proceed accordingly.


    Tool Filtering

    While it’s possible to make tools globally available to the model via use_tools(), you may also want to filter the available tools, either based on task stages or dynamically based on some other criteria.

    Here’s an example of a solver agent that filters the available tools between calls to generate():

    -
    @solver
    -def ctf_agent():
    -    async def solve(state: TaskState, generate: Generate):
    -        
    -        # first pass w/ core tools
    -        state.tools = [decompile(), disassemble(), bash()]
    -        state = await generate(state)
    -
    -        # second pass w/ prompt and python tool only
    -        state.tools = [python()]
    -        state.messages.append(ChatMessageUser( 
    -            content = "Use Python to extract the flag." 
    -        ))  
    -        state = await generate(state)
    -
    -        # clear tools and return
    -        state.tools = []
    -        return state
    -    
    -    return solve
    +
    @solver
    +def ctf_agent():
    +    async def solve(state: TaskState, generate: Generate):
    +        
    +        # first pass w/ core tools
    +        state.tools = [decompile(), disassemble(), bash()]
    +        state = await generate(state)
    +
    +        # second pass w/ prompt and python tool only
    +        state.tools = [python()]
    +        state.messages.append(ChatMessageUser( 
    +            content = "Use Python to extract the flag." 
    +        ))  
    +        state = await generate(state)
    +
    +        # clear tools and return
    +        state.tools = []
    +        return state
    +    
    +    return solve

    Agents API

    @@ -583,49 +648,49 @@

    Example: LangChain

  • Bridging from the Inspect solver interface to the standard input and output types of the agent library. In this example, this bridge is provided by the langchain_solver() function, which takes a LangChain agent function and converts it to an Inspect solver.

  • Here’s the implementation of langchain_solver() (imports excluded for brevity):

    -
    # Interface for LangChain agent function
    -class LangChainAgent(Protocol):
    -    async def __call__(self, llm: BaseChatModel, input: dict[str, Any]): ...
    -
    -# Convert a LangChain agent function into a Solver
    -def langchain_solver(agent: LangChainAgent) -> Solver:
    -
    -    async def solve(state: TaskState, generate: Generate) -> TaskState:
    -
    -        # create the inspect model api bridge
    -        llm = InspectChatModel()
    -
    -        # call the agent
    -        output = await agent(
    -            llm = llm,
    -            input = dict(
    -                input=state.user_prompt.text,
    -                chat_history=as_langchain_chat_history(
    -                    state.messages[1:]
    -                ),
    -            )
    -        )
    -
    -        # collect output from llm interface
    -        state.messages = llm.messages
    -        state.output = llm.output
    -        state.output.completion = output
    -        
    -        # return state
    -        return state
    -
    -    return solve
    -
    -# LangChain BaseChatModel for Inspect Model API
    -class InspectChatModel(BaseChatModel):
    -     async def _agenerate(
    -        self,
    -        messages: list[BaseMessage],
    -        stop: list[str] | None = None,
    -        run_manager: AsyncCallbackManagerForLLMRun | None = None,
    -        **kwargs: dict[str, Any],
    -    ) -> ChatResult:
    -        ...
    +
    # Interface for LangChain agent function
    +class LangChainAgent(Protocol):
    +    async def __call__(self, llm: BaseChatModel, input: dict[str, Any]): ...
    +
    +# Convert a LangChain agent function into a Solver
    +def langchain_solver(agent: LangChainAgent) -> Solver:
    +
    +    async def solve(state: TaskState, generate: Generate) -> TaskState:
    +
    +        # create the inspect model api bridge
    +        llm = InspectChatModel()
    +
    +        # call the agent
    +        output = await agent(
    +            llm = llm,
    +            input = dict(
    +                input=state.user_prompt.text,
    +                chat_history=as_langchain_chat_history(
    +                    state.messages[1:]
    +                ),
    +            )
    +        )
    +
    +        # collect output from llm interface
    +        state.messages = llm.messages
    +        state.output = llm.output
    +        state.output.completion = output
    +        
    +        # return state
    +        return state
    +
    +    return solve
    +
    +# LangChain BaseChatModel for Inspect Model API
    +class InspectChatModel(BaseChatModel):
    +     async def _agenerate(
    +        self,
    +        messages: list[BaseMessage],
    +        stop: list[str] | None = None,
    +        run_manager: AsyncCallbackManagerForLLMRun | None = None,
    +        **kwargs: dict[str, Any],
    +    ) -> ChatResult:
    +        ...
    @@ -637,71 +702,71 @@

    Example: LangChain

    Now here’s the wikipedia_search() solver (imports again excluded for brevity):

    -
    @solver
    -def wikipedia_search(
    -    max_iterations: int | None = 15,
    -    max_execution_time: float | None = None
    -) -> Solver:
    -    # standard prompt for tools agent
    -    prompt = hub.pull("hwchase17/openai-tools-agent")
    -
    -    # tavily and wikipedia tools
    -    tavily_api = TavilySearchAPIWrapper()  # type: ignore
    -    tools = (
    -        [TavilySearchResults(api_wrapper=tavily_api)] + 
    -        load_tools(["wikipedia"])
    -    )
    -
    -    # agent function
    -    async def agent(
    -        llm: BaseChatModel, 
    -        input: dict[str, Any]
    -    ) -> str | list[str | dict[str,Any]]:  
    -        # create agent
    -        tools_agent = create_openai_tools_agent(
    -          llm, tools, prompt
    -        )
    -        executor = AgentExecutor.from_agent_and_tools(
    -            agent=cast(BaseMultiActionAgent, tools_agent),
    -            tools=tools,
    -            name="wikipedia_search",
    -            max_iterations=max_iterations,  
    -            max_execution_time=max_execution_time
    -        )
    -
    -        # execute the agent and return output
    -        result = await executor.ainvoke(input)  
    -        return result["output"]
    -
    -    # return agent function as inspect solver
    -    return langchain_solver(agent)
    +
    @solver
    +def wikipedia_search(
    +    max_iterations: int | None = 15,
    +    max_execution_time: float | None = None
    +) -> Solver:
    +    # standard prompt for tools agent
    +    prompt = hub.pull("hwchase17/openai-tools-agent")
    +
    +    # tavily and wikipedia tools
    +    tavily_api = TavilySearchAPIWrapper()  # type: ignore
    +    tools = (
    +        [TavilySearchResults(api_wrapper=tavily_api)] + 
    +        load_tools(["wikipedia"])
    +    )
    +
    +    # agent function
    +    async def agent(
    +        llm: BaseChatModel, 
    +        input: dict[str, Any]
    +    ) -> str | list[str | dict[str,Any]]:  
    +        # create agent
    +        tools_agent = create_openai_tools_agent(
    +          llm, tools, prompt
    +        )
    +        executor = AgentExecutor.from_agent_and_tools(
    +            agent=cast(BaseMultiActionAgent, tools_agent),
    +            tools=tools,
    +            name="wikipedia_search",
    +            max_iterations=max_iterations,  
    +            max_execution_time=max_execution_time
    +        )
    +
    +        # execute the agent and return output
    +        result = await executor.ainvoke(input)  
    +        return result["output"]
    +
    +    # return agent function as inspect solver
    +    return langchain_solver(agent)
    1. Note that we register native LangChain tools. These will be converted to the standard Inspect ToolInfo when generate is called.

    2. This is the standard interface to LangChain agents. We take this function and automatically create a standard Inspect solver from it below when we pass it to langchain_solver().

    3. Invoke the agent using the chat history passed in input. We call the async executor API to play well with Inspect’s concurrency.

    4. The langchain_solver() function maps the simpler agent function semantics into the standard Inspect solver API.

    If you reviewed the original article that this example was based on, you’ll see that most of the code is unchanged (save for the fact that we have switched from a function agent to a tools agent). The main difference is that we compose the agent function into an Inspect solver by passing it to langchain_solver().

    Finally, here’s a task that uses the wikipedia_search() solver:

    -
    @task
    -def wikipedia() -> Task:
    -    return Task(
    -        dataset=json_dataset("wikipedia.jsonl"),
    -        solver=wikipedia_search(),
    -        scorer=model_graded_fact(),
    -    )
    +
    @task
    +def wikipedia() -> Task:
    +    return Task(
    +        dataset=json_dataset("wikipedia.jsonl"),
    +        solver=wikipedia_search(),
    +        scorer=model_graded_fact(),
    +    )

    The full source code for this example can be found in the Inspect GitHub repo at examples/langchain.

    @@ -716,108 +781,108 @@

    Sandboxing

    Example: File Listing

    Let’s take a look at a simple example to illustrate. First, we’ll define a list_files() tool. This tool needs access to the ls command, which it gets by calling the sandbox() function to obtain the SandboxEnvironment instance for the currently executing Sample:

    -
    from inspect_ai.tool import ToolError, tool
    -from inspect_ai.util import sandbox
    -
    -@tool
    -def list_files():
    -    async def execute(dir: str):
    -        """List the files in a directory.
    -
    -        Args:
    -            dir (str): Directory
    -
    -        Returns:
    -            File listing of the directory
    -        """
    -        result = await sandbox().exec(["ls", dir])
    -        if result.success:
    -            return result.stdout
    -        else:
    -            raise ToolError(result.stderr)
    -
    -    return execute
    +
    from inspect_ai.tool import ToolError, tool
    +from inspect_ai.util import sandbox
    +
    +@tool
    +def list_files():
    +    async def execute(dir: str):
    +        """List the files in a directory.
    +
    +        Args:
    +            dir (str): Directory
    +
    +        Returns:
    +            File listing of the directory
    +        """
    +        result = await sandbox().exec(["ls", dir])
    +        if result.success:
    +            return result.stdout
    +        else:
    +            raise ToolError(result.stderr)
    +
    +    return execute

    The exec() function is used to list the directory contents. Note that it’s not immediately clear where or how exec() is implemented (that will be described shortly!).

    Here’s an evaluation that makes use of this tool:

    -
    from inspect_ai import task, Task
    -from inspect_ai.dataset import Sample
    -from inspect_ai.scorer import includes
    -from inspect_ai.solver import generate, use_tools
    -
    -dataset = [
    -    Sample(
    -        input='Is there a file named "bar.txt" ' 
    -               + 'in the current directory?',
    -        target="Yes",
    -        files={"bar.txt": "hello"},
    -    )
    -]
    -
    -@task
    -def file_probe():
    -    return Task(
    -        dataset=dataset,
    -        solver=[
    -            use_tools([list_files()]), 
    -            generate()
    -        ],
    -        sandbox="docker",
    -        scorer=includes(),
    -    )
    +
    from inspect_ai import task, Task
    +from inspect_ai.dataset import Sample
    +from inspect_ai.scorer import includes
    +from inspect_ai.solver import generate, use_tools
    +
    +dataset = [
    +    Sample(
    +        input='Is there a file named "bar.txt" ' 
    +               + 'in the current directory?',
    +        target="Yes",
    +        files={"bar.txt": "hello"},
    +    )
    +]
    +
    +@task
    +def file_probe():
    +    return Task(
    +        dataset=dataset,
    +        solver=[
    +            use_tools([list_files()]), 
    +            generate()
    +        ],
    +        sandbox="docker",
    +        scorer=includes(),
    +    )

    We’ve included sandbox="docker" to indicate that sandbox environment operations should be executed in a Docker container. Specifying a sandbox environment (either at the task or evaluation level) is required if your tools call the sandbox() function.
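    For example, assuming eval() accepts a sandbox argument (it is shown accepting sandbox_cleanup later on this page), the sandbox could instead be specified at the evaluation level; the file name here is hypothetical:

    eval("file_probe.py", sandbox="docker")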

    Note that files are specified as part of the Sample. Files can be specified inline using plain text (as depicted above), inline using a base64-encoded data URI, or as a path to a file or remote resource (e.g. S3 bucket). Relative file paths are resolved according to the location of the underlying dataset file.
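    For illustration, a hypothetical sample might mix all three forms (the file names and contents below are made up):

    Sample(
        input='Summarise the files in the current directory.',
        target="Yes",
        files={
            "notes.txt": "inline plain text contents",
            "logo.png": "data:image/png;base64,iVBORw0KGgo=",
            "report.csv": "files/report.csv",  # resolved relative to the dataset file
        },
    )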

    Environment Interface

    The following instance methods are available to tools that need to interact with a SandboxEnvironment:

    -
    class SandboxEnvironment:
    -   
    -    async def exec(
    -        self,
    -        cmd: list[str],
    -        input: str | bytes | None = None,
    -        cwd: str | None = None,
    -        env: dict[str, str] = {},
    -        user: str | None = None,
    -        timeout: int | None = None,
    -    ) -> ExecResult[str]:
    -        """
    -        Raises:
    -          TimeoutError: If the specified `timeout` expires.
    -          UnicodeDecodeError: If an error occurs while
    -            decoding the command output.
    -          PermissionError: If the user does not have
    -            permission to execute the command.
    -        """
    -        ...
    -
    -    async def write_file(
    -        self, file: str, contents: str | bytes
    -    ) -> None:
    -        """
    -        Raises:
    -          PermissionError: If the user does not have
    -            permission to write to the specified path.
    -          IsADirectoryError: If the file exists already and 
    -            is a directory.
    -        """
    -        ...
    -
    -    async def read_file(
    -        self, file: str, text: bool = True
    -    ) -> Union[str | bytes]:
    -        """
    -        Raises:
    -          FileNotFoundError: If the file does not exist.
    -          UnicodeDecodeError: If an encoding error occurs 
    -            while reading the file.
    -            (only applicable when `text = True`)
    -          PermissionError: If the user does not have
    -            permission to read from the specified path.
    -          IsADirectoryError: If the file is a directory.
    -        """
    -        ...
    +
    class SandboxEnvironment:
    +   
    +    async def exec(
    +        self,
    +        cmd: list[str],
    +        input: str | bytes | None = None,
    +        cwd: str | None = None,
    +        env: dict[str, str] = {},
    +        user: str | None = None,
    +        timeout: int | None = None,
    +    ) -> ExecResult[str]:
    +        """
    +        Raises:
    +          TimeoutError: If the specified `timeout` expires.
    +          UnicodeDecodeError: If an error occurs while
    +            decoding the command output.
    +          PermissionError: If the user does not have
    +            permission to execute the command.
    +        """
    +        ...
    +
    +    async def write_file(
    +        self, file: str, contents: str | bytes
    +    ) -> None:
    +        """
    +        Raises:
    +          PermissionError: If the user does not have
    +            permission to write to the specified path.
    +          IsADirectoryError: If the file exists already and 
    +            is a directory.
    +        """
    +        ...
    +
    +    async def read_file(
    +        self, file: str, text: bool = True
    +    ) -> Union[str | bytes]:
    +        """
    +        Raises:
    +          FileNotFoundError: If the file does not exist.
    +          UnicodeDecodeError: If an encoding error occurs 
    +            while reading the file.
    +            (only applicable when `text = True`)
    +          PermissionError: If the user does not have
    +            permission to read from the specified path.
    +          IsADirectoryError: If the file is a directory.
    +        """
    +        ...

    Note that write_file() automatically creates parent directories as required if they don’t exist.
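    For example, writing to a nested (hypothetical) path creates the intermediate directories as a side effect:

    await sandbox().write_file("outputs/run1/results.json", "{}")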

    For each method there is a documented set of errors that are raised: these are expected errors, and they can either be caught by tools or allowed to propagate, in which case they will be reported to the model for potential recovery. In addition, unexpected errors may occur (e.g. a networking error connecting to a remote container): these errors are not reported to the model and fail the Sample with an error state.
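    For instance, a tool body might catch an expected error itself and convert it into a ToolError for the model (the file name here is hypothetical):

    try:
        contents = await sandbox().read_file("flag.txt")
    except FileNotFoundError:
        raise ToolError("flag.txt does not exist")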

    The sandbox is also available to custom scorers.
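    For example, a custom scorer might examine container state after the agent has finished. This is an illustrative sketch only; the scorer name, file path, and exact Score construction are assumptions rather than something taken from this page:

    from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
    from inspect_ai.solver import TaskState
    from inspect_ai.util import sandbox

    @scorer(metrics=[accuracy()])
    def flag_recovered():
        async def score(state: TaskState, target: Target) -> Score:
            # read the (hypothetical) flag file from the default sandbox
            result = await sandbox().exec(["cat", "/tmp/flag.txt"])
            correct = result.success and target.text in result.stdout
            return Score(value=CORRECT if correct else INCORRECT)
        return score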

    @@ -845,17 +910,17 @@

    Environment Binding

    Sandbox environment definitions can be bound at the Sample, Task, or eval() level. Binding precedence goes from eval(), to Task, to Sample; however, sandbox config files defined on the Sample always take precedence when the sandbox type for the Sample is the same as that of the enclosing Task or eval().

    Here is a Task that defines a sandbox:

    -
    Task(
    -    dataset=dataset,
    -    plan=[
    -        use_tools([read_file(), list_files()]),
    -        generate()
    -    ],
    -    scorer=match(),
    -    sandbox="docker"
    -)
    +
    Task(
    +    dataset=dataset,
    +    plan=[
    +        use_tools([read_file(), list_files()]),
    +        generate()
    +    ],
    +    scorer=match(),
    +    sandbox="docker"
    +)

    By default, any Dockerfile and/or compose.yaml file within the task directory will be automatically discovered and used. If your compose file has a different name then you can provide an override specification as follows:

    -
    sandbox=("docker", "attacker-compose.yaml")
    +
    sandbox=("docker", "attacker-compose.yaml")

    The configuration file added to the sandbox spec should always be a compose file (rather than a Dockerfile, which is always discovered automatically).

    @@ -875,9 +940,9 @@

    Files

    Script

    If there is a Sample setup script, it will be executed within the default sandbox environment after any Sample files are copied into the environment. The setup field can be either the script contents, a file path containing the script, or a base64-encoded Data URL.
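    For example, a hypothetical sample might reference a script file that sits alongside the dataset:

    Sample(
        input="What is the value stored in /challenge/value.txt?",
        target="42",
        setup="setup.sh",  # file path; the script runs in the default sandbox
    )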

    The setup script is interpreted as a bash script by default; however, you can have it executed by another interpreter using a shebang comment. For example, this will be executed as a Python script:

    -
    #!/usr/bin/env python3
    -
    -print('hello from python')
    +
    #!/usr/bin/env python3
    +
    +print('hello from python')
    @@ -913,14 +978,14 @@

    Docker Configurat
    compose.yaml
    -
    services:
    -  default: 
    -    build: .
    -    init: true
    -    command: tail -f /dev/null
    -    cpus: 1.0
    -    mem_limit: 0.5gb
    -    network_mode: none
    +
    services:
    +  default: 
    +    build: .
    +    init: true
    +    command: tail -f /dev/null
    +    cpus: 1.0
    +    mem_limit: 0.5gb
    +    network_mode: none

    The init: true entry enables the container to respond to shutdown requests. The command is provided to prevent the container from exiting after it starts.

    Here is what a simple compose.yaml would look like for a local pre-built image named ctf-agent-environment (resource and network limits excluded for brevity):

    @@ -928,34 +993,34 @@

    Docker Configurat
    compose.yaml
    -
    services:
    -  default: 
    -    image: ctf-agent-environment
    -    x-local: true
    -    init: true
    -    command: tail -f /dev/null
    +
    services:
    +  default: 
    +    image: ctf-agent-environment
    +    x-local: true
    +    init: true
    +    command: tail -f /dev/null

    The ctf-agent-environment is not an image that exists on a remote registry, so we add x-local: true to indicate that it should not be pulled. If local images are tagged, they also will not be pulled by default (so x-local: true is not required). For example:

    compose.yaml
    -
    services:
    -  default: 
    -    image: ctf-agent-environment:1.0.0
    -    init: true
    -    command: tail -f /dev/null
    +
    services:
    +  default: 
    +    image: ctf-agent-environment:1.0.0
    +    init: true
    +    command: tail -f /dev/null

    If we are using an image from a remote registry we similarly don’t need to include x-local:

    compose.yaml
    -
    services:
    -  default:
    -    image: python:3.12-bookworm
    -    init: true
    -    command: tail -f /dev/null
    +
    services:
    +  default:
    +    image: python:3.12-bookworm
    +    init: true
    +    command: tail -f /dev/null

    See the Docker Compose documentation for information on all available container options.

    @@ -965,23 +1030,23 @@

    Multiple Environment
    compose.yaml
    -
    services:
    -  default:
    -    image: ctf-agent-environment
    -    x-local: true
    -    init: true
    -    cpus: 1.0
    -    mem_limit: 0.5gb
    -  victim:
    -    image: ctf-victim-environment
    -    x-local: true
    -    init: true
    -    cpus: 1.0
    -    mem_limit: 1gb
    +
    services:
    +  default:
    +    image: ctf-agent-environment
    +    x-local: true
    +    init: true
    +    cpus: 1.0
    +    mem_limit: 0.5gb
    +  victim:
    +    image: ctf-victim-environment
    +    x-local: true
    +    init: true
    +    cpus: 1.0
    +    mem_limit: 1gb

    The first environment listed is the “default” environment, and can be accessed from within a tool with a normal call to sandbox(). Other environments would be accessed by name, for example:

    -
    sandbox()          # default sandbox environment
    -sandbox("victim")  # named sandbox environment
    +
    sandbox()          # default sandbox environment
    +sandbox("victim")  # named sandbox environment
    @@ -999,53 +1064,53 @@

    Multiple Environment

    Infrastructure

    Note that in many cases you’ll want to provision additional infrastructure (e.g. other hosts or volumes). For example, here we define an additional container (“writer”) as well as a volume shared between the default container and the writer container:

    -
    services:
    -  default: 
    -    image: ctf-agent-environment
    -    x-local: true
    -    init: true
    -    volumes:
    -      - ctf-challenge-volume:/shared-data
    -    
    -  writer:
    -    image: ctf-challenge-writer
    -    x-local: true
    -    init: true
    -    volumes:
    -      - ctf-challenge-volume:/shared-data
    -volumes:
    -  ctf-challenge-volume:
    +
    services:
    +  default: 
    +    image: ctf-agent-environment
    +    x-local: true
    +    init: true
    +    volumes:
    +      - ctf-challenge-volume:/shared-data
    +    
    +  writer:
    +    image: ctf-challenge-writer
    +    x-local: true
    +    init: true
    +    volumes:
    +      - ctf-challenge-volume:/shared-data
    +volumes:
    +  ctf-challenge-volume:

    See the documentation on Docker Compose files for information on their full schema and feature set.

    Sample Metadata

    You might want to interpolate Sample metadata into your Docker compose files. You can do this using the standard compose environment variable syntax, where any metadata in the Sample is made available with a SAMPLE_METADATA_ prefix. For example, you might have a per-sample memory limit (with a default value of 0.5gb if unspecified):

    -
    services:
    -  default:
    -    image: ctf-agent-environment
    -    x-local: true
    -    init: true
    -    cpus: 1.0
    -    mem_limit: ${SAMPLE_METADATA_MEMORY_LIMIT-0.5gb}
    +
    services:
    +  default:
    +    image: ctf-agent-environment
    +    x-local: true
    +    init: true
    +    cpus: 1.0
    +    mem_limit: ${SAMPLE_METADATA_MEMORY_LIMIT-0.5gb}

    Note the - suffix that provides the default value of 0.5gb. This is important to include so that a default value is available when the compose file is read without the context of a Sample (for example, when pulling or building images at startup).
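    On the dataset side, the corresponding metadata might be declared per sample like this (an illustrative sketch, assuming metadata keys are upper-cased when mapped to SAMPLE_METADATA_ variables):

    Sample(
        input="Explore the filesystem and report what you find.",
        target="report",
        metadata={"memory_limit": "1gb"},
    )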

    Environment Cleanup

    When a task is completed, Inspect will automatically clean up resources associated with the sandbox environment (e.g. containers, images, and networks). If for any reason resources are not cleaned up (e.g. if the cleanup itself is interrupted via Ctrl+C), you can globally clean up all environments with the inspect sandbox cleanup command. For example, here we clean up all environments associated with the docker provider:

    -
    $ inspect sandbox cleanup docker
    +
    $ inspect sandbox cleanup docker

    In some cases you may prefer not to clean up environments. For example, you might want to examine their state interactively from the shell in order to debug an agent. Use the --no-sandbox-cleanup argument to do this:

    -
    $ inspect eval ctf.py --no-sandbox-cleanup
    +
    $ inspect eval ctf.py --no-sandbox-cleanup

    You can also do this when using eval():

    -
    eval("ctf.py", sandbox_cleanup = False)
    +
    eval("ctf.py", sandbox_cleanup = False)

    When you do this, you’ll see a list of sandbox containers printed out, which includes the ID of each container. You can then use this ID to get a shell inside one of the containers:

    -
    docker exec -it inspect-intercode_ctf-ipg9tbviycpvlgwja5anyvn-default-1 bash
    +
    docker exec -it inspect-intercode_ctf-ipg9tbviycpvlgwja5anyvn-default-1 bash

    When you no longer need the environments, you can clean them up either all at once or individually:

    -
    # cleanup all environments
    -inspect sandbox cleanup docker
    -
    -# cleanup single environment
    -inspect sandbox cleanup docker inspect-intercode_ctf-ipg9tbviycpvlgwja5anyvn
    +
    # cleanup all environments
    +inspect sandbox cleanup docker
    +
    +# cleanup single environment
    +inspect sandbox cleanup docker inspect-intercode_ctf-ipg9tbviycpvlgwja5anyvn

    Resource Management

    @@ -1059,13 +1124,13 @@

    Running Containers

    compose.yaml
    -
    services:
    -  default: 
    -    image: ctf-agent-environment
    -    x-local: true
    -    command: tail -f /dev/null
    -    cpus: 1.0
    -    mem_limit: 0.5gb
    +
    services:
    +  default: 
    +    image: ctf-agent-environment
    +    x-local: true
    +    command: tail -f /dev/null
    +    cpus: 1.0
    +    mem_limit: 0.5gb
    @@ -1077,7 +1142,7 @@

    Concurrent Execution<

    Troubleshooting

    You can view more detailed logging around the creation and use of sandbox environments by using the sandbox log level. For example:

    -
    $ inspect eval ctf.py --log-level sandbox
    +
    $ inspect eval ctf.py --log-level sandbox

    The sandbox log level is just above warning (so it will not show http or debug level messages).

diff --git a/eval-logs.html b/eval-logs.html
index 1a06dbc6b..7bb43a8f2 100644
--- a/eval-logs.html
+++ b/eval-logs.html
@@ -1102,7 +1102,7 @@

    Reading Logs

    -