sync 01-07-24
aisi-inspect committed Jul 1, 2024
1 parent b860976 commit 2d7561f
Showing 32 changed files with 1,002 additions and 678 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -2,7 +2,12 @@

## Unreleased

- Evaluate multiple models in parallel by passing a list of models to `eval()`.
- Add `api_key` to `get_model()` for explicitly specifying an API key for a model.
- Improved handling of very large (> 100MB) log files in Inspect View.
- Use `network_mode: none` for disabling networking by default in Docker tool environments.
- Allow tool environment providers to specify a default `max_samples` (set to 25 for the Docker provider).
- Prevent concurrent calls to `eval_async()` (unsafe because of need to change directories for tasks). Parallel task evaluation will instead be implemented as a top-level feature of `eval()` and `eval_async()`.
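The `eval_async()` guard noted above can be sketched as follows (a hypothetical illustration, not Inspect's actual implementation): a process-wide flag makes overlapping calls fail fast, since each call may change the working directory and that is process-global state.

``` python
import asyncio

# Hypothetical sketch of the guard described above (not Inspect's actual
# implementation): a process-wide flag makes overlapping eval_async() calls
# fail fast, since each call may chdir() into a task directory and the
# working directory is process-global state.
_eval_running = False

async def eval_async(task):
    global _eval_running
    if _eval_running:
        raise RuntimeError("eval_async() cannot be called concurrently")
    _eval_running = True
    try:
        return await task()  # run the evaluation
    finally:
        _eval_running = False
```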

## v0.3.17 (25 June 2024)

2 changes: 1 addition & 1 deletion docs/_format/pre-render.sh
@@ -6,4 +6,4 @@ if [ -n "${QUARTO_PROJECT_RENDER_ALL}" ]; then
(echo; echo) >> ../examples.qmd
for f in security_guide.qmd hellaswag.qmd theory_of_mind.qmd mathematics.qmd biology_qa.qmd arc.qmd tool_use.qmd gsm8k.qmd footer.qmd; do (cat "${f}"; echo; echo; echo) >> ../examples.qmd; done
cd ..
fi
fi
4 changes: 2 additions & 2 deletions docs/_quarto.yml
@@ -68,9 +68,9 @@ book:
- part: "Advanced"
chapters:
- caching.qmd
- parallelism.qmd
- eval-logs.qmd
- eval-suites.qmd
- eval-tuning.qmd
- extensions.qmd

toc-depth: 2
@@ -96,4 +96,4 @@ format:
# date: today

execute:
enabled: false
enabled: false
4 changes: 3 additions & 1 deletion docs/_sample-preservation.md
@@ -16,4 +16,6 @@ If dataset shuffling is important to your evaluation and you want to preserve sa

Another consideration is `max_samples`, which is the maximum number of samples to run concurrently within a task. Larger numbers of concurrent samples will result in higher throughput, but will also result in completed samples being written less frequently to the log file, and consequently fewer total recoverable samples in the case of an interrupted task.

By default, Inspect sets the value of `max_samples` to `max_connections + 1`, ensuring that the model API is always fully saturated (note that it would rarely make sense to set it _lower_ than `max_connections`). The default `max_connections` is 10, which will typically result in samples being written to the log frequently. On the other hand, setting a very large `max_connections` (e.g. 100 `max_connections` for a dataset with 100 samples) may result in very few recoverable samples in the case of an interruption.
By default, Inspect sets the value of `max_samples` to `max_connections + 1`, ensuring that the model API is always fully saturated (note that it would rarely make sense to set it _lower_ than `max_connections`). The default `max_connections` is 10, which will typically result in samples being written to the log frequently. On the other hand, setting a very large `max_connections` (e.g. 100 `max_connections` for a dataset with 100 samples) may result in very few recoverable samples in the case of an interruption.

Note also that when using [Tool Environments](#sec-tool-environments), the tool environment provider may place an additional cap on the default `max_samples` (for example, the Docker provider limits the default `max_samples` to no more than 25).
2 changes: 1 addition & 1 deletion docs/agents.qmd
@@ -579,7 +579,7 @@ eval("ctf.py", toolenv_cleanup = False)

When you do this, you'll see something like the following printed out at the end of the eval:

![](images/toolenv-no-cleanup.png){.border}
![](images/toolenv-no-cleanup.png){.border fig-alt="A printed list of yet to be cleaned up Docker tool environments (including the container id and cleanup command for each one)"}

You then might use this command to get a shell inside one of the containers:

14 changes: 7 additions & 7 deletions docs/datasets.qmd
Expand Up @@ -2,7 +2,7 @@

## Overview

Inspect has native support for reading datasets in the CSV, JSON, and JSON Lines formats, as well as from [Hugging Face](#sec-hugging-face-datasets). In addition, the core dataset interface for the evaluation pipeline is flexible enough to accept data read from just about any source.
Inspect has native support for reading datasets in the CSV, JSON, and JSON Lines formats, as well as from [Hugging Face](#sec-hugging-face-datasets). In addition, the core dataset interface for the evaluation pipeline is flexible enough to accept data read from just about any source (see the [Custom Reader](#sec-custom-reader) section below for details).

If your data is already in a format amenable for direct reading as an Inspect `Sample`, reading a dataset is as simple as this:

@@ -216,22 +216,22 @@ ChatMessageUser(content = [
Note that image input is currently only supported for OpenAI vision models (e.g. [gpt-4-vision-preview](https://platform.openai.com/docs/guides/vision)), Google Gemini vision models (e.g. [gemini-pro-vision](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-pro-vision)), and Anthropic Claude 3 models.
:::

## Custom Reader
## Custom Reader {#sec-custom-reader}

You are not restricted to the built-in dataset functions for reading samples. Since the `dataset` field of the `Task` class takes either a `Dataset` or a sequence of `Sample`s, the following is also valid:
You are not restricted to the built-in dataset functions for reading samples. You can also construct a `MemoryDataset`, and pass that to a task. For example:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, system_message

dataset=[
dataset=MemoryDataset([
Sample(
input="What cookie attributes should I use for strong security?",
target="secure samesite and httponly",
)
]
])

@task
def security_guide():
@@ -242,4 +242,4 @@ def security_guide():
)
```

So if the built in dataset functions don't meet your needs, you can create a custom function that yields a list of `Sample` instances and pass those directly to your `Task`.
So if the built-in dataset functions don't meet your needs, you can create a custom function that returns a `MemoryDataset` and pass that directly to your `Task`.
15 changes: 12 additions & 3 deletions docs/extensions.qmd
@@ -53,12 +53,14 @@ For example, if your package was named `inspect_package` and your model provider

::: {.panel-tabset group="entry-points"}
## Setuptools

``` toml
[project.entry-points.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"
```

## Poetry

``` toml
[tool.poetry.plugins.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"
@@ -170,12 +172,15 @@ class PodmanToolEnvironment(ToolEnvironment):
The class methods take care of various stages of initialisation, setup, and teardown:

| Method | Lifecycle | Purpose |
|------------------|------------------|------------------------------------|
|-------------------|-------------------|----------------------------------|
| `task_init()` | Called at the beginning of each `Task`. | Expensive initialisation operations (e.g. pulling or building images) |
| `sample_init()` | Called at the beginning of each `Sample`. | Create `ToolEnvironment` instances for the sample. |
| `sample_cleanup()` | Called at the end of each `Sample` | Cleanup `ToolEnvironment` instances for the sample. |
| `task_cleanup()` | Called at the end of each `Task`. | Last chance handler for any resources not yet cleaned up (see also discussion below). |
| `cli_cleanup()` | Called via `inspect toolenv cleanup` | CLI invoked manual cleanup of resources created by this `ToolEnvironment`. |
| `max_samples()` | Called at startup | Provide a default `max_samples` (used to cap the default, explicit `max_samples` will override this). |

In the case of parallel execution of a group of tasks that share a working directory and tool environment, the `task_init()` and `task_cleanup()` functions may be called once for the entire group as a performance optimisation.

The `task_cleanup()` has a number of important functions:

@@ -195,9 +200,9 @@ The `task_cleanup()` function will typically print out the information required

The `ToolEnvironment` instance methods provide access to process execution and file input/output within the environment. A few notes on implementing these methods:

1. The `exec()` method currently only handles text output. If a call results in binary output then a `UnicodeDecodeError` will be raised. Tool environments should catch this and raise a `ToolError`.
1. The `exec()` method currently only handles text output. If a call results in binary output then a `UnicodeDecodeError` will be raised. Tool environments should catch this and raise a `ToolError`.

2. The `read_file()` method should raise a `FileNotFoundError` if the specified `file` does not exist in the tool environment, as tools calling `read_file()` will often want to catch the `FileNotFoundError` and re-throw a `ToolError` (since models will frequently attempt to read files that do not exist).
2. The `read_file()` method should raise a `FileNotFoundError` if the specified `file` does not exist in the tool environment, as tools calling `read_file()` will often want to catch the `FileNotFoundError` and re-throw a `ToolError` (since models will frequently attempt to read files that do not exist).

The best way to learn about writing tool environments is to look at the source code for the built in environments, [LocalToolEnvironment](https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/src/inspect_ai/solver/_tool/environment/local.py) and [DockerToolEnvironment](https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/src/inspect_ai/solver/_tool/environment/docker/docker.py).

@@ -209,12 +214,14 @@ For example, if your package was named `inspect_package` and your tool environme

::: {.panel-tabset group="entry-points"}
## Setuptools

``` toml
[project.entry-points.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"
```

## Poetry

``` toml
[tool.poetry.plugins.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"
@@ -299,12 +306,14 @@ As with Model APIs and Tool Environments, fsspec filesystems should be registere

::: {.panel-tabset group="entry-points"}
## Setuptools

``` toml
[project.entry-points."fsspec.specs"]
myfs = "inspect_package:MyFs"
```

## Poetry

``` toml
[tool.poetry.plugins."fsspec.specs"]
myfs = "inspect_package:MyFs"
Binary file added docs/images/inspect-multiple-models.png
4 changes: 2 additions & 2 deletions docs/index.qmd
@@ -179,9 +179,9 @@ These sections discuss more advanced features and workflow. You don't need to re

- [Caching](#sec-caching) enables you to cache model output to reduce the number of API calls made, saving both time and expense.

- [Eval Logs](#sec-eval-logs) explores how to get the most out of evaluation logs for developing, debugging, and analyzing evaluations.
- [Parallelism](#sec-parallelism) delves into how to obtain maximum performance for evaluations. Inspect uses a highly parallel async architecture---here we cover how to tune this parallelism (e.g. to stay under API rate limits or to not overburden local compute) for optimal throughput.

- [Eval Tuning](#sec-eval-tuning) delves into how to obtain maximum performance for evaluations. Inspect uses a highly parallel async architecture---here we cover how to tune this parallelism (e.g. to stay under API rate limits or to not overburden local compute) for optimal throughput.
- [Eval Logs](#sec-eval-logs) explores how to get the most out of evaluation logs for developing, debugging, and analyzing evaluations.

- [Eval Suites](#sec-eval-suites) covers Inspect's features for describing, running, and analysing larger sets of evaluation tasks.

2 changes: 1 addition & 1 deletion docs/models.qmd
@@ -118,7 +118,7 @@ Use `inspect eval --help` to learn about all of the available generation config

Inspect uses an asynchronous architecture to run task samples in parallel. If your model provider can handle 100 concurrent connections, then Inspect can utilise all of those connections to get the highest possible throughput. The limiting factor on parallelism is therefore not typically local parallelism (e.g. number of cores) but rather what the underlying rate limit is for your interface to the provider.

If you are experiencing rate-limit errors you will need to experiment with the `max_connections` option to find the optimal value that keeps you under the rate limit (the section on [Eval Tuning](eval-tuning.qmd) includes additional documentation on how to do this). Note that the next section describes how you can set a model-provider specific value for `max_connections` as well as other generation options.
If you are experiencing rate-limit errors you will need to experiment with the `max_connections` option to find the optimal value that keeps you under the rate limit (the section on [Parallelism](parallelism.qmd) includes additional documentation on how to do this). Note that the next section describes how you can set a model-provider specific value for `max_connections` as well as other generation options.
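The role `max_connections` plays can be sketched with a semaphore (hypothetical code, not Inspect's actual implementation):

``` python
import asyncio

# Hypothetical sketch (not Inspect's actual implementation): a semaphore
# bounds in-flight model API calls the way max_connections does, so
# throughput is governed by the provider's rate limit rather than by
# local parallelism such as core count.
async def run_samples(samples, max_connections=10):
    limit = asyncio.Semaphore(max_connections)

    async def call_model(sample):
        async with limit:  # at most max_connections concurrent requests
            await asyncio.sleep(0)  # stand-in for a model API request
            return f"output for {sample}"

    return await asyncio.gather(*(call_model(s) for s in samples))
```

Raising `max_connections` widens the semaphore; lowering it is the lever for staying under a provider's rate limit.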

### Model Specific Configuration

