sync 01-07-24
aisi-inspect committed Jul 1, 2024
1 parent b860976 commit 2d7561f
Showing 32 changed files with 1,002 additions and 678 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -2,7 +2,12 @@

## Unreleased

- Evaluate multiple models in parallel by passing a list of models to `eval()`.
- Add `api_key` to `get_model()` for explicitly specifying an API key for a model.
- Improved handling of very large (> 100MB) log files in Inspect View.
- Use `network_mode: none` for disabling networking by default in Docker tool environments.
- Allow tool environment providers to specify a default `max_samples` (set to 25 for the Docker provider).
- Prevent concurrent calls to `eval_async()` (unsafe because of need to change directories for tasks). Parallel task evaluation will instead be implemented as a top-level feature of `eval()` and `eval_async()`.
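The `eval_async()` guard noted above can be sketched as follows (a hypothetical illustration, not Inspect's actual implementation): a process-wide flag makes overlapping calls fail fast, since each call may change the working directory and that is process-global state.

``` python
import asyncio

# Hypothetical sketch of the guard described above (not Inspect's actual
# implementation): a process-wide flag makes overlapping eval_async() calls
# fail fast, since each call may chdir() into a task directory and the
# working directory is process-global state.
_eval_running = False

async def eval_async(task):
    global _eval_running
    if _eval_running:
        raise RuntimeError("eval_async() cannot be called concurrently")
    _eval_running = True
    try:
        return await task()  # run the evaluation
    finally:
        _eval_running = False
```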

## v0.3.17 (25 June 2024)

2 changes: 1 addition & 1 deletion docs/_format/pre-render.sh
@@ -6,4 +6,4 @@ if [ -n "${QUARTO_PROJECT_RENDER_ALL}" ]; then
(echo; echo) >> ../examples.qmd
for f in security_guide.qmd hellaswag.qmd theory_of_mind.qmd mathematics.qmd biology_qa.qmd arc.qmd tool_use.qmd gsm8k.qmd footer.qmd; do (cat "${f}"; echo; echo; echo) >> ../examples.qmd; done
cd ..
fi
fi
4 changes: 2 additions & 2 deletions docs/_quarto.yml
@@ -68,9 +68,9 @@ book:
- part: "Advanced"
chapters:
- caching.qmd
- parallelism.qmd
- eval-logs.qmd
- eval-suites.qmd
- eval-tuning.qmd
- extensions.qmd

toc-depth: 2
@@ -96,4 +96,4 @@ format:
# date: today

execute:
enabled: false
enabled: false
4 changes: 3 additions & 1 deletion docs/_sample-preservation.md
@@ -16,4 +16,6 @@ If dataset shuffling is important to your evaluation and you want to preserve sa

Another consideration is `max_samples`, which is the maximum number of samples to run concurrently within a task. Larger numbers of concurrent samples will result in higher throughput, but will also result in completed samples being written less frequently to the log file, and consequently fewer total recoverable samples in the case of an interrupted task.

By default, Inspect sets the value of `max_samples` to `max_connections + 1`, ensuring that the model API is always fully saturated (note that it would rarely make sense to set it _lower_ than `max_connections`). The default `max_connections` is 10, which will typically result in samples being written to the log frequently. On the other hand, setting a very large `max_connections` (e.g. 100 `max_connections` for a dataset with 100 samples) may result in very few recoverable samples in the case of an interruption.
By default, Inspect sets the value of `max_samples` to `max_connections + 1`, ensuring that the model API is always fully saturated (note that it would rarely make sense to set it _lower_ than `max_connections`). The default `max_connections` is 10, which will typically result in samples being written to the log frequently. On the other hand, setting a very large `max_connections` (e.g. 100 `max_connections` for a dataset with 100 samples) may result in very few recoverable samples in the case of an interruption.

Note also that when using [Tool Environments](#sec-tool-environments), the tool environment provider may place an additional cap on the default `max_samples` (for example, the Docker provider limits the default `max_samples` to no more than 25).
2 changes: 1 addition & 1 deletion docs/agents.qmd
@@ -579,7 +579,7 @@ eval("ctf.py", toolenv_cleanup = False)

When you do this, you'll see something like the following printed out at the end of the eval:

![](images/toolenv-no-cleanup.png){.border}
![](images/toolenv-no-cleanup.png){.border fig-alt="A printed list of yet to be cleaned up Docker tool environments (including the container id and cleanup command for each one)"}

You then might use this command to get a shell inside one of the containers:

14 changes: 7 additions & 7 deletions docs/datasets.qmd
Expand Up @@ -2,7 +2,7 @@

## Overview

Inspect has native support for reading datasets in the CSV, JSON, and JSON Lines formats, as well as from [Hugging Face](#sec-hugging-face-datasets). In addition, the core dataset interface for the evaluation pipeline is flexible enough to accept data read from just about any source.
Inspect has native support for reading datasets in the CSV, JSON, and JSON Lines formats, as well as from [Hugging Face](#sec-hugging-face-datasets). In addition, the core dataset interface for the evaluation pipeline is flexible enough to accept data read from just about any source (see the [Custom Reader](#sec-custom-reader) section below for details).

If your data is already in a format amenable for direct reading as an Inspect `Sample`, reading a dataset is as simple as this:

@@ -216,22 +216,22 @@ ChatMessageUser(content = [
Note that image input is currently only supported for OpenAI vision models (e.g. [gpt-4-vision-preview](https://platform.openai.com/docs/guides/vision)), Google Gemini vision models (e.g. [gemini-pro-vision](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-pro-vision)), and Anthropic Claude 3 models.
:::

## Custom Reader
## Custom Reader {#sec-custom-reader}

You are not restricted to the built-in dataset functions for reading samples. Since the `dataset` field of the `Task` class takes either a `Dataset` or a sequence of `Sample`s, the following is also valid:
You are not restricted to the built-in dataset functions for reading samples. You can also construct a `MemoryDataset`, and pass that to a task. For example:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, system_message

dataset=[
dataset=MemoryDataset([
Sample(
input="What cookie attributes should I use for strong security?",
target="secure samesite and httponly",
)
]
])

@task
def security_guide():
@@ -242,4 +242,4 @@ def security_guide():
)
```

So if the built in dataset functions don't meet your needs, you can create a custom function that yields a list of `Sample` instances and pass those directly to your `Task`.
So if the built-in dataset functions don't meet your needs, you can create a custom function that returns a `MemoryDataset` and pass that directly to your `Task`.
15 changes: 12 additions & 3 deletions docs/extensions.qmd
@@ -53,12 +53,14 @@ For example, if your package was named `inspect_package` and your model provider

::: {.panel-tabset group="entry-points"}
## Setuptools

``` toml
[project.entry-points.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"
```

## Poetry

``` toml
[tool.poetry.plugins.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"
@@ -170,12 +172,15 @@ class PodmanToolEnvironment(ToolEnvironment):
The class methods take care of various stages of initialisation, setup, and teardown:

| Method | Lifecycle | Purpose |
|------------------|------------------|------------------------------------|
|-------------------|-------------------|----------------------------------|
| `task_init()` | Called at the beginning of each `Task`. | Expensive initialisation operations (e.g. pulling or building images) |
| `sample_init()` | Called at the beginning of each `Sample`. | Create `ToolEnvironment` instances for the sample. |
| `sample_cleanup()` | Called at the end of each `Sample` | Cleanup `ToolEnvironment` instances for the sample. |
| `task_cleanup()` | Called at the end of each `Task`. | Last chance handler for any resources not yet cleaned up (see also discussion below). |
| `cli_cleanup()` | Called via `inspect toolenv cleanup` | CLI invoked manual cleanup of resources created by this `ToolEnvironment`. |
| `max_samples()` | Called at startup | Provide a default `max_samples` (used to cap the default, explicit `max_samples` will override this). |

In the case of parallel execution of a group of tasks that share a working directory and tool environment, the `task_init()` and `task_cleanup()` functions may be called once for the entire group as a performance optimisation.

The `task_cleanup()` has a number of important functions:

@@ -195,9 +200,9 @@ The `task_cleanup()` function will typically print out the information required

The `ToolEnvironment` instance methods provide access to process execution and file input/output within the environment. A few notes on implementing these methods:

1. The `exec()` method currently only handles text output. If a call results in binary output then a `UnicodeDecodeError` will be raised. Tool environments should catch this and raise a `ToolError`.
1. The `exec()` method currently only handles text output. If a call results in binary output then a `UnicodeDecodeError` will be raised. Tool environments should catch this and raise a `ToolError`.

2. The `read_file()` method should raise a `FileNotFoundError` if the specified `file` does not exist in the tool environment, as tools calling `read_file()` will often want to catch the `FileNotFoundError` and re-throw a `ToolError` (since models will frequently attempt to read files that do not exist).
2. The `read_file()` method should raise a `FileNotFoundError` if the specified `file` does not exist in the tool environment, as tools calling `read_file()` will often want to catch the `FileNotFoundError` and re-throw a `ToolError` (since models will frequently attempt to read files that do not exist).

The best way to learn about writing tool environments is to look at the source code for the built in environments, [LocalToolEnvironment](https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/src/inspect_ai/solver/_tool/environment/local.py) and [DockerToolEnvironment](https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/src/inspect_ai/solver/_tool/environment/docker/docker.py).

@@ -209,12 +214,14 @@ For example, if your package was named `inspect_package` and your tool environme

::: {.panel-tabset group="entry-points"}
## Setuptools

``` toml
[project.entry-points.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"
```

## Poetry

``` toml
[tool.poetry.plugins.inspect_ai]
inspect_package = "inspect_package.inspect_extensions"
@@ -299,12 +306,14 @@ As with Model APIs and Tool Environments, fsspec filesystems should be registere

::: {.panel-tabset group="entry-points"}
## Setuptools

``` toml
[project.entry-points."fsspec.specs"]
myfs = "inspect_package:MyFs"
```

## Poetry

``` toml
[tool.poetry.plugins."fsspec.specs"]
myfs = "inspect_package:MyFs"
Binary file added docs/images/inspect-multiple-models.png
4 changes: 2 additions & 2 deletions docs/index.qmd
@@ -179,9 +179,9 @@ These sections discuss more advanced features and workflow. You don't need to re

- [Caching](#sec-caching) enables you to cache model output to reduce the number of API calls made, saving both time and expense.

- [Eval Logs](#sec-eval-logs) explores how to get the most out of evaluation logs for developing, debugging, and analyzing evaluations.
- [Parallelism](#sec-parallelism) delves into how to obtain maximum performance for evaluations. Inspect uses a highly parallel async architecture---here we cover how to tune this parallelism (e.g. to stay under API rate limits or to not overburden local compute) for optimal throughput.

- [Eval Tuning](#sec-eval-tuning) delves into how to obtain maximum performance for evaluations. Inspect uses a highly parallel async architecture---here we cover how to tune this parallelism (e.g. to stay under API rate limits or to not overburden local compute) for optimal throughput.
- [Eval Logs](#sec-eval-logs) explores how to get the most out of evaluation logs for developing, debugging, and analyzing evaluations.

- [Eval Suites](#sec-eval-suites) covers Inspect's features for describing, running, and analysing larger sets of evaluation tasks.

2 changes: 1 addition & 1 deletion docs/models.qmd
@@ -118,7 +118,7 @@ Use `inspect eval --help` to learn about all of the available generation config

Inspect uses an asynchronous architecture to run task samples in parallel. If your model provider can handle 100 concurrent connections, then Inspect can utilise all of those connections to get the highest possible throughput. The limiting factor on parallelism is therefore not typically local parallelism (e.g. number of cores) but rather what the underlying rate limit is for your interface to the provider.

If you are experiencing rate-limit errors you will need to experiment with the `max_connections` option to find the optimal value that keeps you under the rate limit (the section on [Eval Tuning](eval-tuning.qmd) includes additional documentation on how to do this). Note that the next section describes how you can set a model-provider specific value for `max_connections` as well as other generation options.
If you are experiencing rate-limit errors you will need to experiment with the `max_connections` option to find the optimal value that keeps you under the rate limit (the section on [Parallelism](parallelism.qmd) includes additional documentation on how to do this). Note that the next section describes how you can set a model-provider specific value for `max_connections` as well as other generation options.
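The role `max_connections` plays can be sketched with a semaphore (hypothetical code, not Inspect's actual implementation):

``` python
import asyncio

# Hypothetical sketch (not Inspect's actual implementation): a semaphore
# bounds in-flight model API calls the way max_connections does, so
# throughput is governed by the provider's rate limit rather than by
# local parallelism such as core count.
async def run_samples(samples, max_connections=10):
    limit = asyncio.Semaphore(max_connections)

    async def call_model(sample):
        async with limit:  # at most max_connections concurrent requests
            await asyncio.sleep(0)  # stand-in for a model API request
            return f"output for {sample}"

    return await asyncio.gather(*(call_model(s) for s in samples))
```

Raising `max_connections` widens the semaphore; lowering it is the lever for staying under a provider's rate limit.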

### Model Specific Configuration

