release v0.3.5
aisi-inspect committed May 4, 2024
1 parent 3f5bf2b commit eca65fc
Showing 18 changed files with 227 additions and 74 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,13 @@
# Changelog

## v0.3.5 (04 May 2024)

- Fix issue with logs from S3 buckets in inspect view.
- Add `sort()` method to `Dataset` (defaults to sorting by sample input length).
- Improve tokenization for HF provider (left padding, attention mask, and allow for custom chat template).
- Improve batching for HF provider (generate as soon as queue fills, thread safety for future.set_result).
- Various improvements to documentation.

## v0.3.4 (01 May 2024)

- `write_eval_log()` now ignores unserializable objects in metadata fields.
2 changes: 1 addition & 1 deletion docs/_quarto.yml
@@ -17,7 +17,7 @@ book:
description: "Open-source framework for large language model evaluations"
sidebar:
header: >
[![](images/aisi-logo.png)](https://www.gov.uk/government/organisations/ai-safety-institute)
[![](images/aisi-logo.png){fig-alt="UK AI Safety Institute Website"}](https://www.gov.uk/government/organisations/ai-safety-institute)
page-footer:
left:
4 changes: 2 additions & 2 deletions docs/eval-logs.qmd
@@ -8,15 +8,15 @@ Every time you use `inspect eval` or call the `eval()` function, an evaluation l
$ inspect eval security_guide.py --model openai/gpt-4
```

![](images/eval-log.png)
![](images/eval-log.png){fig-alt="The Inspect task results displayed in the terminal. A link to the evaluation log is at the bottom of the results display."}

You can also use the Inspect log viewer for interactive exploration of logs. Run this command once at the beginning of a working session (the view will update automatically when new evaluations are run):

```bash
$ inspect view
```

![](images/inspect-view-main.png){.border .lightbox}
![](images/inspect-view-main.png){.border .lightbox fig-alt="The Inspect log viewer, displahing a summary of results for the task as well as 8 individual samples."}

This section won't cover using `inspect view` though. Rather, it will cover the details of managing log usage from the CLI as well as the Python API for reading logs. See the [Log Viewer](#sec-log-viewer) section for details on interactively exploring logs.
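
As a rough sketch of the reading side of that Python API (assuming the `list_eval_logs()` and `read_eval_log()` helpers in `inspect_ai.log` and the default `./logs` directory), listing and loading a log might look like this:

``` python
from inspect_ai.log import list_eval_logs, read_eval_log

# List logs in the ./logs directory (assumed default location), then read
# the first one returned and print a few of its top-level fields.
logs = list_eval_logs()
if logs:
    log = read_eval_log(logs[0])
    print(log.eval.task, log.eval.model, log.status)
```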

2 changes: 1 addition & 1 deletion docs/eval-tuning.qmd
@@ -23,7 +23,7 @@ The default value for max connections is 10. By increasing it we might get bette

When you run an eval you'll see information reported on the current active connection usage as well as the number of HTTP rate limit errors that have been encountered (note that Inspect will automatically retry on rate limits and other errors likely to be transient):

![](images/rate-limit.png)
![](images/rate-limit.png){fig-alt="The Inspect task results displayed in the terminal. The number of HTTP rate limit errors that have occurred (25) is printed in the bottom right of the task results."}

Here we've set max connections higher than the default (to 30). While you might be tempted to set this very high to see how much concurrent traffic you can sustain, more often than not setting max connections too high will result in slower evaluations, because retries are done using [exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff), and bouncing off rate limits too frequently will leave you waiting minutes for retries to fire.
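
For reference, a minimal sketch of raising the limit from Python (assuming `eval()` accepts a `max_connections` generation option, mirroring the `--max-connections` CLI flag):

``` python
from inspect_ai import eval

# Allow up to 30 concurrent connections to the model provider.
# "security_guide.py" is a hypothetical task file used for illustration;
# back the limit off if you find yourself bouncing off rate limits.
eval("security_guide.py", model="openai/gpt-4", max_connections=30)
```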

6 changes: 3 additions & 3 deletions docs/index.qmd
@@ -12,7 +12,7 @@ toc: false

- Adapt and extend the framework with custom Python components.

![](images/inspect-view-splash.png){.lightbox .border}
![](images/inspect-view-splash.png){.lightbox .border fig-alt="The Inspect log viewer, displaying a summary of results for the task as well as 5 individual samples."}

:::

@@ -142,7 +142,7 @@ The `@task` decorator applied to the `theory_of_mind()` function is what enables
$ inspect eval theory_of_mind.py --model openai/gpt-4
```

![](images/running-theory.png)
![](images/running-theory.png){fig-alt="The Inspect task results displayed in the terminal. A progress bar indicates that the evaluation is about 60% complete."}
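
For reference, the `theory_of_mind.py` module being run above might look roughly like the following (a sketch assuming the `example_dataset()` helper and the standard solvers and scorer used in these docs; the real definition may differ):

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate, self_critique

@task
def theory_of_mind():
    # The @task decorator registers the function so `inspect eval` can find it.
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[chain_of_thought(), generate(), self_critique()],
        scorer=model_graded_fact(),
    )
```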

By default, eval logs are written to the `./logs` sub-directory of the current working directory. When the eval is complete you will find a link to the log at the bottom of the task results summary.

@@ -152,7 +152,7 @@ You can also explore eval results using the Inspect log viewer. Run `inspect vie
$ inspect view
```

![](images/inspect-view-home.png){.border .lightbox}
![](images/inspect-view-home.png){.border .lightbox fig-alt="The Inspect log viewer, displaying a summary of results for the task as well as 7 individual samples."}

See the [Log Viewer](#sec-log-viewer) section for additional details on using Inspect View.

22 changes: 11 additions & 11 deletions docs/log-viewer.qmd
@@ -4,7 +4,7 @@

Inspect View provides a convenient way to visualise evaluation logs, including drilling into message histories, scoring decisions, and additional metadata written to the log. Here's what the main view of an evaluation log looks like:

![](images/inspect-view-main.png){.border .lightbox}
![](images/inspect-view-main.png){.border .lightbox fig-alt="The Inspect log viewer, displaying a summary of results for the task as well as 8 individual samples."}

Below we'll describe how to get the most out of using Inspect View.

@@ -36,7 +36,7 @@ You only need to run `inspect view` once at the beginning of a session (as it wi

You can view and navigate between a history of all evals in the log directory using the menu at the top right:

![](images/inspect-view-history.png){.border .lightbox}
![](images/inspect-view-history.png){.border .lightbox fig-alt="The Inspect log viewer, with the history panel displayed on the left overlaying the main interface. Several log files are displayed in the log history, each of which includes a summary of the results."}

## Sample Details

@@ -46,21 +46,21 @@ Click a sample to drill into its messages, scoring, and metadata.

The messages tab displays the message history. In this example, we see that the model makes two tool calls before answering (the final assistant message is not fully displayed for brevity):

![](images/inspect-view-messages.png){.border .lightbox}
![](images/inspect-view-messages.png){.border .lightbox fig-alt="The Inspect log viewer showing a sample expanded, with details on the user, assistant, and tool messages for the sample."}

Looking carefully at the message history (especially for agents or multi-turn solvers) is critically important for understanding how well your evaluation is constructed.

### Scoring

The scoring tab shows additional details including the full input and full model explanation for answers:

![](images/inspect-view-scoring.png){.border .lightbox}
![](images/inspect-view-scoring.png){.border .lightbox fig-alt="The Inspect log viewer showing a sample expanded, with details on the scoring of the sample, including the input, target, answer, and explanation."}

### Metadata

The metadata tab shows additional data made available by solvers, tools, and scorers (in this case, the `web_search()` tool records which URLs it visited to retrieve additional context):

![](images/inspect-view-metadata.png){.border .lightbox}
![](images/inspect-view-metadata.png){.border .lightbox fig-alt="The Inspect log viewer showing a sample expanded, with details on the metadata recorded by the web search tool during the evaluation (specifically, the URLs queried by the web search tool for the sample)."}

## Scores and Answers

@@ -75,7 +75,7 @@ A scorer can fail to correctly score output at either of these steps. Failing to

You can use the log viewer to catch and evaluate these sorts of issues. For example, here we can see that we were unable to extract answers for a couple of questions that were scored incorrect:

![](images/inspect-view-answers.png){.border .lightbox}
![](images/inspect-view-answers.png){.border .lightbox fig-alt="The Inspect log viewer with 5 samples displayed, 3 of which are incorrect. The Answer column displays the answer extracted from the model output for each sample."}

It's possible that these answers are legitimately incorrect. However it's also possible that the correct answer is in the model's output but just in a format we didn't quite expect. In each case you'll need to drill into the sample to investigate.
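
To make the two steps concrete, here is a rough sketch of the object a scorer might return (assuming the `Score` class and `INCORRECT` constant from `inspect_ai.scorer`); the `answer` field is what populates the Answer column shown above:

``` python
from inspect_ai.scorer import INCORRECT, Score

# Hypothetical result for one sample: the scorer records the answer it
# extracted from the model output alongside the value it assigned, so the
# log viewer can reveal whether a miss was a true negative or an
# extraction failure.
score = Score(
    value=INCORRECT,
    answer="B",  # answer extracted from the model's output
    explanation="Output ended with 'ANSWER: B' but the target was 'C'.",
)
```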

@@ -97,11 +97,11 @@ Note there is also an `explanation` field: this is also important, as it allows

It's often useful to filter log entries by score (for example, to investigate whether incorrect answers are due to scorer issues or are true negatives). Use the **Scores** picker to filter by specific scores:

![](images/inspect-view-filter.png){.border .lightbox}
![](images/inspect-view-filter.png){.border .lightbox fig-alt="The Inspect log viewer, with 4 samples displayed, each of which is marked incorrect. The Scores picker is focused, and has selected 'Incorrect', indicating that only incorrect scores should be displayed."}

By default, samples are ordered (with all samples for an epoch presented in sequence). However, you can also order by score, or order by samples (so you see all of the results for a given sample across all epochs presented together). Use the **Sort** picker to control this:

![](images/inspect-view-sort.png){.border .lightbox}
![](images/inspect-view-sort.png){.border .lightbox fig-alt="The Inspect log viewer, with the results of a single sample for each of the 4 epochs of the evaluation."}

Viewing by sample can be especially valuable for diagnosing the sources of inconsistency (and determining whether they are inherent or an artifact of the evaluation methodology). Above we can see that sample 1 is incorrect in epoch 1 because of an issue the model had with forming a correct function call.

Expand All @@ -121,15 +121,15 @@ logger.info(f"web query: {query}")

You can see all of these log entries in the **Logging** tab:

![](images/inspect-view-logging.png){.border .lightbox}
![](images/inspect-view-logging.png){.border .lightbox fig-alt="The Logging panel of the Inspect log viewer, displaying several info log messages from the web search tool indicating what queries were executed by the tool."}

It is important to note that Inspect View will show all log entries of level `info` or higher. However, printing every `info` message to the console during an eval might be too distracting, so the default log level for printing is `warning`. If you change it to `info` then you'll also see these log messages in the console:

``` bash
$ inspect eval biology_qa.py --log-level info
```

![](images/inspect-view-logging-console.png){.lightbox}
![](images/inspect-view-logging-console.png){.lightbox fig-alt="The Inspect task display in the terminal, with several info log messages from the web search tool printed above the task display."}

A default log level of `warning` enables you to include many calls to `logger.info()` in your code without having them show by default, while also making them available in the log viewer should you need them.
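
For example, a tool might log like this (a minimal sketch using Python's standard `logging` module; the `web_search()` body shown here is hypothetical):

``` python
from logging import getLogger

logger = getLogger(__name__)

def web_search(query: str) -> str:
    # Recorded in the eval log and shown in the Logging tab, but only
    # printed to the console when the log level is set to info.
    logger.info(f"web query: {query}")
    ...
```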

@@ -139,4 +139,4 @@ Note that you can also set the log level using the `INSPECT_LOG_LEVEL` environme

The **Info** panel of the log viewer provides additional meta-information about evaluation tasks, including dataset, plan, and scorer details, git revision, and model token usage:

![](images/inspect-view-info.png){style=".border .lightbox"}
![](images/inspect-view-info.png){.border .lightbox fig-alt="The Info panel of the Inspect log viewer, displaying various details about the evaluation including dataset, plan, and scorer details, git revision, and model token usage."}
4 changes: 2 additions & 2 deletions docs/models.qmd
@@ -295,7 +295,7 @@ def theory_of_mind():

## Model Args

The section above illustrates passing model specific arguments to local models on the command line, in `eval()`, and in `get_model()`. This actually works for all model types, so if there is an additional aspect of a modal you want to tweak that isn't covered by the `GenerationConfig`, you can use this method to do it. For example, here we specify the `transport` option for a Google Gemini model:
The section above illustrates passing model specific arguments to local models on the command line, in `eval()`, and in `get_model()`. This actually works for all model types, so if there is an additional aspect of a model you want to tweak that isn't covered by the `GenerationConfig`, you can use this method to do it. For example, here we specify the `transport` option for a Google Gemini model:

``` bash
inspect eval popularity --model google/gemini-1.0-pro -M transport:grpc
@@ -358,4 +358,4 @@ model = get_model("custom/name-of-model")
eval(math, model = "custom/name-of-model")
```

In this example, the `model_name` argument passed to `__init__()` will be "name-of-model".
In this example, the `model_name` argument passed to `__init__()` will be "name-of-model".
3 changes: 3 additions & 0 deletions docs/theme.scss
@@ -46,3 +46,6 @@
}
}

.blockquote {
color: #505a62;
}
2 changes: 1 addition & 1 deletion docs/tools.qmd
@@ -183,7 +183,7 @@ Web search options include:

- `model`---Model to use to determine if search results are relevant (defaults to the model currently being evaluated).

#### Google Provider
### Google Provider

The `web_search()` tool uses [Google Programmable Search Engine](https://programmablesearchengine.google.com/about/). To use it you will therefore need to set up your own Google Programmable Search Engine and also enable the [Programmable Search Element Paid API](https://developers.google.com/custom-search/docs/paid_element). Then, ensure that the following environment variables are defined:

6 changes: 3 additions & 3 deletions docs/workflow.qmd
@@ -37,7 +37,7 @@ You can run this evaluation from the shell using the `inspect eval` command. For
$ inspect eval theory.py --model openai/gpt-4
```

![](images/running-theory.png)
![](images/running-theory.png){fig-alt="The Inspect task results displayed in the terminal. A progress bar indicates that the evaluation is about 60% complete."}

Immediately after an evaluation completes, a link to the log for the evaluation is written to the terminal (if you are running in VS Code this link will open the log in an editor within the IDE).

@@ -67,12 +67,12 @@ As you iterate on an evaluation, you'll typically want to dig further into messa
$ inspect view
```

![](images/inspect-view-main.png){.border .lightbox}
![](images/inspect-view-main.png){.border .lightbox fig-alt="The Inspect log viewer, displaying a summary of results for the task as well as 8 individual samples."}


The log viewer will update automatically whenever a new evaluation is completed (you can also navigate back to previous evaluations). The log viewer summarises aggregate data and also provides a detailed view into each sample. For example, here we zoom in on the model's scoring explanation for a specific sample:

![](images/inspect-view-scoring.png){.border .lightbox}
![](images/inspect-view-scoring.png){.border .lightbox fig-alt="The Inspect log viewer showing a sample expanded, with details on the scoring of the sample, including the input, target, answer, and explanation."}

See the [Log Viewer](#sec-log-viewer) section for additional details on using Inspect View.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -108,6 +108,7 @@ dev = [
"mistralai",
"boto3",
"transformers",
"accelerate",
"torch",
"datasets",
"langchain",
17 changes: 15 additions & 2 deletions src/inspect_ai/_view/view.py
@@ -9,7 +9,7 @@
from io import BytesIO
from pathlib import Path
from typing import Any
from urllib.parse import parse_qs, urlparse
from urllib.parse import parse_qs, urlparse, urlunparse

import psutil

@@ -128,10 +128,23 @@ def handle_log(self) -> None:

# check for query params
parsed = urlparse(path)
path = parsed.path

# read query parameters from the URL
query_params = parse_qs(parsed.query)
header_only = query_params.get("header-only", None) is not None

# reconstruct the path
path = urlunparse(
(
parsed.scheme,
parsed.netloc,
parsed.path,
parsed.params,
"", # Clear the query component
parsed.fragment,
)
)

ctype = self.guess_type(path)
try:
contents: bytes | None = None
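
A quick standard-library sketch of the behaviour the reconstruction above relies on: `urlunparse()` rebuilds the URL with the query component cleared, so the subsequent `guess_type()` and file read operate on the bare log path (the log path below is hypothetical, for illustration only):

``` python
from urllib.parse import parse_qs, urlparse, urlunparse

raw = "logs/2024-05-04T12-00-00_theory_of_mind.json?header-only=1"

parsed = urlparse(raw)
print(parse_qs(parsed.query))  # {'header-only': ['1']}

# Rebuild the URL with the query component cleared.
clean = urlunparse(
    (parsed.scheme, parsed.netloc, parsed.path, parsed.params, "", parsed.fragment)
)
print(clean)  # logs/2024-05-04T12-00-00_theory_of_mind.json
```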