gaia tweaks (#450)

Co-authored-by: aisi-inspect <[email protected]>
UKGovernmentBEIS · Sep 18, 2024 · 45d17d0
1 parent 33ef823
commit 45d17d0

Showing 6 changed files with 156 additions and 67 deletions.
diff --git a/docs/examples/examples.bib b/docs/examples/examples.bib
@@ -257,3 +257,13 @@ @misc{jimenez2024swebenchlanguagemodelsresolve
       primaryClass={cs.CL},
       url={https://arxiv.org/abs/2310.06770}, 
 }
+
+@misc{mialon2023gaiabenchmarkgeneralai,
+      title={GAIA: a benchmark for General AI Assistants}, 
+      author={Grégoire Mialon and Clémentine Fourrier and Craig Swift and Thomas Wolf and Yann LeCun and Thomas Scialom},
+      year={2023},
+      eprint={2311.12983},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2311.12983}, 
+}
diff --git a/docs/examples/examples.yml b/docs/examples/examples.yml
@@ -31,6 +31,16 @@
   demonstrates: ["Scoring", "Sandbox", "Tools"]
   contributors: ["max-kaufmann"]
 
+- title: "GAIA: A Benchmark for General AI Assistants"
+  description: |
+    GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs.
+  path: evals/gaia
+  arxiv: https://arxiv.org/abs/2311.12983
+  cite: mialon2023gaiabenchmarkgeneralai
+  group: Assistants
+  demonstrates: ["Web Search", "Sandbox", "Tools"]
+  contributors: ["max-kaufmann"]
+
 - title: "InterCode: Capture the Flag"
   description: |
     Measure expertise in coding, cryptography (i.e. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.

diff --git a/evals/gaia/.gitignore b/evals/gaia/.gitignore
@@ -1 +1 @@
-./resources/*
+resources/*
diff --git a/evals/gaia/README.md b/evals/gaia/README.md
@@ -1,26 +1,72 @@
 # GAIA
 This is an inspect-native implementation of [the GAIA (General AI Assistants)](https://arxiv.org/abs/2311.12983) benchmark, consisting of 450 questions testing tool use on realistic assistant tasks (mostly web browsing).
 
-## Installation
+## Prerequisites
+
+1) **Python Dependencies** Install the Python dependencies for the GAIA task with:
+
+   ```
+   pip install -r requirements.txt
+   ```
+
+2) **Docker Engine** The GAIA task uses [tool calling](https://inspect.ai-safety-institute.org.uk/tools.html) to enable the model to execute bash commands. Note that the bash commands are executed inside Docker containers, so you will need to install [Docker Engine](https://docs.docker.com/engine/install/) in order to run the evaluation.
+
+3) **Hugging Face Dataset** When the GAIA task runs, it attempts to download the dataset from the Hugging Face Hub. For this to work, you need to request access to the dataset (by filling out a form on [the GAIA Hugging Face repository](https://huggingface.co/datasets/gaia-benchmark/GAIA)) and [create an access token](https://huggingface.co/docs/hub/en/security-tokens). Then define the `HF_TOKEN` environment variable so the download can authenticate:
+
+   ```
+   HF_TOKEN=<hf-token>
+   ```
+
+4) **Search Provider** The default solver for the GAIA task uses the Inspect [web_search()](https://inspect.ai-safety-institute.org.uk/tools.html#sec-web-search) tool, which relies on [Google Programmable Search Engine](https://programmablesearchengine.google.com/about/). To use it, you need to set up your own Google Programmable Search Engine and enable the [Programmable Search Element Paid API](https://developers.google.com/custom-search/docs/paid_element). Then ensure that the following environment variables are defined:
+
+   ```
+   GOOGLE_CSE_ID=<google-search-engine-id>
+   GOOGLE_CSE_API_KEY=<google-api-key>
+   ```
 
-1) **Install requirements.** As well as the requirements provided by inspect, this benchmark has its own dependencies. You can install them in your virtual environment by running ```pip install -r requirements.txt```
-2) **Download the dataset.** Upon loading gaia.py file, it will attempt to download the dataset from the HuggingFace hub. For this to work, you will need to gain access to the dataset (by filling out a form on [the GAIA huggingface repository](https://huggingface.co/datasets/gaia-benchmark/GAIA)), and also [create and set an access token](https://huggingface.co/docs/hub/en/security-tokens).
 
 ## Usage
-After install, the GAIA benchmark can be used in the following way:
+
+After fulfilling the prerequisites, run the GAIA task against various models with:
+
+```bash
+$ inspect eval gaia.py --model openai/gpt-4o
+$ inspect eval gaia.py --model google/gemini-1.5-pro
+```
+
+You might want to limit the number of samples evaluated for initial experimentation:
+
+```bash
+$ inspect eval gaia.py --model openai/gpt-4o --limit 5
+```
+
+## Development
+
+If you want to develop a solver for the GAIA task, import the `gaia` task from the `gaia` module and pass your own solver to it. For example:
 
 ```python
 from inspect_ai import eval
 from inspect_ai.solver import use_tools, generate, system_message, basic_agent
 from inspect_ai.tool import bash, web_search
 from gaia import gaia
 
-# Create a task to attempt to solve GAIA with a basic agent.
 agent = basic_agent(tools=[bash()])
-task = gaia(split="validation", plan=agent)
+task = gaia(plan=agent)
+
+eval(task, model="openai/gpt-4o")
+```
+
+See the documentation on the Inspect [Agents API](https://inspect.ai-safety-institute.org.uk/agents-api.html) for additional information on developing agents.
+
+### Test Split
+
+You can evaluate against the "test" split with:
+
+```python
+agent = basic_agent(tools=[bash()])
+task = gaia(plan=agent, split="test")
 
-# For the demonstration, we will select only the first 5 tasks.
-eval(task, model="openai/gpt-4o", limit=5)
+eval(task, model="openai/gpt-4o")
 ```
 
-Note that the GAIA test split does not come with any solutions - they must be uploaded to the online leaderboard.
+Note that the GAIA "test" split does not come with any solutions, so it is not scored (instead, solutions are uploaded to the online leaderboard).
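The README prerequisites above depend on three environment variables (`HF_TOKEN`, `GOOGLE_CSE_ID`, `GOOGLE_CSE_API_KEY`). As a minimal sketch, and not part of this commit, a pre-flight check along the following lines could report which of them are missing before an evaluation is started. The names `REQUIRED_ENV_VARS` and `missing_gaia_env` are hypothetical and introduced only for illustration.

```python
import os
from typing import Mapping, Optional

# Variable names are taken from the README prerequisites in this commit.
REQUIRED_ENV_VARS = ("HF_TOKEN", "GOOGLE_CSE_ID", "GOOGLE_CSE_API_KEY")


def missing_gaia_env(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_gaia_env()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All environment variables named in the README are set.")
```

Passing a mapping explicitly keeps the check easy to exercise without touching the real environment. Docker and dataset access still have to be verified separately.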
diff --git a/evals/gaia/dataset.py b/evals/gaia/dataset.py
diff --git a/evals/gaia/gaia.py b/evals/gaia/gaia.py
@@ -2,8 +2,10 @@
 from pathlib import Path
 from typing import Any, Callable, Literal
 
+from huggingface_hub import snapshot_download  # type: ignore
+
 from inspect_ai import Task, task
-from inspect_ai.dataset import Sample, hf_dataset
+from inspect_ai.dataset import Dataset, Sample, hf_dataset
 from inspect_ai.scorer import match
 from inspect_ai.solver import Plan, basic_agent, system_message
 from inspect_ai.tool import bash, web_search
@@ -14,81 +16,113 @@
 @task
 def gaia(
     plan: Plan | None = None,
-    split: Literal["test", "validation"] = "validation",
+    input_prompt: str | None = None,
+    max_messages: int = 30,
     subset: Literal[
         "2023_all", "2023_level1", "2023_level2", "2023_level3"
     ] = "2023_all",
-    filter: Callable[[Sample], bool] = lambda x: True,
-    input_prompt: str | None = None,
-    max_messages: int = 30,
+    split: Literal["test", "validation"] = "validation",
+    filter: Callable[[Sample], bool] | None = None,
 ) -> Task:
-    if not os.path.exists(GAIA_DATASET_LOCATION):
-        raise FileNotFoundError(
-            "GAIA dataset not found. Please run 'python dataset.py.' to download the dataset."
-        )
-
-    dataset = hf_dataset(
-        str(GAIA_DATASET_LOCATION),
+    # read dataset
+    dataset = read_dataset(
+        input_prompt=input_prompt or DEFAULT_INPUT_PROMPT,
+        subset=subset,
         split=split,
-        sample_fields=gaia_hf_record_to_sample(input_prompt, split),
-        trust=True,  # Trust GAIA's remote code execution during dataset loading
-        name=subset,
+        filter=filter if filter else lambda x: True,
     )
 
-    dataset = dataset.filter(filter)
+    # provide default plan if required
+    if plan is None:
+        plan = basic_agent(
+            init=system_message(
+                "Please solve the coding task below. Once you are done, "
+                + "use the submit() tool to provide your answer."
+            ),
+            tools=[bash(), web_search()],
+        )
 
+    # resolve scorer (test split has no answers)
+    scorer = match() if split == "validation" else None
+
+    # return task
     return Task(
         dataset=dataset,
-        plan=plan if plan is not None else DEFAULT_PLAN,
-        scorer=match() if split == "validation" else None,  # Test split has no answers
+        plan=plan,
+        scorer=scorer,
         sandbox="docker",
         max_messages=max_messages,
     )
 
 
-def gaia_hf_record_to_sample(
-    input_prompt: str | None, split: str
-) -> Callable[[dict[str, Any]], Sample]:
-    input_prompt = input_prompt if input_prompt is not None else DEFAULT_INPUT_PROMPT
-    files_location = GAIA_DATASET_LOCATION / "2023" / split
+DEFAULT_INPUT_PROMPT = """Please answer the question below. You should:
+
+- Return only your answer, which should be a number, or a short phrase with as
+  few words as possible, or a comma separated list of numbers and/or strings.
+- If the answer is a number, return only the number without any units unless
+  specified otherwise.
+- If the answer is a string, don't include articles, and don't use
+  abbreviations (e.g. for states).
+- If the answer is a comma separated list, apply the above rules to each
+  element in the list.
+
+Any files or attachments mentioned in the question can be found in the
+/shared_files/ directory (some questions do not have associated files). Here
+is the question:
 
-    def gaia_hf_record_to_sample(record: dict[str, Any]) -> Sample:
+{question}"""
+
+
+def read_dataset(
+    input_prompt: str,
+    subset: str,
+    split: str,
+    filter: Callable[[Sample], bool] = lambda x: True,
+) -> Dataset:
+    # download dataset if required
+    if not os.path.exists(GAIA_DATASET_LOCATION):
+        GAIA_DATASET_LOCATION.mkdir(parents=True, exist_ok=True)
+        snapshot_download(
+            repo_id="gaia-benchmark/GAIA",
+            repo_type="dataset",
+            local_dir=GAIA_DATASET_LOCATION,
+        )
+
+    # map record to sample
+    def record_to_sample(record: dict[str, Any]) -> Sample:
+        # map fields
         sample = Sample(
             input=input_prompt.format(question=record["Question"]),
             target=record["Final answer"],
             id=record["task_id"],
             metadata={
-                "level": record["level"],
+                "level": record["Level"],
                 "Annotator Metadata": record["Annotator Metadata"],
             },
+            setup="mkdir -p /shared_files/",
         )
 
+        # apply input prompt
         sample.input = input_prompt.format(question=sample.input)
 
-        # Setup files into the sample
+        # provide sample files
+        files_location = GAIA_DATASET_LOCATION / "2023" / split
         files = [file for file in os.listdir(files_location) if str(sample.id) in file]
         if len(files) > 0:
-            sample.files = {"/shared_files/" + files[0]: str((files / files[0]))}
+            sample.files = {
+                "/shared_files/" + files[0]: (files_location / files[0]).as_posix()
+            }
 
         return sample
 
-    return gaia_hf_record_to_sample
-
-
-DEFAULT_INPUT_PROMPT = """Please answer the question below. You should:
-
-- Return only your answer, which should be a number, or a short phrase with as few words as possible, or a comma separated list of numbers and/or strings.
-- If the answer is a number, return only the number without any units unless specified otherwise.
-- If the answer is a string, don't include articles, and don't use abbreviations (e.g. for states).
-- If the answer is a comma separated list, apply the above rules to each element in the list.
-
-Any files or attachments mentioned in the question can be found in the /shared_files/ directory (some questions do not have associated files). Here is the question:
-
-{question}"""
+    # read dataset
+    dataset = hf_dataset(
+        GAIA_DATASET_LOCATION.as_posix(),
+        name=subset,
+        split=split,
+        sample_fields=record_to_sample,
+        trust=True,  # Trust GAIA's remote code execution during dataset loading
+    )
 
-DEFAULT_PLAN = basic_agent(
-    init=system_message(
-        "Please solve the coding task below. Once you are done, use your submit tool."
-    ),
-    tools=[bash(), web_search()],
-)
+    # apply filter (if any) and return
+    return dataset.filter(filter)
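In the `record_to_sample` mapping above, `input_prompt.format(...)` is called in two places: once when the `Sample` is constructed and again under the `# apply input prompt` comment. The sample's file is then located by matching the task id against file names in the split directory. The following standalone sketch mirrors those two steps without `inspect_ai`, so the behaviour can be checked in isolation. It is an illustration under stated assumptions, not the code from this commit: `PROMPT`, `build_input`, and `sample_files` are hypothetical names, the prompt is a shortened stand-in for `DEFAULT_INPUT_PROMPT`, and the sketch sorts directory entries for determinism where the original uses `os.listdir` order.

```python
import tempfile
from pathlib import Path

# Shortened stand-in for DEFAULT_INPUT_PROMPT; only the {question} slot matters here.
PROMPT = "Please answer the question below.\n\n{question}"


def build_input(prompt: str, question: str) -> str:
    """Fill the {question} slot of the prompt template."""
    return prompt.format(question=question)


def sample_files(files_location: Path, task_id: str) -> dict:
    """Map the first file whose name contains task_id into /shared_files/."""
    # Sorting is an assumption added for determinism; os.listdir order is arbitrary.
    matches = sorted(p.name for p in files_location.iterdir() if task_id in p.name)
    if not matches:
        return {}
    return {"/shared_files/" + matches[0]: (files_location / matches[0]).as_posix()}


with tempfile.TemporaryDirectory() as tmp:
    # Mimic the <dataset>/2023/<split> layout used in the diff.
    split_dir = Path(tmp) / "2023" / "validation"
    split_dir.mkdir(parents=True)
    (split_dir / "abc123.xlsx").write_text("placeholder")
    (split_dir / "metadata.jsonl").write_text("{}")

    with_file = sample_files(split_dir, "abc123")
    without_file = sample_files(split_dir, "zzz999")

# Formatting once yields one copy of the instructions; formatting the result
# again wraps the already formatted text in a second copy.
once = build_input(PROMPT, "What is 2 + 2?")
twice = build_input(PROMPT, once)
```

Comparing `once` and `twice` shows that a second `format` call nests the instructions inside themselves. That makes it worth checking whether both calls in `record_to_sample` are intended.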