gaia tweaks (#450)
Co-authored-by: aisi-inspect <[email protected]>
jjallaire-aisi and aisi-inspect authored Sep 18, 2024
1 parent 33ef823 commit 45d17d0
Showing 6 changed files with 156 additions and 67 deletions.
10 changes: 10 additions & 0 deletions docs/examples/examples.bib
Original file line number Diff line number Diff line change
@@ -257,3 +257,13 @@ @misc{jimenez2024swebenchlanguagemodelsresolve
primaryClass={cs.CL},
url={https://arxiv.org/abs/2310.06770},
}

@misc{mialon2023gaiabenchmarkgeneralai,
title={GAIA: a benchmark for General AI Assistants},
author={Grégoire Mialon and Clémentine Fourrier and Craig Swift and Thomas Wolf and Yann LeCun and Thomas Scialom},
year={2023},
eprint={2311.12983},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2311.12983},
}
10 changes: 10 additions & 0 deletions docs/examples/examples.yml
@@ -31,6 +31,16 @@
demonstrates: ["Scoring", "Sandbox", "Tools"]
contributors: ["max-kaufmann"]

- title: "GAIA: A Benchmark for General AI Assistants"
description: |
GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs
path: evals/gaia
arxiv: https://arxiv.org/abs/2311.12983
cite: mialon2023gaiabenchmarkgeneralai
group: Assistants
demonstrates: ["Web Search", "Sandbox", "Tools"]
contributors: ["max-kaufmann"]

- title: "InterCode: Capture the Flag"
description: |
Measure expertise in coding, cryptography (i.e. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.
2 changes: 1 addition & 1 deletion evals/gaia/.gitignore
Original file line number Diff line number Diff line change
@@ -1 +1 @@
./resources/*
resources/*
64 changes: 55 additions & 9 deletions evals/gaia/README.md
@@ -1,26 +1,72 @@
# GAIA
This is an inspect-native implementation of [the GAIA (General AI Assistants)](https://arxiv.org/abs/2311.12983) benchmark, consisting of 450 questions testing tool use on realistic assistant tasks (mostly web browsing).

## Installation
## Prerequisites

1) **Python Dependencies** Install the Python dependencies for the GAIA task with:

```
pip install -r requirements.txt
```

2) **Docker Engine** The GAIA task uses [tool calling](https://inspect.ai-safety-institute.org.uk/tools.html) to enable the model to execute bash commands. Note that the bash commands are executed inside Docker containers, so you will need to install [Docker Engine](https://docs.docker.com/engine/install/) in order to run the evaluation.

3) **Hugging Face Dataset** When the GAIA task runs, it attempts to download the dataset from the Hugging Face Hub. For this to work, you will need to gain access to the dataset (by filling out a form on [the GAIA Hugging Face repository](https://huggingface.co/datasets/gaia-benchmark/GAIA)) and [create and set an access token](https://huggingface.co/docs/hub/en/security-tokens). You will need to define the `HF_TOKEN` environment variable to access the dataset:

```
HF_TOKEN=<hf-token>
```

4) **Search Provider** The default solver for the GAIA task uses the inspect [web_search()](https://inspect.ai-safety-institute.org.uk/tools.html#sec-web-search) tool. The `web_search()` tool uses [Google Programmable Search Engine](https://programmablesearchengine.google.com/about/). To use it you will therefore need to set up your own Google Programmable Search Engine and also enable the [Programmable Search Element Paid API](https://developers.google.com/custom-search/docs/paid_element). Then, ensure that the following environment variables are defined:

```
GOOGLE_CSE_ID=<google-search-engine-id>
GOOGLE_CSE_API_KEY=<google-api-key>
```
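
Before launching a run, it can help to confirm that the variables named in the prerequisites are actually defined. The following is a minimal sketch, not part of this commit; the helper name `missing_env_vars` is hypothetical, and only the three variable names come from the README.

```python
import os

# Variable names taken from the prerequisites above.
REQUIRED_ENV_VARS = ("HF_TOKEN", "GOOGLE_CSE_ID", "GOOGLE_CSE_API_KEY")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    (Hypothetical helper, not part of the GAIA task.)
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All GAIA environment variables are set.")
```

Running the check before `inspect eval` surfaces a missing token early, instead of partway through the dataset download or the first web search.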

1) **Install requirements.** As well as the requirements provided by inspect, this benchmark has its own dependencies. You can install them in your virtual environment by running ```pip install -r requirements.txt```
2) **Download the dataset.** Upon loading gaia.py file, it will attempt to download the dataset from the HuggingFace hub. For this to work, you will need to gain access to the dataset (by filling out a form on [the GAIA huggingface repository](https://huggingface.co/datasets/gaia-benchmark/GAIA)), and also [create and set an access token](https://huggingface.co/docs/hub/en/security-tokens).

## Usage
After install, the GAIA benchmark can be used in the following way:

After fulfilling the prerequisites, run the GAIA task against various models with:

```bash
$ inspect eval gaia.py --model openai/gpt-4o
$ inspect eval gaia.py --model google/gemini-1.5-pro
```

You might want to limit the number of samples evaluated for initial experimentation:

```bash
$ inspect eval gaia.py --model openai/gpt-4o --limit 5
```

## Development

If you want to develop a solver for the GAIA task, import the `@task` from the `gaia` module and pass your own solver to it. For example:

```python
from inspect_ai import eval
from inspect_ai.solver import use_tools, generate, system_message, basic_agent
from inspect_ai.tool import bash, web_search
from gaia import gaia

# Create a task to attempt to solve GAIA with a basic agent.
agent = basic_agent(tools=[bash()])
task = gaia(split="validation", plan=agent)
task = gaia(plan=agent)

eval(task, model="openai/gpt-4o")
```

See the documentation on the Inspect [Agents API](https://inspect.ai-safety-institute.org.uk/agents-api.html) for additional information on developing agents.
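
The `gaia()` task in `gaia.py` also accepts a `filter` callable that takes a sample and returns a boolean, and each sample stores its difficulty under `metadata["level"]`. A rough sketch of such a predicate follows. The helper name `is_level` is hypothetical, the `SimpleNamespace` objects are stand-ins for real `Sample` instances, and levels are compared as strings because the stored type is an assumption.

```python
from types import SimpleNamespace


def is_level(level):
    """Build a predicate that keeps samples whose metadata level matches.

    Hypothetical helper; comparison is done on strings because whether the
    dataset stores the level as "1" or 1 is an assumption here.
    """
    def predicate(sample):
        metadata = sample.metadata or {}
        return str(metadata.get("level")) == str(level)

    return predicate


# Stand-ins for inspect_ai Sample objects (illustration only).
samples = [
    SimpleNamespace(metadata={"level": "1"}),
    SimpleNamespace(metadata={"level": 2}),
    SimpleNamespace(metadata=None),
]
kept = [s for s in samples if is_level(1)(s)]

# With the real task, the predicate would be passed as:
#   task = gaia(plan=agent, filter=is_level(1))
```

Filtering by predicate is separate from the `subset` argument, which selects one of the published configurations such as `2023_level1`.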

### Test Split

You can evaluate against the "test" split with:

```python
agent = basic_agent(tools=[bash()])
task = gaia(plan=agent, split="test")

# For the demonstration, we will select only the first 5 tasks.
eval(task, model="openai/gpt-4o", limit=5)
eval(task, model="openai/gpt-4o")
```

Note that the GAIA test split does not come with any solutions - they must be uploaded to the online leaderboard.
Note that the GAIA "test" split does not come with any solutions so is not scored (rather, solutions are uploaded to the online leaderboard).
11 changes: 0 additions & 11 deletions evals/gaia/dataset.py

This file was deleted.

126 changes: 80 additions & 46 deletions evals/gaia/gaia.py
@@ -2,8 +2,10 @@
from pathlib import Path
from typing import Any, Callable, Literal

from huggingface_hub import snapshot_download # type: ignore

from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.dataset import Dataset, Sample, hf_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import Plan, basic_agent, system_message
from inspect_ai.tool import bash, web_search
@@ -14,81 +16,113 @@
@task
def gaia(
plan: Plan | None = None,
split: Literal["test", "validation"] = "validation",
input_prompt: str | None = None,
max_messages: int = 30,
subset: Literal[
"2023_all", "2023_level1", "2023_level2", "2023_level3"
] = "2023_all",
filter: Callable[[Sample], bool] = lambda x: True,
input_prompt: str | None = None,
max_messages: int = 30,
split: Literal["test", "validation"] = "validation",
filter: Callable[[Sample], bool] | None = None,
) -> Task:
if not os.path.exists(GAIA_DATASET_LOCATION):
raise FileNotFoundError(
"GAIA dataset not found. Please run 'python dataset.py.' to download the dataset."
)

dataset = hf_dataset(
str(GAIA_DATASET_LOCATION),
# read dataset
dataset = read_dataset(
input_prompt=input_prompt or DEFAULT_INPUT_PROMPT,
subset=subset,
split=split,
sample_fields=gaia_hf_record_to_sample(input_prompt, split),
trust=True, # Trust GAIA's remote code execution during dataset loading
name=subset,
filter=filter if filter else lambda x: True,
)

dataset = dataset.filter(filter)
# provide default plan if required
if plan is None:
plan = basic_agent(
init=system_message(
"Please solve the coding task below. Once you are done, "
+ "use the submit() tool to provide your answer."
),
tools=[bash(), web_search()],
)

# resolve scorer (test split has no answers)
scorer = match() if split == "validation" else None

# return task
return Task(
dataset=dataset,
plan=plan if plan is not None else DEFAULT_PLAN,
scorer=match() if split == "validation" else None, # Test split has no answers
plan=plan,
scorer=scorer,
sandbox="docker",
max_messages=max_messages,
)


def gaia_hf_record_to_sample(
input_prompt: str | None, split: str
) -> Callable[[dict[str, Any]], Sample]:
input_prompt = input_prompt if input_prompt is not None else DEFAULT_INPUT_PROMPT
files_location = GAIA_DATASET_LOCATION / "2023" / split
DEFAULT_INPUT_PROMPT = """Please answer the question below. You should:
- Return only your answer, which should be a number, or a short phrase with as
few words as possible, or a comma separated list of numbers and/or strings.
- If the answer is a number, return only the number without any units unless
specified otherwise.
- If the answer is a string, don't include articles, and don't use
abbreviations (e.g. for states).
- If the answer is a comma separated list, apply the above rules to each
element in the list.
Any files or attachments mentioned in the question can be found in the
/shared_files/ directory (some questions do not have associated files). Here
is the question:
def gaia_hf_record_to_sample(record: dict[str, Any]) -> Sample:
{question}"""


def read_dataset(
input_prompt: str,
subset: str,
split: str,
filter: Callable[[Sample], bool] = lambda x: True,
) -> Dataset:
# download dataset if required
if not os.path.exists(GAIA_DATASET_LOCATION):
GAIA_DATASET_LOCATION.mkdir(parents=True, exist_ok=True)
snapshot_download(
repo_id="gaia-benchmark/GAIA",
repo_type="dataset",
local_dir=GAIA_DATASET_LOCATION,
)

# map record to sample
def record_to_sample(record: dict[str, Any]) -> Sample:
# map fields
sample = Sample(
input=input_prompt.format(question=record["Question"]),
target=record["Final answer"],
id=record["task_id"],
metadata={
"level": record["level"],
"level": record["Level"],
"Annotator Metadata": record["Annotator Metadata"],
},
setup="mkdir -p /shared_files/",
)

# apply input prompt
sample.input = input_prompt.format(question=sample.input)

# Setup files into the sample
# provide sample files
files_location = GAIA_DATASET_LOCATION / "2023" / split
files = [file for file in os.listdir(files_location) if str(sample.id) in file]
if len(files) > 0:
sample.files = {"/shared_files/" + files[0]: str((files / files[0]))}
sample.files = {
"/shared_files/" + files[0]: (files_location / files[0]).as_posix()
}

return sample

return gaia_hf_record_to_sample


DEFAULT_INPUT_PROMPT = """Please answer the question below. You should:
- Return only your answer, which should be a number, or a short phrase with as few words as possible, or a comma separated list of numbers and/or strings.
- If the answer is a number, return only the number without any units unless specified otherwise.
- If the answer is a string, don't include articles, and don't use abbreviations (e.g. for states).
- If the answer is a comma separated list, apply the above rules to each element in the list.
Any files or attachments mentioned in the question can be found in the /shared_files/ directory (some questions do not have associated files). Here is the question:
{question}"""
# read dataset
dataset = hf_dataset(
GAIA_DATASET_LOCATION.as_posix(),
name=subset,
split=split,
sample_fields=record_to_sample,
trust=True, # Trust GAIA's remote code execution during dataset loading
)

DEFAULT_PLAN = basic_agent(
init=system_message(
"Please solve the coding task below. Once you are done, use your submit tool."
),
tools=[bash(), web_search()],
)
# apply filter (if any) and return
return dataset.filter(filter)
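
The file-attachment step in `record_to_sample` can be illustrated in isolation. This is a standalone sketch under stated assumptions, not the code from the commit: the helper name `sample_files` and the demo file names are hypothetical, and `sorted()` is added so the choice of file is deterministic (the diff takes the first match from `os.listdir()`, whose order is arbitrary).

```python
import os
import tempfile
from pathlib import Path


def sample_files(files_location: Path, sample_id: str) -> dict[str, str]:
    """Map the first file whose name contains sample_id into /shared_files/.

    Hypothetical helper mirroring the lookup in record_to_sample.
    """
    matches = sorted(f for f in os.listdir(files_location) if sample_id in f)
    if not matches:
        # Some questions have no associated file.
        return {}
    return {"/shared_files/" + matches[0]: (files_location / matches[0]).as_posix()}


# Demo with a temporary directory standing in for resources/2023/<split>.
with tempfile.TemporaryDirectory() as tmp:
    location = Path(tmp)
    (location / "abc123.xlsx").write_text("demo")
    (location / "metadata.jsonl").write_text("{}")
    mapping = sample_files(location, "abc123")
    no_match = sample_files(location, "zzz")
```

The key of the mapping is the path inside the sandbox and the value is the path on the host, which matches the shape assigned to `sample.files` in the diff.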
