diff --git a/docs/examples/examples.bib b/docs/examples/examples.bib
index 89a73305e..cb4e59671 100644
--- a/docs/examples/examples.bib
+++ b/docs/examples/examples.bib
@@ -257,3 +257,13 @@ @misc{jimenez2024swebenchlanguagemodelsresolve
       primaryClass={cs.CL},
       url={https://arxiv.org/abs/2310.06770},
 }
+
+@misc{mialon2023gaiabenchmarkgeneralai,
+      title={GAIA: a benchmark for General AI Assistants},
+      author={Grégoire Mialon and Clémentine Fourrier and Craig Swift and Thomas Wolf and Yann LeCun and Thomas Scialom},
+      year={2023},
+      eprint={2311.12983},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2311.12983},
+}
diff --git a/docs/examples/examples.yml b/docs/examples/examples.yml
index 765e7d885..1f49a34a0 100644
--- a/docs/examples/examples.yml
+++ b/docs/examples/examples.yml
@@ -31,6 +31,16 @@
   demonstrates: ["Scoring", "Sandbox", "Tools"]
   contributors: ["max-kaufmann"]
 
+- title: "GAIA: A Benchmark for General AI Assistants"
+  description: |
+    GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs
+  path: evals/gaia
+  arxiv: https://arxiv.org/abs/2311.12983
+  cite: mialon2023gaiabenchmarkgeneralai
+  group: Assistants
+  demonstrates: ["Web Search", "Sandbox", "Tools"]
+  contributors: ["max-kaufmann"]
+
 - title: "InterCode: Capture the Flag"
   description: |
     Measure expertise in coding, cryptography (i.e. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.
diff --git a/evals/gaia/.gitignore b/evals/gaia/.gitignore
index 6ccf6b463..5676bcbad 100644
--- a/evals/gaia/.gitignore
+++ b/evals/gaia/.gitignore
@@ -1 +1 @@
-./resources/*
\ No newline at end of file
+resources/*
\ No newline at end of file
diff --git a/evals/gaia/README.md b/evals/gaia/README.md
index 60c39111f..edb965924 100644
--- a/evals/gaia/README.md
+++ b/evals/gaia/README.md
@@ -1,13 +1,48 @@
 # GAIA
 This is an inspect-native implementation of [the GAIA (General AI Assistants)](https://arxiv.org/abs/2311.12983) benchmark, consisting of 450 questions testing tool use on realistic assistant tasks (mostly web browsing).
 
-## Installation
+## Prerequisites
+
+1) **Python Dependencies** Install the Python dependencies for the GAIA task with:
+
+   ```
+   pip install -r requirements.txt
+   ```
+
+2) **Docker Engine** The GAIA task uses [tool calling](https://inspect.ai-safety-institute.org.uk/tools.html) to enable the model to execute bash commands. Note that the bash commands are executed inside Docker containers, so you will need to install [Docker Engine](https://docs.docker.com/engine/install/) in order to run the evaluation.
+
+3) **Hugging Face Dataset** Upon running the GAIA task, it will attempt to download the dataset from the HuggingFace hub. For this to work, you will need to gain access to the dataset (by filling out a form on [the GAIA huggingface repository](https://huggingface.co/datasets/gaia-benchmark/GAIA)), and also [create and set an access token](https://huggingface.co/docs/hub/en/security-tokens). You will need to define the `HF_TOKEN` environment variable to access the dataset:
+
+   ```
+   HF_TOKEN=
+   ```
+
+4) **Search Provider** The default solver for the GAIA task uses the inspect [web_search()](https://inspect.ai-safety-institute.org.uk/tools.html#sec-web-search) tool. The `web_search()` tool uses [Google Programmable Search Engine](https://programmablesearchengine.google.com/about/). To use it you will therefore need to set up your own Google Programmable Search Engine and also enable the [Programmable Search Element Paid API](https://developers.google.com/custom-search/docs/paid_element). Then, ensure that the following environment variables are defined:
+
+   ```
+   GOOGLE_CSE_ID=
+   GOOGLE_CSE_API_KEY=
+   ```
 
-1) **Install requirements.** As well as the requirements provided by inspect, this benchmark has its own dependencies. You can install them in your virtual environment by running ```pip install -r requirements.txt```
-2) **Download the dataset.** Upon loading gaia.py file, it will attempt to download the dataset from the HuggingFace hub. For this to work, you will need to gain access to the dataset (by filling out a form on [the GAIA huggingface repository](https://huggingface.co/datasets/gaia-benchmark/GAIA)), and also [create and set an access token](https://huggingface.co/docs/hub/en/security-tokens).
 
 ## Usage
-After install, the GAIA benchmark can be used in the following way:
+
+After fulfilling the prerequisites, run the GAIA task against various models with:
+
+```bash
+$ inspect eval gaia.py --model openai/gpt-4o
+$ inspect eval gaia.py --model google/gemini-1.5-pro
+```
+
+You might want to limit the number of samples evaluated for initial experimentation:
+
+```bash
+$ inspect eval gaia.py --model openai/gpt-4o --limit 5
+```
+
+## Development
+
+If you want to develop a solver for the GAIA task, import the `@task` from the `gaia` module and pass your own solver to it. For example:
 
 ```python
 from inspect_ai import eval
@@ -15,12 +50,23 @@ from inspect_ai.solver import use_tools, generate, system_message, basic_agent
 from inspect_ai.tool import bash, web_search
 from gaia import gaia
 
-# Create a task to attempt to solve GAIA with a basic agent.
 agent = basic_agent(tools=[bash()])
-task = gaia(split="validation", plan=agent)
+task = gaia(plan=agent)
+
+eval(task, model="openai/gpt-4o")
+```
+
+See the documentation on the Inspect [Agents API](https://inspect.ai-safety-institute.org.uk/agents-api.html) for additional information on developing agents.
+
+### Test Split
+
+You can evaluate against the "test" split with:
+
+```python
+agent = basic_agent(tools=[bash()])
+task = gaia(plan=agent, split="test")
 
-# For the demonstration, we will select only the first 5 tasks.
-eval(task, model="openai/gpt-4o", limit=5)
+eval(task, model="openai/gpt-4o")
 ```
 
-Note that the GAIA test split does not come with any solutions - they must be uploaded to the online leaderboard.
\ No newline at end of file
+Note that the GAIA "test" split does not come with any solutions so is not scored (rather, solutions are uploaded to the online leaderboard).
\ No newline at end of file
diff --git a/evals/gaia/dataset.py b/evals/gaia/dataset.py
deleted file mode 100644
index e7b4e6fd0..000000000
--- a/evals/gaia/dataset.py
+++ /dev/null
@@ -1,11 +0,0 @@
-from gaia import GAIA_DATASET_LOCATION
-from huggingface_hub import snapshot_download
-
-# Run this file to download the GAIA dataset.
-
-GAIA_DATASET_LOCATION.mkdir(parents=True, exist_ok=True)
-snapshot_download(
-    repo_id="gaia-benchmark/GAIA",
-    repo_type="dataset",
-    local_dir=GAIA_DATASET_LOCATION,
-)
diff --git a/evals/gaia/gaia.py b/evals/gaia/gaia.py
index e1b19d5ee..320247777 100644
--- a/evals/gaia/gaia.py
+++ b/evals/gaia/gaia.py
@@ -2,8 +2,10 @@
 from pathlib import Path
 from typing import Any, Callable, Literal
 
+from huggingface_hub import snapshot_download  # type: ignore
+
 from inspect_ai import Task, task
-from inspect_ai.dataset import Sample, hf_dataset
+from inspect_ai.dataset import Dataset, Sample, hf_dataset
 from inspect_ai.scorer import match
 from inspect_ai.solver import Plan, basic_agent, system_message
 from inspect_ai.tool import bash, web_search
@@ -14,81 +16,113 @@
 @task
 def gaia(
     plan: Plan | None = None,
-    split: Literal["test", "validation"] = "validation",
+    input_prompt: str | None = None,
+    max_messages: int = 30,
     subset: Literal[
         "2023_all", "2023_level1", "2023_level2", "2023_level3"
     ] = "2023_all",
-    filter: Callable[[Sample], bool] = lambda x: True,
-    input_prompt: str | None = None,
-    max_messages: int = 30,
+    split: Literal["test", "validation"] = "validation",
+    filter: Callable[[Sample], bool] | None = None,
 ) -> Task:
-    if not os.path.exists(GAIA_DATASET_LOCATION):
-        raise FileNotFoundError(
-            "GAIA dataset not found. Please run 'python dataset.py.' to download the dataset."
-        )
-
-    dataset = hf_dataset(
-        str(GAIA_DATASET_LOCATION),
+    # read dataset
+    dataset = read_dataset(
+        input_prompt=input_prompt or DEFAULT_INPUT_PROMPT,
+        subset=subset,
         split=split,
-        sample_fields=gaia_hf_record_to_sample(input_prompt, split),
-        trust=True,  # Trust GAIA's remote code execution during dataset loading
-        name=subset,
+        filter=filter if filter else lambda x: True,
     )
 
-    dataset = dataset.filter(filter)
+    # provide default plan if required
+    if plan is None:
+        plan = basic_agent(
+            init=system_message(
+                "Please solve the coding task below. Once you are done, "
+                + "use the submit() tool to provide your answer."
+            ),
+            tools=[bash(), web_search()],
+        )
 
+    # resolve scorer (test split has no answers)
+    scorer = match() if split == "validation" else None
+
+    # return task
     return Task(
         dataset=dataset,
-        plan=plan if plan is not None else DEFAULT_PLAN,
-        scorer=match() if split == "validation" else None,  # Test split has no answers
+        plan=plan,
+        scorer=scorer,
         sandbox="docker",
         max_messages=max_messages,
     )
 
 
-def gaia_hf_record_to_sample(
-    input_prompt: str | None, split: str
-) -> Callable[[dict[str, Any]], Sample]:
-    input_prompt = input_prompt if input_prompt is not None else DEFAULT_INPUT_PROMPT
-    files_location = GAIA_DATASET_LOCATION / "2023" / split
+DEFAULT_INPUT_PROMPT = """Please answer the question below. You should:
+
+- Return only your answer, which should be a number, or a short phrase with as
+  few words as possible, or a comma separated list of numbers and/or strings.
+- If the answer is a number, return only the number without any units unless
+  specified otherwise.
+- If the answer is a string, don't include articles, and don't use
+  abbreviations (e.g. for states).
+- If the answer is a comma separated list, apply the above rules to each
+  element in the list.
+
+Any files or attachments mentioned in the question can be found in the
+/shared_files/ directory (some questions do not have associated files). Here
+is the question:
 
-    def gaia_hf_record_to_sample(record: dict[str, Any]) -> Sample:
+{question}"""
+
+
+def read_dataset(
+    input_prompt: str,
+    subset: str,
+    split: str,
+    filter: Callable[[Sample], bool] = lambda x: True,
+) -> Dataset:
+    # download dataset if required
+    if not os.path.exists(GAIA_DATASET_LOCATION):
+        GAIA_DATASET_LOCATION.mkdir(parents=True, exist_ok=True)
+        snapshot_download(
+            repo_id="gaia-benchmark/GAIA",
+            repo_type="dataset",
+            local_dir=GAIA_DATASET_LOCATION,
+        )
+
+    # map record to sample
+    def record_to_sample(record: dict[str, Any]) -> Sample:
+        # map fields
         sample = Sample(
             input=input_prompt.format(question=record["Question"]),
             target=record["Final answer"],
             id=record["task_id"],
             metadata={
-                "level": record["level"],
+                "level": record["Level"],
                 "Annotator Metadata": record["Annotator Metadata"],
             },
+            setup="mkdir -p /shared_files/",
         )
 
+        # apply input prompt
         sample.input = input_prompt.format(question=sample.input)
 
-        # Setup files into the sample
+        # provide sample files
+        files_location = GAIA_DATASET_LOCATION / "2023" / split
         files = [file for file in os.listdir(files_location) if str(sample.id) in file]
         if len(files) > 0:
-            sample.files = {"/shared_files/" + files[0]: str((files / files[0]))}
+            sample.files = {
+                "/shared_files/" + files[0]: (files_location / files[0]).as_posix()
+            }
 
         return sample
 
-    return gaia_hf_record_to_sample
-
-
-DEFAULT_INPUT_PROMPT = """Please answer the question below. You should:
-
-- Return only your answer, which should be a number, or a short phrase with as few words as possible, or a comma separated list of numbers and/or strings.
-- If the answer is a number, return only the number without any units unless specified otherwise.
-- If the answer is a string, don't include articles, and don't use abbreviations (e.g. for states).
-- If the answer is a comma separated list, apply the above rules to each element in the list.
-
-Any files or attachments mentioned in the question can be found in the /shared_files/ directory (some questions do not have associated files). Here is the question:
-
-{question}"""
+    # read dataset
+    dataset = hf_dataset(
+        GAIA_DATASET_LOCATION.as_posix(),
+        name=subset,
+        split=split,
+        sample_fields=record_to_sample,
+        trust=True,  # Trust GAIA's remote code execution during dataset loading
+    )
 
-DEFAULT_PLAN = basic_agent(
-    init=system_message(
-        "Please solve the coding task below. Once you are done, use your submit tool."
-    ),
-    tools=[bash(), web_search()],
-)
+    # apply filter (if any) and return
+    return dataset.filter(filter)
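
The attachment lookup in `record_to_sample` picks the first file in the split directory whose name contains the sample id and exposes it under `/shared_files/`. The following is a minimal standalone sketch of that mapping using only the standard library. The helper name `find_sample_files` and the file names are hypothetical, and `sorted()` is an addition for a deterministic pick (the patch itself uses `os.listdir` order directly).

```python
import os
import tempfile
from pathlib import Path


def find_sample_files(files_location: Path, sample_id: str) -> dict[str, str]:
    """Map the first attachment whose name contains sample_id into /shared_files/.

    Hypothetical helper mirroring the lookup in the patch; sorted() is an
    assumption added here so the chosen file does not depend on listdir order.
    """
    matches = sorted(f for f in os.listdir(files_location) if sample_id in f)
    if not matches:
        # Some questions have no associated file.
        return {}
    first = matches[0]
    return {"/shared_files/" + first: (files_location / first).as_posix()}


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the "<dataset>/2023/<split>" layout used by the patch.
    location = Path(tmp) / "2023" / "validation"
    location.mkdir(parents=True)
    (location / "task-001.xlsx").write_text("stub")
    (location / "task-002.png").write_text("stub")

    with_file = find_sample_files(location, "task-001")
    without_file = find_sample_files(location, "task-999")
```

Here `with_file` holds a single entry keyed by the in-sandbox path, and `without_file` is empty, which corresponds to leaving `sample.files` unset.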