Gaia bench with tape agent and multitool env #214

Merged
merged 74 commits into from
Apr 21, 2025
74 commits
11071e9
initial commit
recursix Jan 28, 2025
90150ab
multitool environment draft
ollmer Feb 7, 2025
8d4dc18
tapeagents and pydantic deps
ollmer Feb 7, 2025
3040571
Merge branch 'main' into multitool_envs
ollmer Mar 13, 2025
0791f2d
add GAIA gym
ollmer Mar 13, 2025
7e629bd
universal tape agent that can load any agent from config
ollmer Mar 13, 2025
27bcdc5
make tapeagent a single module
ollmer Mar 13, 2025
7691e49
add gaia agent conf
ollmer Mar 13, 2025
76958ee
gaia benchmark class and entrypoint script
ollmer Mar 13, 2025
ec8be75
working gaia agent
ollmer Mar 13, 2025
2ef8500
test tapeagent creation
ollmer Mar 13, 2025
c59e60b
working gaia bench and gym, with test
ollmer Mar 13, 2025
a1c30d1
move conf to the tapeagent dir
ollmer Mar 13, 2025
460febf
gai gym reset method that builds initial task observations
ollmer Mar 13, 2025
418764f
test gym creation and reset
ollmer Mar 13, 2025
55f2ffa
store thoughts in agent info without serialization
ollmer Mar 13, 2025
e0eeaca
move loop module from the browsergym
ollmer Mar 13, 2025
34f67b2
fix loop docstrings
ollmer Mar 13, 2025
2295781
fix gaia env, use agentlabs exp dirs
ollmer Mar 13, 2025
c65cf9d
add available tools to the agent config
ollmer Mar 13, 2025
921958f
small adjustments in loop
ollmer Mar 13, 2025
3ecd519
better logging configuration in study
ollmer Mar 13, 2025
4264168
fix gaia env
ollmer Mar 13, 2025
5b03e9a
fix gaia agent
ollmer Mar 13, 2025
0b4ce04
do not fail on objects in df
ollmer Mar 13, 2025
809ad00
fix docstrings
ollmer Mar 13, 2025
8461301
add gaia scorer for reward in gym, fix envargs serialization
ollmer Mar 13, 2025
ece0931
fix docstring
ollmer Mar 13, 2025
518e17e
gpt4o mini for faster exps
ollmer Mar 13, 2025
f14ef47
fix conf path
ollmer Mar 13, 2025
d34593e
fix code exec container name, add support for gaia levels in benchmark
ollmer Mar 13, 2025
ce6f063
gaia eval with 5 ray jobs on level 1 validation
ollmer Mar 13, 2025
3e07678
fix
ollmer Mar 13, 2025
1bf9d32
fix thoughts storage
ollmer Mar 13, 2025
4c9d14a
save tapes
ollmer Mar 14, 2025
aacd57c
fix formatting
ollmer Mar 14, 2025
528c7e2
restore imports in the loop
ollmer Mar 14, 2025
ef9eb19
shared vscode linter settings
ollmer Mar 14, 2025
09ef9a6
address review comments
ollmer Mar 19, 2025
59e7aaf
Merge branch 'main' into multitool_envs
ollmer Mar 19, 2025
3f31293
make makefile to run test locally
ollmer Mar 19, 2025
db5e743
Merge branch 'multitool_envs' of github.com:ServiceNow/AgentLab into …
ollmer Mar 19, 2025
c940572
update make file
ollmer Mar 19, 2025
8efba9d
mock gaia dataset for test
ollmer Mar 19, 2025
0dcb150
mock data for gaia test
ollmer Mar 19, 2025
f0bdcb8
relax constraints for ray time testing
ollmer Mar 19, 2025
5076b2d
fix tests
ollmer Mar 19, 2025
3fe376a
fix
ollmer Mar 19, 2025
b3df564
check miniwob in makefile
ollmer Mar 19, 2025
e6ebfd8
use local expargs everywhere
ollmer Mar 19, 2025
13eec41
tapes browser ui
ollmer Mar 19, 2025
cabc393
more info in tape metadata, better tape browser
ollmer Mar 20, 2025
a511cf3
script to prepare gaia env and run gaia exp
ollmer Apr 1, 2025
fd59c1c
gaia ray 8 workers
ollmer Apr 1, 2025
0cfe4be
trust remote code flag for loading gaia bench from the hf repo
ollmer Apr 1, 2025
15d0dfe
fix
ollmer Apr 1, 2025
6c6d052
fix
ollmer Apr 1, 2025
0a43a8d
replace run_gaia.sh with setup_gaia.sh
recursix Apr 1, 2025
4896c67
use shared code folder for all gaia tasks
ollmer Apr 14, 2025
7999bb0
separate gaia-related renderings from general tape view
ollmer Apr 15, 2025
3fd383d
fix
ollmer Apr 15, 2025
e98a0c2
render attached gaia task files into steps
ollmer Apr 15, 2025
55378a4
remaining fixes, eval now matched with the old tapeagents evals
ollmer Apr 15, 2025
c747915
config-driven gym with tools and bench
ollmer Apr 15, 2025
995bff7
fix
ollmer Apr 15, 2025
cfc2f20
adjust agent config to be exactly the same as in tapeagents
ollmer Apr 15, 2025
27537e9
common tapedata.sqlite for the experiment
ollmer Apr 15, 2025
9914dcc
separate act and react agent configs
ollmer Apr 15, 2025
dbba760
show steps num for tape in the tape selector
ollmer Apr 15, 2025
b57a1ab
treat tapes without stop step as truncated
ollmer Apr 17, 2025
63c3dd7
move gaia runners to tapeagent
ollmer Apr 17, 2025
1f75796
try o4mini
ollmer Apr 17, 2025
017c203
fix test
ollmer Apr 17, 2025
58c69c4
clean up loop.py
ollmer Apr 18, 2025
2 changes: 1 addition & 1 deletion .github/workflows/darglint.yml
@@ -31,4 +31,4 @@ jobs:
       run: pip list

     - name: Darglint checks
-      run: darglint -v 2 -z short .
+      run: darglint -v 2 -z short src/
9 changes: 6 additions & 3 deletions .gitignore
@@ -3,7 +3,7 @@ __pycache__/
*.py[cod]
*$py.class
results/
.vscode

# C extensions
*.so
# Distribution / packaging
@@ -160,11 +160,14 @@ cython_debug/
# MacOS
**/.DS_Store

.vscode

_sandbox.py

results/

# gradio
.gradio/
.gradio/

outputs/
miniwob-plusplus/
.miniwob-server.pid
19 changes: 19 additions & 0 deletions .vscode/launch.json
@@ -0,0 +1,19 @@
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python Debugger: Current File",
            "type": "debugpy",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "env": {
                "AGENTLAB_DEBUG": "1"
            }
        }
    ]
}
15 changes: 15 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,15 @@
{
    "[python]": {
        "editor.formatOnSave": true,
        "editor.defaultFormatter": "ms-python.black-formatter",
        "editor.codeActionsOnSave": {
            "source.organizeImports": "explicit",
            "source.fixAll": "never"
        }
    },
    "python.testing.pytestArgs": [
        "tests"
    ],
    "python.testing.unittestEnabled": false,
    "python.testing.pytestEnabled": true,
}
32 changes: 32 additions & 0 deletions Makefile
@@ -0,0 +1,32 @@
.PHONY: test setup miniwob lint stop-miniwob

setup:
	@pip install -e .
	@playwright install chromium --with-deps
	@python -c 'import nltk; nltk.download("punkt_tab")'

miniwob: stop-miniwob
	@git clone https://github.com/Farama-Foundation/miniwob-plusplus.git || true
	@cd miniwob-plusplus && git checkout 7fd85d71a4b60325c6585396ec4f48377d049838
	@python -m http.server 8080 --directory miniwob-plusplus/miniwob/html & echo $$! > .miniwob-server.pid
	@sleep 3
	@echo "MiniWob server started on http://localhost:8080"

check-miniwob:
	@curl -I "http://localhost:8080/miniwob/" || (echo "MiniWob not reachable" && exit 1)
	@echo "MiniWob server is reachable"

stop-miniwob:
	@kill -9 `cat .miniwob-server.pid` || true
	@rm -f .miniwob-server.pid
	@echo "MiniWob server stopped"

run-tests:
	@MINIWOB_URL="http://localhost:8080/miniwob/" pytest -n 5 --durations=10 -m 'not pricy' tests/
	@echo "Tests completed"

test: setup miniwob check-miniwob run-tests stop-miniwob

lint: setup
	@black src/ --check --diff
	@darglint -v 2 -z short src/
3 changes: 3 additions & 0 deletions requirements.txt
@@ -5,13 +5,15 @@ pytest==7.3.2
 flaky
 pytest-xdist
 pytest-playwright
+pydantic~=2.9
 dask
 distributed
 browsergym>=0.7.1
 joblib>=1.2.0
 openai>=1.7,<2
 langchain_community
 tiktoken
+tapeagents[converters]
 huggingface_hub
 contexttimer
 ipython
@@ -24,3 +26,4 @@ matplotlib
 ray[default]
 python-slugify
 pillow
+gymnasium>=0.27
4 changes: 2 additions & 2 deletions src/agentlab/agents/README.md
@@ -99,7 +99,7 @@ have to specify the type of each field (You can use Any if it is unknown)*
 ```python
 from dataclasses import dataclass
 from browsergym.experiment.agent import Agent
-from browsergym.experiment.loop import AgentArgs
+from agentlab.experiments.loop import AgentArgs


 @dataclass
@@ -116,7 +116,7 @@ class CustomAgentArgs(AgentArgs):
 To run experiments with your custom agent, define an instance of `ExpArgs` with the required parameters.

 ```python
-from browsergym.experiment.loop import ExpArgs
+from agentlab.experiments.loop import ExpArgs

 exp_args = ExpArgs(
     agent_args=CustomAgentArgs(custom_param="value"),
8 changes: 1 addition & 7 deletions src/agentlab/agents/generic_agent/reproducibility_agent.py
@@ -20,13 +20,10 @@

import bgym
from browsergym.experiments.agent import AgentInfo
from browsergym.experiments.loop import ExpArgs, ExpResult, yield_all_exp_results
from bs4 import BeautifulSoup
from langchain.schema import AIMessage, BaseMessage
from langchain_community.adapters.openai import convert_message_to_dict

from agentlab.agents.agent_args import AgentArgs
from agentlab.agents.dynamic_prompting import ActionFlags
from agentlab.experiments.loop import ExpArgs, ExpResult, yield_all_exp_results
from agentlab.experiments.study import Study
from agentlab.llm.chat_api import make_assistant_message
from agentlab.llm.llm_utils import Discussion, messages_to_dict
@@ -65,7 +62,6 @@ def get_stats(self):

 @dataclass
 class ReproAgentArgs(GenericAgentArgs):
-
     # starting with "_" will prevent from being part of the index in the load_results function
     _repro_dir: str = None

@@ -81,7 +77,6 @@ def make_agent(self):


 class ReproAgent(GenericAgent):
-
     def __init__(
         self,
         chat_model_args,
@@ -93,7 +88,6 @@ def __init__(
         super().__init__(chat_model_args, flags, max_retry)

     def get_action(self, obs):
-
         # replace the chat model with a reproducible chat that will mimic the
         # same answers
         step = len(self.actions)
6 changes: 3 additions & 3 deletions src/agentlab/agents/most_basic_agent/most_basic_agent.py
@@ -5,7 +5,7 @@
 import bgym

 from agentlab.agents.agent_args import AgentArgs
-from agentlab.llm.chat_api import make_system_message, make_user_message
+from agentlab.experiments.loop import ExpArgs
 from agentlab.llm.llm_configs import CHAT_MODEL_ARGS_DICT
 from agentlab.llm.llm_utils import (
     Discussion,
@@ -133,7 +133,7 @@ def parser(response: str) -> tuple[dict, bool, str]:

 # example for 2 experiments testing chain of thoughts on a miniwob task
 exp_args = [
-    bgym.ExpArgs(
+    ExpArgs(
         agent_args=MostBasicAgentArgs(
             temperature=0.1,
             use_chain_of_thought=True,
@@ -142,7 +142,7 @@ def parser(response: str) -> tuple[dict, bool, str]:
         env_args=env_args,
         logging_level=logging.INFO,
     ),
-    bgym.ExpArgs(
+    ExpArgs(
         agent_args=MostBasicAgentArgs(
             temperature=0.1,
             use_chain_of_thought=False,
2 changes: 0 additions & 2 deletions src/agentlab/agents/tapeagent/.gitignore

This file was deleted.

65 changes: 65 additions & 0 deletions src/agentlab/agents/tapeagent/__init__.py
@@ -0,0 +1,65 @@
import json
from dataclasses import asdict, is_dataclass

import numpy as np
from tapeagents.core import Step, StepMetadata
from tapeagents.dialog_tape import AssistantStep, AssistantThought
from tapeagents.io import save_json_tape, save_tape_images

from agentlab.agents.tapeagent.agent import DictObservation, Tape, TapeAgent

__all__ = ["as_tape", "save_tape", "TapeAgent", "Tape"]


def as_tape(steps_info: list) -> Tape:
    """
    Create a Tape object from the steps info.

    Args:
        steps_info: list of StepInfo objects.

    Returns:
        Tape: a Tape object containing the steps and metadata.
    """

    class JsonEncoder(json.JSONEncoder):
        def default(self, obj):
            if is_dataclass(obj):
                return asdict(obj)  # type: ignore
            if isinstance(obj, np.integer):
                return int(obj)
            if isinstance(obj, np.floating):
                return float(obj)
            if isinstance(obj, np.ndarray):
                return obj.tolist()
            return super().default(obj)

    steps: list[Step] = []
    for step_info in steps_info:
        if step_info.obs is not None:
            json_obs = json.dumps(step_info.obs, cls=JsonEncoder)
            steps.append(DictObservation(content=json_obs))
        if thought := step_info.agent_info.get("think"):
            steps.append(AssistantThought(content=thought))
        if step_info.action is not None:
            step_metadata = StepMetadata(
                other=dict(
                    reward=step_info.reward,
                    raw_reward=step_info.raw_reward,
                    terminated=step_info.terminated,
                    truncated=step_info.truncated,
                    agent_info=step_info.agent_info,
                    stats=step_info.stats,
                )
            )
            steps.append(AssistantStep(content=step_info.action, metadata=step_metadata))
    return Tape(steps=steps)


def save_tape(exp_dir: str, episode_info: list, task: dict, tape: Tape):
    tape.metadata.reward = sum([step.reward for step in episode_info])
    tape.metadata.truncated = episode_info[-1].truncated
    tape.metadata.terminated = episode_info[-1].terminated
    tape.metadata.task = task
    save_json_tape(tape, exp_dir, "tape.json")
    save_tape_images(tape, f"{exp_dir}/tape_attachments")
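
Reading note on `as_tape`: the nested `JsonEncoder` relies on `json.JSONEncoder.default`, which the standard library calls only for objects it cannot serialize by itself. The following is a minimal standalone sketch of that fallback pattern, not the PR's code. `Point` and `SketchEncoder` are hypothetical names, and a `Decimal` branch stands in for the numpy branches so the sketch needs only the standard library.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from decimal import Decimal


@dataclass
class Point:  # hypothetical observation payload, not part of the PR
    x: int
    y: int


class SketchEncoder(json.JSONEncoder):
    def default(self, obj):
        # Reached only for objects the stock encoder cannot serialize.
        if is_dataclass(obj) and not isinstance(obj, type):
            return asdict(obj)
        if isinstance(obj, Decimal):  # stand-in for the numpy scalar branches
            return float(obj)
        # Unknown types fall through to the base class, which raises TypeError.
        return super().default(obj)


obs = {"cursor": Point(1, 2), "score": Decimal("0.5")}
encoded = json.dumps(obs, cls=SketchEncoder)
decoded = json.loads(encoded)
```

Because the result is a plain JSON string, it can be stored in a text field such as `DictObservation.content`, at the cost of losing the original Python types on the way back.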
103 changes: 103 additions & 0 deletions src/agentlab/agents/tapeagent/agent.py
@@ -0,0 +1,103 @@
import logging
from dataclasses import dataclass
from typing import Literal

import bgym
import hydra
from omegaconf import DictConfig
from pydantic import Field
from tapeagents.agent import Agent
from tapeagents.core import Action, Observation, StopStep, TapeMetadata, Thought
from tapeagents.core import Tape as BaseTape

from agentlab.agents.agent_args import AgentArgs

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


class ExtendedMetadata(TapeMetadata):
    name: str = ""
    task: dict = {}
    terminated: bool = False
    truncated: bool = False
    reward: float = 0.0
    attempt_number: int = 0
    other: dict = {}


class Tape(BaseTape):
    metadata: ExtendedMetadata = Field(default_factory=ExtendedMetadata)  # type: ignore


def load_config(config_name: str) -> DictConfig:
    with hydra.initialize(config_path="conf", version_base="1.1"):
        config = hydra.compose(config_name=config_name)
    return config


@dataclass
class TapeAgentArgs(AgentArgs):
    config: DictConfig = None  # type: ignore

    def make_agent(self) -> bgym.Agent:
        agent: Agent = hydra.utils.instantiate(self.config.agent)
        return TapeAgent(agent=agent)


@dataclass
class TapeAgentInfo(bgym.AgentInfo):
    thoughts: list[Thought] = None  # type: ignore


class DictObservation(Observation):
    """
    Container for wrapping old dict observation into new Observation class.
    """

    kind: Literal["dict_observation"] = "dict_observation"  # type: ignore
    content: str


class TapeAgent(bgym.Agent):
    agent: Agent
    tape: Tape

    def __init__(self, agent: Agent):
        super().__init__()
        self.agent = agent
        self.tape = Tape(steps=[])

    def obs_preprocessor(self, obs: Observation | list[Observation]) -> list[Observation]:
        if isinstance(obs, Observation):
            obs = [obs]
        assert isinstance(obs, list), f"Expected list of Observations, got {type(obs)}"
        logger.info(f"Observations: {[type(o).__name__ for o in obs]}")
        return obs

    def get_action(self, obs: Observation | list[Observation]) -> tuple[Action, TapeAgentInfo]:
        self.tape += obs  # type: ignore
        thoughts: list[Thought] = []
        action = None
        while not action:
            for event in self.agent.run(self.tape):
                if not event.step:
                    continue
                self.tape = self.tape.append(event.step)
                if isinstance(event.step, Thought):
                    thoughts.append(event.step)
                    logger.info(f"Thought: {event.step.llm_view()}")
                elif isinstance(event.step, Action) and not action:  # we use first action only
                    action = event.step
                    logger.info(f"Action: {action.llm_view()}")
                else:
                    # there could be control flow steps for switching nodes and if clauses
                    logger.info(f"Other step: {type(event.step)}")
        logger.info(f"Tape after run: ({len(self.tape)}) {[type(s).__name__ for s in self.tape]}")
        return (action, TapeAgentInfo(thoughts=thoughts))

    @property
    def final_tape(self) -> Tape:
        truncated = not any([isinstance(s, StopStep) for s in self.tape.steps])
        self.tape.metadata = ExtendedMetadata(author=self.agent.name, truncated=truncated)
        return self.tape
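
Reading note on `TapeAgent.get_action` and `final_tape`: the loop keeps re-running the agent until some step is an action, returns only the first action, and still appends every emitted step to the tape. A tape with no stop step is then reported as truncated. The sketch below mimics that control flow with plain stand-in classes; it assumes nothing about the real `tapeagents` API, and `fake_run`, `get_action`, and `is_truncated` are hypothetical helpers written for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Thought:  # stand-in for a reasoning step
    text: str


@dataclass
class Action:  # stand-in for an executable step
    name: str


@dataclass
class StopStep(Action):  # stand-in for the step that ends an episode
    pass


def fake_run(tape):
    """Hypothetical stand-in for one agent pass: yields steps, some empty."""
    yield Thought("plan")
    yield None  # an event that carries no step
    yield Action("click")
    yield Action("type")  # later actions are recorded but not returned


def get_action(tape, run):
    thoughts, action = [], None
    while action is None:  # repeat passes until an action shows up
        for step in run(tape):
            if step is None:
                continue
            tape.append(step)  # every step lands on the tape
            if isinstance(step, Thought):
                thoughts.append(step)
            elif isinstance(step, Action) and action is None:
                action = step  # keep the first action only
    return action, thoughts


def is_truncated(tape):
    # No stop step anywhere on the tape means the episode was cut short.
    return not any(isinstance(s, StopStep) for s in tape)


tape = []
action, thoughts = get_action(tape, fake_run)
truncated_before_stop = is_truncated(tape)
tape.append(StopStep("stop"))
truncated_after_stop = is_truncated(tape)
```

One consequence worth keeping in mind when reading saved tapes: in this sketch the second action (`type`) is on the tape even though the environment would only ever receive `click`.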