Gaia bench with tape agent and multitool env #214
Conversation
scripts/run_gaia.py
Outdated
@@ -0,0 +1,25 @@
+import logging
To have agents self-contained, let's move these scripts to the agent's directory.
But this script is not specific to a single agent; it is for the pair "agent + benchmark". Do you want to put it into the tapeagent dir?
move to agents/tapeagent/experiments/
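Following the resolution above, the scripts would live under the agent's own directory. A sketch of the resulting layout (only the two script names come from this PR's diffs; the surrounding structure is an assumption):

```text
agents/
  tapeagent/
    experiments/
      run_gaia.py     # benchmark driver, moved from scripts/
      setup_gaia.sh   # environment setup, moved from scripts/
```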
scripts/setup_gaia.sh
Outdated
@@ -0,0 +1,41 @@
+#!/bin/bash
+
To have agents self-contained, let's move these scripts to the agent's directory.
TODO:
Description by Korbit AI
What change is being made?
Add a new `tapeagent` with integration for the Gaia benchmarking environment and the multitool environment, including new configurations, agent classes, a new Makefile for setup tasks, and support for LLM configurations.
Why are these changes being made?
This integration aims to enhance the Gaia benchmarking capabilities by utilizing tape agents and providing a flexible environment configuration to allow tape agents to run multiple tools. The approach consolidates functionality, enabling advanced testing and evaluation of AI models within complex task environments, while maintaining an extensible and organized codebase structure.
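To make the description concrete, here is a minimal, hypothetical sketch of what a GAIA-style benchmark driver such as `run_gaia.py` might do: iterate over tasks, call an agent, and score exact-match accuracy. The names `Task`, `run_benchmark`, and the scoring rule are illustrative assumptions, not the TapeAgents API or the actual code in this PR.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable, Iterable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class Task:
    """One benchmark item (hypothetical shape; GAIA tasks carry more fields)."""
    task_id: str
    question: str
    expected_answer: str


def run_benchmark(tasks: Iterable[Task], agent: Callable[[str], str]) -> dict:
    """Run the agent on each task and report exact-match accuracy."""
    results = []
    for task in tasks:
        answer = agent(task.question)
        correct = answer.strip() == task.expected_answer.strip()
        results.append(
            {"task_id": task.task_id, "answer": answer, "correct": correct}
        )
        logger.info("task %s: %s", task.task_id, "ok" if correct else "miss")
    accuracy = sum(r["correct"] for r in results) / len(results) if results else 0.0
    return {"accuracy": accuracy, "results": results}


if __name__ == "__main__":
    # Toy example: a one-task benchmark with a trivially correct agent.
    demo_tasks = [Task("t1", "What is 2 + 2?", "4")]
    report = run_benchmark(demo_tasks, agent=lambda q: "4")
    print(json.dumps({"accuracy": report["accuracy"]}))
```

A real driver would additionally load the GAIA split from disk, wire the agent to an LLM and the multitool environment, and persist the resulting tapes rather than a flat result list.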