Gaia bench with tape agent and multitool env #214

Merged
74 commits merged on Apr 21, 2025
Collaborator

@ollmer ollmer commented Feb 7, 2025

TODO:

  1. Reuse tapeagents Tool Collection Environment
  2. Make agents and environments interact through the tape
  3. Introduce a simple way to wrap new benchmarks with their own action space into a new environment
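
A minimal sketch of items 2 and 3, assuming nothing about the actual tapeagents API: every class and function name below (`Step`, `Tape`, `MultiToolEnv`, `EchoAgent`, `run`) is hypothetical and exists only to illustrate the idea. The agent and the environment never call each other directly; each one reads the last step on a shared tape and appends its own. Under that assumption, wrapping a new benchmark with its own action space amounts to passing a different tool mapping to the environment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    kind: str      # "action", "observation" or "answer"
    name: str      # tool name for actions/observations, "final" for answers
    payload: str


@dataclass
class Tape:
    """Hypothetical shared log that both agent and environment append to."""
    steps: List[Step] = field(default_factory=list)

    def append(self, step: Step) -> None:
        self.steps.append(step)


class MultiToolEnv:
    """Hypothetical environment: its action space is a dict of callables."""

    def __init__(self, tools: Dict[str, Callable[[str], str]]):
        self.tools = tools

    def react(self, tape: Tape) -> Tape:
        last = tape.steps[-1]
        if last.kind == "action":
            tool = self.tools.get(last.name)
            result = tool(last.payload) if tool else f"unknown tool: {last.name}"
            tape.append(Step("observation", last.name, result))
        return tape


class EchoAgent:
    """Hypothetical agent: calls one tool, then answers with what it observed."""

    def act(self, tape: Tape) -> Tape:
        last = tape.steps[-1] if tape.steps else None
        if last is not None and last.kind == "observation":
            tape.append(Step("answer", "final", last.payload))
        else:
            tape.append(Step("action", "upper", "hello"))
        return tape


def run(agent: EchoAgent, env: MultiToolEnv, max_turns: int = 5) -> Tape:
    """Alternate agent and environment turns until the agent answers."""
    tape = Tape()
    for _ in range(max_turns):
        tape = agent.act(tape)
        if tape.steps[-1].kind == "answer":
            break
        tape = env.react(tape)
    return tape
```

For example, `run(EchoAgent(), MultiToolEnv({"upper": str.upper}))` produces a tape with an action, an observation, and an answer, in that order. Swapping in another benchmark's tools only changes the dict passed to `MultiToolEnv`.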

Description by Korbit AI

What change is being made?

Add a new tape agent integrated with the Gaia benchmark environment and a multitool environment, along with new configurations, agent classes, a Makefile for setup tasks, and support for LLM configurations.

Why are these changes being made?

The integration lets tape agents run the Gaia benchmark and gives them a configurable environment in which to call multiple tools. Consolidating this functionality supports evaluating models on complex tasks while keeping the codebase extensible and organized.

@ollmer ollmer requested a review from recursix February 7, 2025 12:55
@ollmer ollmer changed the base branch from tau-bench to main March 13, 2025 11:39
@ollmer ollmer changed the title from "[WIP] Multitool envs" to "[WIP] Gaia bench with tape agent and multitool env" Mar 13, 2025
@@ -0,0 +1,25 @@
import logging
Collaborator

To have agents self-contained, let's move these scripts to the agent's directory.

Collaborator Author

But this script is not specific to a single agent; it is for the "agent + benchmark" pair. Do you want to put it into the tapeagent dir?

Collaborator Author

move to agents/tapeagent/experiments/

@@ -0,0 +1,41 @@
#!/bin/bash

Collaborator

To have agents self-contained, let's move these scripts to the agent's directory.

@recursix recursix merged commit cad0629 into main Apr 21, 2025
3 checks passed
@recursix recursix deleted the multitool_envs branch April 21, 2025 20:46