Gaia bench with tape agent and multitool env #214
Conversation
scripts/run_gaia.py
Outdated
@@ -0,0 +1,25 @@
+import logging
To have agents self-contained, let's move these scripts to the agent's directory.
But this script is not specific to a single agent; it is for the pair "agent + benchmark". Do you want to put it into the tapeagent dir?
move to agents/tapeagent/experiments/
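Following the resolution above, the scripts would live under the agent's own directory. A sketch of the resulting layout (only the two script names come from this PR's diffs; the surrounding structure is an assumption):

```text
agents/
  tapeagent/
    experiments/
      run_gaia.py     # benchmark driver, moved from scripts/
      setup_gaia.sh   # environment setup, moved from scripts/
```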
scripts/setup_gaia.sh
Outdated
@@ -0,0 +1,41 @@
+#!/bin/bash
+
To have agents self-contained, let's move these scripts to the agent's directory.
TODO:
Description by Korbit AI
What change is being made?
Add a new `tapeagent` with integration for the Gaia benchmarking environment and the multitool environment, including new configurations, agent classes, a new Makefile for setup tasks, and support for LLM configurations.
Why are these changes being made?
This integration aims to enhance the Gaia benchmarking capabilities by utilizing tape agents and providing a flexible environment configuration to allow tape agents to run multiple tools. The approach consolidates functionality, enabling advanced testing and evaluation of AI models within complex task environments, while maintaining an extensible and organized codebase structure.
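To make the description concrete, here is a minimal, hypothetical sketch of what a GAIA-style benchmark driver such as `run_gaia.py` might do: iterate over tasks, call an agent, and score exact-match accuracy. The names `Task`, `run_benchmark`, and the scoring rule are illustrative assumptions, not the TapeAgents API or the actual code in this PR.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable, Iterable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class Task:
    """One benchmark item (hypothetical shape; GAIA tasks carry more fields)."""
    task_id: str
    question: str
    expected_answer: str


def run_benchmark(tasks: Iterable[Task], agent: Callable[[str], str]) -> dict:
    """Run the agent on each task and report exact-match accuracy."""
    results = []
    for task in tasks:
        answer = agent(task.question)
        correct = answer.strip() == task.expected_answer.strip()
        results.append(
            {"task_id": task.task_id, "answer": answer, "correct": correct}
        )
        logger.info("task %s: %s", task.task_id, "ok" if correct else "miss")
    accuracy = sum(r["correct"] for r in results) / len(results) if results else 0.0
    return {"accuracy": accuracy, "results": results}


if __name__ == "__main__":
    # Toy example: a one-task benchmark with a trivially correct agent.
    demo_tasks = [Task("t1", "What is 2 + 2?", "4")]
    report = run_benchmark(demo_tasks, agent=lambda q: "4")
    print(json.dumps({"accuracy": report["accuracy"]}))
```

A real driver would additionally load the GAIA split from disk, wire the agent to an LLM and the multitool environment, and persist the resulting tapes rather than a flat result list.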