pip install -r requirements.txt
Populate a .env file in the root directory of the project with the following keys:
FIREWORKS_API_KEY=<key3>
ANTHROPIC_API_KEY=<key4>
OPENAI_API_KEY=<key5>
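Before launching a run, it can help to confirm that the .env file actually defines these keys. The following is a minimal sketch using only the standard library. The helper names `parse_env` and `missing_keys` are hypothetical and not part of this repo, and treating all three keys as required is an assumption (you may only need the key for the provider you use).

```python
from pathlib import Path

# Keys listed above; treating all three as required is an assumption.
REQUIRED_KEYS = ("FIREWORKS_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env_text: str) -> list[str]:
    """Return the required keys that are absent or empty."""
    values = parse_env(env_text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    print("missing keys:", missing_keys(text))
```

Run it from the project root. An empty list means every key has a non-empty value.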
python -m src.main \
--model, -m # Which model is used for the execution agent
--model-type, -mt # This model's provider (e.g. openai)
--explore-model, -em # The model used for exploration
--explore-model-type, -emt # The explore model's provider (e.g. openai)
--execute-temp, -et # What temp to execute at (default 0)
--explore-temp, -ept # What temp to explore at (default 0)
--explore-environment-iterations, -eei # How many iterations of SSEAL to run.
--max_iterations_per_episode, -mipe # How many iterations per query
--benchmark_path, -bp # Which benchmark to run, should be a .py file in src/benchmarks
--task, -t # Which task to run (eg. sports_data)
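To make the flag list concrete, here is a hypothetical sketch of how options like these could be declared with argparse. It is not the repo's actual src/main.py. The value types, the `None` defaults, and the `-emt` short form are assumptions. Only the flag names and the two temperature defaults of 0 come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented flags (not the real one)."""
    parser = argparse.ArgumentParser(prog="python -m src.main")
    parser.add_argument("--model", "-m", help="model used for the execution agent")
    parser.add_argument("--model-type", "-mt", help="provider of that model, e.g. openai")
    parser.add_argument("--explore-model", "-em", help="model used for exploration")
    parser.add_argument("--explore-model-type", "-emt", help="provider of the explore model")
    parser.add_argument("--execute-temp", "-et", type=float, default=0.0,
                        help="temperature for execution (default 0)")
    parser.add_argument("--explore-temp", "-ept", type=float, default=0.0,
                        help="temperature for exploration (default 0)")
    parser.add_argument("--explore-environment-iterations", "-eei", type=int,
                        help="number of exploration iterations to run")
    parser.add_argument("--max_iterations_per_episode", "-mipe", type=int,
                        help="number of iterations per query")
    parser.add_argument("--benchmark_path", "-bp",
                        help="benchmark .py file in src/benchmarks")
    parser.add_argument("--task", "-t", help="task to run, e.g. sports_data")
    return parser


if __name__ == "__main__":
    # With no command-line arguments this prints the defaults.
    print(vars(build_parser().parse_args()))
```

argparse replaces dashes in long option names with underscores, so `--explore-model` is read back as `args.explore_model`.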
Example: running gpt-4o-mini as the execution agent with gpt-4o as the exploration model:
python -m src.main -bp linux_terminal.py --task linux_terminal -mipe 10 -eei 4 -m gpt-4o-mini -em gpt-4o
To save time, the exploration phase is cached in caches/. The cache is specific to a task, the number of exploration iterations, and the model. You can clear a cache by deleting it. The repo comes pre-loaded with caches for our provided test tasks.
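Clearing a cache is just a file deletion, but a small guard against removing the wrong path can be useful. The sketch below uses a hypothetical helper, `clear_cache`, that is not part of this repo. The cache root and the entry name you pass in are assumptions about your local layout.

```python
import shutil
from pathlib import Path


def clear_cache(cache_root: str, entry: str) -> bool:
    """Delete one cache entry (file or directory) under cache_root.

    Returns True if something was deleted, False if nothing was there.
    """
    root = Path(cache_root).resolve()
    target = (root / entry).resolve()
    # Refuse to delete anything outside the cache root.
    if root not in target.parents:
        raise ValueError(f"{entry!r} is not inside {cache_root!r}")
    if target.is_dir():
        shutil.rmtree(target)
        return True
    if target.exists():
        target.unlink()
        return True
    return False
```

For example, `clear_cache("caches", "linux_terminal")` would remove that entry if it exists. The next run then repeats the exploration phase for it.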
Suppose we have a .py file containing our new function context. First, put it into src/benchmarks/ (see math_demo.py for an example). Then run:
python -m src.main -bp math_demo.py --task custom -mipe 0 -eei <num_exploration> -m gpt-4o -mt openai
Here, -mipe 0 indicates that we don't want to execute any testing tasks and only want to run prompt optimization. The output of the exploration can be found either in the generated log file (logs/) or as metadata in cache/, at cache/<function_file_name>/<explore_model_name>.json, under the key <num_exploration>.
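The cached exploration output can also be read programmatically. The sketch below follows the path layout described above. The helper name `load_exploration` is hypothetical, and so are two details: that the number of exploration iterations appears as a string key in the JSON, and whether the function file name includes the .py extension. The structure of the stored value is not documented here, so the function returns it unchanged.

```python
import json
from pathlib import Path


def load_exploration(cache_root, function_file_name, explore_model_name, num_iterations):
    """Return the cached exploration entry for the given iteration count."""
    path = Path(cache_root) / function_file_name / f"{explore_model_name}.json"
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    # JSON object keys are always strings, so convert the iteration count.
    return data[str(num_iterations)]
```

A missing file raises `FileNotFoundError` and a missing iteration count raises `KeyError`. Either one suggests the exploration has not been run with those settings.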
The commands we ran for this project are in commands.sh. They can be run with source commands.sh. Running all of them will probably take many days. See analysis.ipynb for generating the graphs in our paper. We also provide the outputs from our experiments in experiments/.
For our VLA experiments please see the vla_agents directory. For the SWE-agent experiments please see the swe-age directory.