Evaluation

The evaluation/ folder provides SWE-agent-compatible scripts for running SWE-bench style evaluation on model patch predictions. It also includes additional scripts that quantify model performance on "subtasks" within the SWE-bench task, such as identifying the right file(s) to edit.

📖 Table of Contents

  • 🐇 Quick Start
  • 🪑 SWE-bench Evaluation
  • 📈 Viewing Results

🐇 Quick Start

You can run evaluations on SWE-bench by passing in the predictions generated by SWE-agent (usually named all_preds.jsonl). Simply run the following script:

./run_eval.sh <path to predictions>

Depending on the number of task instances and how long it takes to set up the execution environment, evaluation can take anywhere from a couple of minutes to 7 hours for the entire SWE-bench test split.
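If your predictions come from a standard SWE-agent run, the predictions file typically sits inside the experiment's trajectory folder, so the call looks something like this (the path below is a placeholder pattern, not a literal path):

./run_eval.sh trajectories/<user>/<experiment>/all_preds.jsonl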

When evaluation finishes, you should see output similar to the following:

2024-03-31 16:47:00,263 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Installing with command: . /n/fs/p-swe-bench/testbed/ba397fe0d6/pvlib__pvlib-python/0.8/tmpom22t9na/miniconda3/bin/activate pvlib__pvlib-python__0.8 && echo 'activate successful' && pip install -e .[all]
2024-03-31 16:47:10,602 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Installation successful
2024-03-31 16:47:10,619 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Apply patch successful (test)
2024-03-31 16:47:10,635 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Apply patch successful (pred)
2024-03-31 16:47:13,453 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Test script run successful
==================================
Log directory for evaluation run: /n/fs/p-swe-bench/results/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4
== Evaluation Report ==
{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}
- Wrote per-instance scorecards to /<path to SWE-agent>/trajectories/carlosejimenez/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4/scorecards.json
- Wrote summary of run to /<path to SWE-agent>/trajectories/carlosejimenez/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4/results.json
Reference Report:
{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}

🪑 SWE-bench Evaluation

evaluation.py: This script contains the logic for SWE-bench evaluation, adapted to the SWE-agent setting. Given a set of predictions (e.g. trajectories/<user>/<experiment>/all_preds.jsonl), the script:

  1. Filters and analyzes the predictions.
  2. Runs SWE-bench style execution-based evaluation.
  3. Saves outcomes to results.json and scorecards.json, with information about task-specific and overall performance.

run_eval.sh is provided as an example of how to run evaluation.py.

Arguments:

  • --predictions_path (required): Path to the file containing predictions (.jsonl format). This file holds the predictions to be evaluated against the benchmark tasks.
  • --log_dir (required): Directory where log files generated during the evaluation are stored.
  • --swe_bench_tasks (required): Path to the file containing the SWE-bench task instances against which the predictions are evaluated.
  • --testbed (required): Directory path for the testbed, where repositories and execution environments are set up for evaluation.
  • --skip_existing (optional): If specified, the script skips task instances whose log files already exist, preventing re-evaluation.
  • --timeout (optional): Timeout in seconds for each evaluation task (default: 900). Keeps individual tasks from running excessively long.
  • --verbose (optional): Enables verbose mode for more detailed output during execution; useful for debugging.
  • --conda_link (optional): URL of a Conda installation to use for the evaluation environment; useful when the evaluation requires a specific software environment.
  • --log_suffix (optional): Suffix appended to log file names; useful for organizing logs when running multiple evaluations in parallel or under different configurations.
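For reference, a direct invocation of evaluation.py might look like the following sketch. It assumes the command is run from the evaluation/ folder; all paths are placeholders to replace with your own, and run_eval.sh remains the working example to follow.

python evaluation.py \
    --predictions_path trajectories/<user>/<experiment>/all_preds.jsonl \
    --log_dir <path to log directory> \
    --swe_bench_tasks <path to SWE-bench tasks file> \
    --testbed <path to testbed directory> \
    --skip_existing \
    --verbose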

📈 Viewing Results

aggregate_results.py: This script aggregates and displays experiment results from the trajectories/ folder.

  • Experiments are grouped by (Model, Dataset, Config File, Temp., Top P, Cost, Install).
  • The following statistics for each experiment run are shown:
    • Not Generated: # of task instances with no patch generated
    • Generated: # of task instances with patch
    • Applied: # of patches that applied successfully
    • Resolved: # of task instances resolved
    • Costs [Success|Failed|Overall]: Cost of [successful|failed|any] run
  • If there are multiple runs of an experiment (distinguished by --suffix run<i>), the above statistics are aggregated as totals or means.

Usage:

python aggregate_results.py

Arguments:

  • --folder (type: str, default: ../trajectories): Specifies the folder containing the experiment results. This is where the script will look to gather data.
  • --model (type: str, nargs: '+'): Filters the results by model(s). Only results corresponding to the specified model(s) will be included.
  • --dataset (type: str, nargs: '+'): Filters the results by dataset(s). Only results for the specified dataset(s) will be analyzed.
  • --setup (type: str, nargs: '+'): Filters the results by setup(s). This allows focusing on specific experiment configurations.
  • --runs_min (type: int): The minimum number of runs an experiment should have to be included in the analysis. Helps exclude experiments with insufficient data.
  • --runs_max (type: int): The maximum number of runs to consider for each experiment. This can limit the data to the most relevant runs.
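For example, to show only results for a particular model and dataset and require at least three runs per experiment, you might run something like the following (the model and dataset names are illustrative, taken from the sample output above; substitute your own):

python aggregate_results.py --model gpt-4-1106-preview --dataset swe-bench-dev-40-seed24 --runs_min 3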