The `evaluation/` folder provides SWE-agent compatible scripts for running SWE-bench style evaluation on model patch predictions. We also include additional scripts to quantify model performance on "subtasks" within the SWE-bench task, such as identifying the right file(s) to edit.
You can run evaluations on SWE-bench by passing in the predictions generated by SWE-agent (usually named `all_preds.jsonl`). Simply run the following script:
```
./run_eval.sh <path to predictions>
```
Depending on the number of task instances and how long setting up the execution environment takes, the evaluation can take anywhere from a couple of minutes up to 7 hours for the entirety of the SWE-bench test split.
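For example, to score the predictions from a specific SWE-agent run, you would point the script at that run's predictions file (the path below reflects the usual layout of the `trajectories/` folder and is illustrative; substitute your own user and experiment names):

```
./run_eval.sh trajectories/<user>/<experiment>/all_preds.jsonl
```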
When evaluation finishes, you should see an output similar to the following:
```
2024-03-31 16:47:00,263 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Installing with command: . /n/fs/p-swe-bench/testbed/ba397fe0d6/pvlib__pvlib-python/0.8/tmpom22t9na/miniconda3/bin/activate pvlib__pvlib-python__0.8 && echo 'activate successful' && pip install -e .[all]
2024-03-31 16:47:10,602 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Installation successful
2024-03-31 16:47:10,619 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Apply patch successful (test)
2024-03-31 16:47:10,635 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Apply patch successful (pred)
2024-03-31 16:47:13,453 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Test script run successful
==================================
Log directory for evaluation run: /n/fs/p-swe-bench/results/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4
== Evaluation Report ==
{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}
- Wrote per-instance scorecards to /<path to SWE-agent>/trajectories/carlosejimenez/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4/scorecards.json
- Wrote summary of run to /<path to SWE-agent>/trajectories/carlosejimenez/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4/results.json
Reference Report:
{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}
```
`evaluation.py`: This script contains the logic for SWE-bench evaluation adapted for the SWE-agent setting. Given a set of predictions (e.g. `trajectories/<user>/<experiment>/all_preds.jsonl`), we:
- Filter + analyze predictions.
- Run SWE-bench style execution-based evaluation.
- Save outcomes to `results.json` and `scorecards.json` files with info about task-specific and overall performance.
`run_eval.sh` is provided as an example of how to run `evaluation.py`.
Arguments:
- `--predictions_path` (required): The path to the file containing predictions (`.jsonl` format). This file includes the predictions to be evaluated against the benchmark tasks.
- `--log_dir` (required): The directory where log files generated during the evaluation will be stored.
- `--swe_bench_tasks` (required): The path to the file containing the SWE-bench task instances against which the predictions will be evaluated.
- `--testbed` (required): The directory path for the testbed, used for setting up the execution environment for the evaluations.
- `--skip_existing` (optional): If specified, the script will skip task instances whose log files already exist, preventing re-evaluation of those tasks.
- `--timeout` (optional): Timeout in seconds for each evaluation task (default: 900). This keeps individual evaluations from running excessively long.
- `--verbose` (optional): Enables verbose mode, which provides more detailed output during script execution. Useful for debugging or getting more insight into the process.
- `--conda_link` (optional): A URL to a Conda installation to use for the evaluation environment. This can be necessary if the evaluation requires a specific software environment.
- `--log_suffix` (optional): An additional suffix for log file names, useful for organizing logs when running multiple evaluations in parallel or under different configurations.
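Putting these arguments together, a typical invocation might look like the following sketch (all paths are placeholders; substitute your own predictions file, SWE-bench task file, log directory, and testbed directory):

```
python evaluation.py \
    --predictions_path trajectories/<user>/<experiment>/all_preds.jsonl \
    --log_dir <path to log directory> \
    --swe_bench_tasks <path to SWE-bench task instances file> \
    --testbed <path to testbed directory> \
    --skip_existing \
    --verbose
```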
`aggregate_results.py`: This script aggregates and displays experiment results from the `trajectories/` folder.
- Experiments are grouped by `(Model, Dataset, Config File, Temp., Top P, Cost, Install)`.
- The following statistics are shown for each experiment run:
  - `Not Generated`: # of task instances with no patch generated
  - `Generated`: # of task instances with a patch
  - `Applied`: # of patches that applied successfully
  - `Resolved`: # of task instances resolved
  - `Costs [Success|Failed|Overall]`: Cost of [successful|failed|any] runs
- If there are multiple runs of an experiment (distinguished by `--suffix run<i>`), the above statistics are aggregated as totals or means.
Usage:
```
python aggregate_results.py
```
Arguments:
- `--folder` (type: `str`, default: `../trajectories`): Specifies the folder containing the experiment results. This is where the script will look to gather data.
- `--model` (type: `str`, nargs: `+`): Filters the results by model(s). Only results corresponding to the specified model(s) will be included.
- `--dataset` (type: `str`, nargs: `+`): Filters the results by dataset(s). Only results for the specified dataset(s) will be analyzed.
- `--setup` (type: `str`, nargs: `+`): Filters the results by setup(s). This allows focusing on specific experiment configurations.
- `--runs_min` (type: `int`): The minimum number of runs an experiment should have to be included in the analysis. Helps exclude experiments with insufficient data.
- `--runs_max` (type: `int`): The maximum number of runs to consider for each experiment. This can limit the data to the most relevant runs.
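For example, the following sketch aggregates results for a single model and only keeps experiments with at least three runs (the model name is taken from the example output above and the folder is the default; adjust both to your own setup):

```
python aggregate_results.py \
    --folder ../trajectories \
    --model gpt-4-1106-preview \
    --runs_min 3
```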