Adding SWE-bench #268

Merged · 459 commits · Sep 12, 2024

Commits
29b8aff
DROP benchmark (#235)
xeon27 Aug 14, 2024
8997eb7
mypy cleanup for drop benchmark
jjallaire Aug 14, 2024
ecc0a7e
WinoGrande benchmark (#236)
xeon27 Aug 14, 2024
245eb90
Update CHANGELOG.md
jjallaire-aisi Aug 14, 2024
fcc647e
Log Reduced Scores (#245)
dragonstyle Aug 15, 2024
effbea5
Update CHANGELOG.md
aisi-inspect Aug 15, 2024
a3b8018
vscode: correct env var for model base url
jjallaire Aug 15, 2024
a185819
Update inspect-constants.ts
hanbyul-kim Aug 15, 2024
43d55ff
default image logging to false
jjallaire Aug 15, 2024
fc542b2
reformat attribs
aisi-inspect Aug 15, 2024
f48f6fd
Update _model_output.py (#248)
hanbyul-kim Aug 15, 2024
de4c1ef
Merge branch 'main' into patch-1
dragonstyle Aug 15, 2024
d82ce3e
Merge pull request #247 from hanbyul-kim/patch-1
dragonstyle Aug 15, 2024
12c6511
Don’t include override css in bundled version (#249)
dragonstyle Aug 15, 2024
86f2c04
task panel: don't print model, remove unused config arg (#250)
jjallaire Aug 15, 2024
5fc8c7c
Update CHANGELOG.md
aisi-inspect Aug 16, 2024
2227099
update changelog
aisi-inspect Aug 16, 2024
a023097
[Fixed]First Draft: Enable Tool Calling for Llama 3.1 Models in Bedro…
thickett Aug 16, 2024
95e95f2
Update CHANGELOG.md
jjallaire-aisi Aug 16, 2024
c91be38
use strict mode for openai tool calls (#225)
jjallaire Aug 16, 2024
b4b6e7c
report json schema validation errors to model in tool response.
jjallaire Aug 18, 2024
9dfe852
fixed small type (#254)
weissercn Aug 18, 2024
f0d99b6
update changelog
aisi-inspect Aug 18, 2024
06e41bf
update comment to not reference eval_gather
aisi-inspect Aug 19, 2024
be1f70c
adding swebench
max-kaufmann Aug 19, 2024
61d4feb
intermediate changing
max-kaufmann Aug 19, 2024
11e28ed
moving a name
max-kaufmann Aug 19, 2024
e1f591f
docker init for faster shutdown (#257)
jjallaire-aisi Aug 20, 2024
adef9f6
Low Level Solver API (#82)
jjallaire Aug 20, 2024
e824d8a
fix broken anchor (#258)
art-dsit Aug 20, 2024
8025f79
update changelog
jjallaire Aug 20, 2024
e119aca
remove file bodies from sample init event (#259)
jjallaire Aug 20, 2024
8be37ed
Fix layout issue in event panel
dragonstyle Aug 21, 2024
a5440b4
Forward vscode diff colors
dragonstyle Aug 21, 2024
0d4e116
properly format usage numbers
dragonstyle Aug 20, 2024
b52ef69
Remove indenting of content in event panels
dragonstyle Aug 20, 2024
9725e22
Merge pull request #265 from UKGovernmentBEIS/bugfix/transcript-tweaks
dragonstyle Aug 21, 2024
85337e8
adding the building of docker images
max-kaufmann Aug 21, 2024
bce0ecc
adding a requirements.txt
max-kaufmann Aug 21, 2024
90ab02f
moving dataset
max-kaufmann Aug 21, 2024
ab79ce1
fixing edit to inspect
max-kaufmann Aug 21, 2024
e0cbbb9
Merge remote-tracking branch 'origin/main' into max/swe_bench
max-kaufmann Aug 21, 2024
2d5343b
making env exec better
max-kaufmann Aug 21, 2024
6e5acdb
making setup script betteR
max-kaufmann Aug 21, 2024
c707df7
testbed
max-kaufmann Aug 21, 2024
bd4f06c
rename and move
max-kaufmann Aug 22, 2024
e1f584b
fix bug that always sets temperature to 0.02 or lower for vllm (#266)
tadamcz Aug 22, 2024
7bf1de7
moving stuff into tests
max-kaufmann Aug 22, 2024
7115fef
refactor into a helper function
max-kaufmann Aug 22, 2024
8b1a3ab
simplify test
max-kaufmann Aug 22, 2024
c8bae34
adding tests
max-kaufmann Aug 22, 2024
5204682
add skipping of matplotlib
max-kaufmann Aug 22, 2024
6f57cc1
fix amazon link
aisi-inspect Aug 23, 2024
5b8e0e1
[draft] RACE-H Benchmark Implementation | ASET - Arcadia Impact (#263)
mdrpanwar Aug 23, 2024
c2bbfb7
add benchmark to readme
aisi-inspect Aug 23, 2024
e2717d8
Add function field to ChatMessageTool (#269)
jjallaire Aug 23, 2024
1719eab
anthropic prompt caching (#262)
jjallaire Aug 23, 2024
c18bfc7
set sample error tolerance via fail_on_error option (#261)
jjallaire Aug 23, 2024
16cc30f
Update CHANGELOG.md
jjallaire-aisi Aug 23, 2024
add5c60
fix changelog link
aisi-inspect Aug 23, 2024
80cddb8
cache pip dependencies (#273)
jjallaire-aisi Aug 23, 2024
45278a4
lighten dev dependencies (#274)
jjallaire Aug 23, 2024
b743853
only do caching for anthropic models that support it
aisi-inspect Aug 23, 2024
9909f4f
Fix error when usage is not available.
dragonstyle Aug 24, 2024
c3d69e2
Merge pull request #275 from UKGovernmentBEIS/bugfix/usage-error
dragonstyle Aug 24, 2024
e4aad3f
allow any for tool arguments
aisi-inspect Aug 24, 2024
6de12cc
Merge branch 'main' of github.com:UKGovernmentBEIS/inspect_ai
aisi-inspect Aug 24, 2024
0a648d9
Event View Fixes (#277)
dragonstyle Aug 25, 2024
715e91f
update changelog and docs for release
aisi-inspect Aug 25, 2024
353986c
input_screen() context manager for user input (#276)
jjallaire-aisi Aug 25, 2024
9158051
allow sandboxes to specified at the sample level (#278)
jjallaire-aisi Aug 25, 2024
858b417
Add `user` param to `SandboxEnvironment.exec` (#280)
rusheb Aug 26, 2024
c7a8018
Improve onboarding for new devs (#281)
rusheb Aug 26, 2024
c2f12d1
add 'transient` option for input_screen() (#284)
jjallaire-aisi Aug 26, 2024
54460a3
Add api pytest mark and --runapi flag (#285)
rusheb Aug 26, 2024
affc2b3
pytest --runapi: add groq/cloudflare + changelog entry
jjallaire Aug 26, 2024
766dfa6
remove fewshots from intercode ctf system prompt
jjallaire Aug 26, 2024
4c44c9c
add xstest benchmark
NelsonG-C Aug 26, 2024
601a130
serialize plan when its passed as a task arg (#287)
jjallaire Aug 26, 2024
c12be77
Add MMMU folder
shaheenahmedc Aug 23, 2024
e718d82
Remove notes
shaheenahmedc Aug 23, 2024
95ef175
update README and formatting
shaheenahmedc Aug 26, 2024
f924e40
remove equation prompt and add comments
shaheenahmedc Aug 27, 2024
0ed28bd
update README
shaheenahmedc Aug 27, 2024
26b84c5
CommonsenseQA Benchmark Implementation | ASET - Arcadia Impact (#279)
lauritowal Aug 27, 2024
26814e6
Update CHANGELOG.md
jjallaire-aisi Aug 27, 2024
b0b1f67
Update CHANGELOG.md
jjallaire-aisi Aug 27, 2024
780c209
Update CHANGELOG.md
jjallaire-aisi Aug 27, 2024
55ed64d
Add tests for local and docker sandbox, fix cwd (#293)
art-dsit Aug 27, 2024
1e2046d
discover compose.yaml for filesystem tasks w/ chdir=True (#294)
jjallaire-aisi Aug 27, 2024
6351821
Update CHANGELOG.md
jjallaire-aisi Aug 27, 2024
26e2247
provide default for store.get overload
jjallaire Aug 27, 2024
c2c1411
Merge branch 'main' of github.com:UKGovernmentBEIS/inspect_ai
jjallaire Aug 27, 2024
dfa2111
task run dir switching
aisi-inspect Aug 27, 2024
ee3b7e4
do not process `.env` files in task directories
aisi-inspect Aug 28, 2024
5ba5143
Merge pull request #295 from UKGovernmentBEIS/feature/task-chdir
jjallaire Aug 28, 2024
d6018a4
check for __name__ on callables
aisi-inspect Aug 29, 2024
e13939c
use /tmp as working dir for intercode
aisi-inspect Aug 29, 2024
9aa8d86
Refactor MATH benchmark (#298)
xeon27 Aug 29, 2024
ed32d5c
Add MMLU-Pro benchmark (#297)
xeon27 Aug 29, 2024
08f1076
Update README.md
jjallaire Aug 29, 2024
8620665
Update README.md
jjallaire Aug 29, 2024
e04a357
Update CHANGELOG.md
jjallaire Aug 29, 2024
fd782bb
preserve types of task params
aisi-inspect Aug 29, 2024
442f5a9
str for unknown param types
jjallaire Aug 29, 2024
70d514b
Merge branch 'UKGovernmentBEIS:main' into main
NelsonG-C Aug 30, 2024
ce3fabb
fix README description
NelsonG-C Aug 30, 2024
d6b1b04
add scorer model parameter
NelsonG-C Aug 30, 2024
ffa9a15
use scorer wrapper to remove unused metrics, update scorer template t…
NelsonG-C Aug 30, 2024
c019f1b
Add test and fix incorrect assumption that project has a single worki…
art-dsit Aug 30, 2024
ad3e515
Bump ruff from 0.6.2 to 0.6.3 in the python-packages group (#304)
dependabot[bot] Aug 30, 2024
ed189f0
refactor MMMU code and update READMEs
shaheenahmedc Sep 2, 2024
6c313f0
Merge remote-tracking branch 'upstream/main' into MMMU
shaheenahmedc Sep 2, 2024
ce475eb
add question_ to temporary image file path
shaheenahmedc Sep 2, 2024
6213626
fix scorer template prompt
NelsonG-C Sep 2, 2024
125976a
Remove TODO item from README
shaheenahmedc Sep 2, 2024
c91aa2a
Add metadata to multiple-choice questions
shaheenahmedc Sep 2, 2024
b650d5c
eval_set function and CLI command for running sets of evaluations w/ …
jjallaire Sep 2, 2024
fc94050
Test tweaks (#307)
art-dsit Sep 2, 2024
db7a573
Roll back usage of `strict` mode for OpenAI tool calls (#309)
jjallaire-aisi Sep 2, 2024
a33094c
remove base64 images from model api logging (#313)
jjallaire Sep 3, 2024
bc13560
Update CHANGELOG.md
jjallaire Sep 3, 2024
94b7366
update comment instructions for running subsets
NelsonG-C Sep 3, 2024
2bf2cc4
Make documentation consistent, and raise PermissionError for exec (#308)
art-dsit Sep 3, 2024
e1ad97f
indicate that base64 data has been removed from log
aisi-inspect Sep 3, 2024
aee43a0
make eval_set error prereq errors
aisi-inspect Sep 3, 2024
95b4fa6
Task `metrics` now override built in scorer metrics
aisi-inspect Sep 3, 2024
16f9b2b
prevent duplicates in eval-set scheduler (#317)
jjallaire-aisi Sep 3, 2024
32c6325
`write_log_dir_manifest()` to write log header manifests (#316)
jjallaire-aisi Sep 3, 2024
dea5a97
together lobprobs test: use valid serverless model
jjallaire Sep 3, 2024
7310f75
inspect list logs: remove --retryable and make --no-recursive work
aisi-inspect Sep 3, 2024
1b05ff6
remove unused import
aisi-inspect Sep 3, 2024
4636b6d
update together docs
jjallaire Sep 3, 2024
7828b6d
support recursive log dir manifest listings (#320)
jjallaire-aisi Sep 3, 2024
c98bec6
don't resolve 'model' in task_args
aisi-inspect Sep 3, 2024
c582380
initial commit
ShivMunagala Sep 3, 2024
7267da9
improved handling of parallel_tool_calls
aisi-inspect Sep 3, 2024
41feb2f
Merge branch 'main' into mathvista_implementation
ShivMunagala Sep 3, 2024
0d56f3b
update schema with is_a_directory error type
aisi-inspect Sep 3, 2024
fcad306
only enable `strict` mode for OpenAI tool calls when all function par…
aisi-inspect Sep 3, 2024
7d880bd
Merge pull request #321 from UKGovernmentBEIS/feature/strict-tools-op…
jjallaire-aisi Sep 3, 2024
25ebd5a
Merge branch 'main' into mathvista_implementation
ShivMunagala Sep 3, 2024
fe8093b
compress image and rename utils
shaheenahmedc Sep 4, 2024
f7b24cf
Merge branch 'main' into MMMU
shaheenahmedc Sep 4, 2024
400d2ce
remove image tags from question and fix utils import
shaheenahmedc Sep 4, 2024
d70916e
fix basename test
aisi-inspect Sep 4, 2024
9a4eeff
move @task functions above helper functions
shaheenahmedc Sep 4, 2024
d7af780
Merge branch 'main' into MMMU
jjallaire-aisi Sep 4, 2024
3e251ee
Merge pull request #292 from shaheenahmedc/MMMU
jjallaire-aisi Sep 4, 2024
6998906
Update CHANGELOG.md
jjallaire-aisi Sep 4, 2024
ef32862
Update README.md
jjallaire-aisi Sep 4, 2024
58ef51e
reformat comments, refactor code order in main file
NelsonG-C Sep 4, 2024
17a4b6f
Move vendored inspect view packages into bundle (#305)
seddy-aisi Sep 4, 2024
414aea6
Improve loading performance (#286)
dragonstyle Sep 4, 2024
7e1dbda
add doc comment to eval
jjallaire Sep 4, 2024
13427b0
Merge branch 'main' into feature/override-metrics
dragonstyle Sep 4, 2024
cc32149
Merge pull request #314 from UKGovernmentBEIS/feature/override-metrics
jjallaire-aisi Sep 4, 2024
666291f
improved docs on multiple choice (#323)
jjallaire Sep 4, 2024
b84c089
Display a message index in messages tab (#325)
dragonstyle Sep 4, 2024
f462e1b
Add proposed `exact` and `f1` scorers. (#306)
dragonstyle Sep 4, 2024
3439c34
don't show parallel_tool_calls when enabled
jjallaire Sep 4, 2024
d22d2be
properly process the header only parameter (#326)
dragonstyle Sep 4, 2024
854f92c
Default header-only to None (#327)
dragonstyle Sep 5, 2024
a924caf
relocate store, subtask, and transcript modules (#328)
jjallaire Sep 5, 2024
faa50b3
Merge branch 'main' into main
NelsonG-C Sep 5, 2024
be8cbda
remove custom scorer and inline custom metric
NelsonG-C Sep 5, 2024
435a608
Merge pull request #288 from NelsonG-C/main
jjallaire-aisi Sep 5, 2024
50a40d9
add xstest to readme and changelog
aisi-inspect Sep 5, 2024
adbc8f3
Improve Task Complete Workflow
dragonstyle Sep 5, 2024
5a55d5f
Rename logViewAuto setting to openLogView
dragonstyle Sep 5, 2024
9a63dd1
Update the vscode version
dragonstyle Sep 5, 2024
5e7b1d2
setup file
max-kaufmann Sep 5, 2024
857c263
Don’t import readline on Windows
dragonstyle Sep 5, 2024
258e0b1
Merge pull request #329 from UKGovernmentBEIS/bugfix/readline
jjallaire-aisi Sep 5, 2024
ad5a4cd
Merge branch 'main' into feature/vscode-330
jjallaire-aisi Sep 5, 2024
241b76f
Merge pull request #330 from UKGovernmentBEIS/feature/vscode-330
jjallaire-aisi Sep 5, 2024
ac9973e
fixing environment length
max-kaufmann Sep 5, 2024
15c50c3
fixing up some repo stuff
max-kaufmann Sep 5, 2024
0af52d4
Merge remote-tracking branch 'origin/main' into max/swe_bench
max-kaufmann Sep 5, 2024
6ff5fb8
only populate docker compose config metadata values when they are use…
jjallaire Sep 5, 2024
d977edc
Merge branch 'UKGovernmentBEIS:main' into mathvista_implementation
ShivMunagala Sep 5, 2024
09b589d
Add mathvista to benchmarks readme
ShivMunagala Sep 5, 2024
3adae4d
Move scorer and solver under task
ShivMunagala Sep 5, 2024
0ac3057
Add example.png to `mathvista` directory
ShivMunagala Sep 5, 2024
5f3b6c2
Delete benchmarks/mathvista/resources directory
ShivMunagala Sep 5, 2024
b597ffd
Edit image link in readme to point to example.png
ShivMunagala Sep 5, 2024
ff35f22
Edit image link to point to example.png instead of example image.png
ShivMunagala Sep 5, 2024
2c53f17
Reword `FREEFORM_TEMPLATE` to remove superfluous word
ShivMunagala Sep 5, 2024
660a2fa
Rename solver and scorer to `mathvista_*`
ShivMunagala Sep 5, 2024
b47e9eb
Raise `ValueError` if solver gets an invalid question_type
ShivMunagala Sep 5, 2024
d9a7329
Add comment clarifying branch is for scoring pattern found
ShivMunagala Sep 5, 2024
295edf8
update changelog for release
aisi-inspect Sep 6, 2024
bb356b7
Merge branch 'main' into mathvista_implementation
jjallaire-aisi Sep 6, 2024
974a8ad
cleanup formatting and typing
aisi-inspect Sep 6, 2024
62e175c
remove some internal imports
aisi-inspect Sep 6, 2024
6fab9af
Merge branch 'ShivMunagala-mathvista_implementation'
aisi-inspect Sep 6, 2024
bf43aef
update changelog
aisi-inspect Sep 6, 2024
3c23369
moving to create_test_repos
max-kaufmann Sep 6, 2024
b6b0ee1
Merge remote-tracking branch 'origin/main' into max/swe_bench
max-kaufmann Sep 6, 2024
7daafa8
moving back metadata
max-kaufmann Sep 6, 2024
31d7a3f
updating test generator
max-kaufmann Sep 9, 2024
2b9870e
adding trajectory
max-kaufmann Sep 9, 2024
7d4e118
awkwardeness with strs and Paths
max-kaufmann Sep 9, 2024
1d5bcc2
build images
max-kaufmann Sep 9, 2024
5aa90c5
adding a README
max-kaufmann Sep 9, 2024
49491e2
fix name
max-kaufmann Sep 9, 2024
4ce80f4
adding the name of the plan at the end
max-kaufmann Sep 9, 2024
0038eac
linting
max-kaufmann Sep 9, 2024
fb64455
adding tests to git
max-kaufmann Sep 9, 2024
92fa7e9
moving around where the creation of the thing is
max-kaufmann Sep 9, 2024
65bd785
merge conflicts (werid)
max-kaufmann Sep 9, 2024
59b8196
rm one sample
max-kaufmann Sep 9, 2024
8385df1
git ignore
max-kaufmann Sep 9, 2024
1b7cc92
editing the benchmark creation
max-kaufmann Sep 9, 2024
8a5513a
fixing setup file
max-kaufmann Sep 10, 2024
32d6bea
Update settings.json
max-kaufmann Sep 10, 2024
991676e
Merge remote-tracking branch 'refs/remotes/origin/max/swe_bench' into…
max-kaufmann Sep 10, 2024
9ad0bdd
Merge remote-tracking branch 'origin/main' into max/swe_bench
max-kaufmann Sep 10, 2024
d263183
git ignore
max-kaufmann Sep 10, 2024
254e8ce
some edits
max-kaufmann Sep 10, 2024
57ce18f
did a delete
max-kaufmann Sep 10, 2024
9d97502
Merge remote-tracking branch 'origin/main' into max/swe_bench
max-kaufmann Sep 11, 2024
85146fe
add agent name
max-kaufmann Sep 11, 2024
b7a2334
changed README a bit to actually work
max-kaufmann Sep 11, 2024
e06c393
indenting
max-kaufmann Sep 11, 2024
4136f63
reverting merge changes
max-kaufmann Sep 11, 2024
fe97353
reverting docs
max-kaufmann Sep 11, 2024
e6c1e29
adding the trajectories
max-kaufmann Sep 11, 2024
1efd15e
gitignore
max-kaufmann Sep 11, 2024
098aa34
ruff
max-kaufmann Sep 11, 2024
4a48422
ruff
max-kaufmann Sep 11, 2024
19d5632
delete baselines
max-kaufmann Sep 11, 2024
5c91048
delete bottom bit of PR
max-kaufmann Sep 11, 2024
c4efa66
dataset single instance
max-kaufmann Sep 11, 2024
420df88
run swe-bench
max-kaufmann Sep 11, 2024
89488e6
change README
max-kaufmann Sep 12, 2024
372ad2a
fixing json
max-kaufmann Sep 12, 2024
fba66c0
some edits
max-kaufmann Sep 12, 2024
409de1a
Merge remote-tracking branch 'origin/main' into max/swe_bench
max-kaufmann Sep 12, 2024
86ffe2e
repo specific scikit thing
max-kaufmann Sep 12, 2024
e02a2e0
add baselines to gitignore
max-kaufmann Sep 12, 2024
65e4a89
fixing up the edit
max-kaufmann Sep 12, 2024
d332ce2
removing ERROR
max-kaufmann Sep 12, 2024
6cc1517
create_test_repos
max-kaufmann Sep 12, 2024
f4f7cb9
ruff
max-kaufmann Sep 12, 2024
dbb00a3
ruff
max-kaufmann Sep 12, 2024
953eb1a
vscode revert
max-kaufmann Sep 12, 2024
f829289
create test repos
max-kaufmann Sep 12, 2024
1c61bfe
Merge remote-tracking branch 'origin/main' into max/swe_bench
max-kaufmann Sep 12, 2024
d771f45
requests
max-kaufmann Sep 12, 2024
5 changes: 3 additions & 2 deletions .vscode/settings.json
@@ -17,9 +17,10 @@
     "logs/**": true
   },
   "python.testing.pytestArgs": [
-    "tests"
+    "."
   ],
   "python.testing.unittestEnabled": false,
   "python.testing.pytestEnabled": true,
-  "quarto.render.renderOnSave": true
+  "quarto.render.renderOnSave": true,
+  "python.testing.pytestPath": ""
 }
1 change: 1 addition & 0 deletions benchmarks/swe_bench/.gitignore
@@ -0,0 +1 @@
resources/compose_files/*
68 changes: 68 additions & 0 deletions benchmarks/swe_bench/README.md
@@ -0,0 +1,68 @@
# SWE-bench
This is an inspect-native implementation of [the SWE-bench dataset](https://www.swebench.com/).

## Installation

- **Install requirements.** As well as the requirements provided by inspect, this benchmark has its own dependencies. You can install them in your virtual environment by running ```pip install -r requirements.txt```
- **Build environment images.** SWE-bench requires a set of images with all the dependencies of each repository. To build them for the SWE-bench-verified split, run ```./build_images.py --dataset_path princeton-nlp/SWE-bench_Verified --split test```. See ```./build_images.py --help``` for all arguments. NOTE: Depending on the repositories contained in your subset of SWE-bench, this can take a while to run (several hours).
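
If you prefer Python over the CLI, the `build_images` function that the script wraps can be called directly. A minimal sketch, assuming you run it from the benchmark directory so that `build_images.py` is importable:

```python
# Sketch: programmatic equivalent of the build_images.py CLI shown above.
from build_images import build_images

build_images(
    dataset_name="princeton-nlp/SWE-bench_Verified",
    split="test",
    max_workers=4,  # number of parallel image builds
)
```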

## Usage


### Running the benchmark
The ```swe_bench.py``` file contains the ```swe_bench``` function, which creates an instance of a SWE-bench Task:

```python
from inspect_ai import eval
from inspect_ai.solver import basic_agent, system_message
from inspect_ai.tool import bash, python
from swe_bench import swe_bench

# Create an agent that only uses bash, and an agent that only uses python.
agents = [
    basic_agent(
        init=system_message(
            "Please solve the coding task below. Once you are done, use your submit tool."
        ),
        tools=[tool],
    )
    for tool in [bash(), python()]
]

# Create a swe-bench task for each of those agents.
swebench_tasks = [
    swe_bench(dataset="princeton-nlp/SWE-bench_Verified", split="test", plan=agent)
    for agent in agents
]

# For the demonstration, select only the first 5 samples of each task.
for task in swebench_tasks:
    task.dataset = task.dataset[:5]

# Compare how these two agents perform.
eval(swebench_tasks, model="openai/gpt-4o")
```

NOTE: SWE-bench will take a while to run, and uses a lot of tokens. If things are too slow, you should increase the level of parallelism - see https://inspect.ai-safety-institute.org.uk/parallelism.html. Note that running too many docker containers on your machine can also cause issues, most notably with running out of open ports - we don't recommend running more than 32 containers at any one time.
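
For example, here is a minimal sketch of raising the concurrency limit; it assumes the `max_connections` generation option described in the parallelism docs linked above:

```python
from inspect_ai import eval
from swe_bench import swe_bench

# Sketch: assumes max_connections (Inspect's concurrent model-connection limit).
# Keep the number of simultaneously running docker containers below ~32.
task = swe_bench(dataset="princeton-nlp/SWE-bench_Verified", split="test")
eval(task, model="openai/gpt-4o", max_connections=16)
```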

### Comparing to baselines
Submissions to [the official SWE-bench leaderboard](https://www.swebench.com/) often come with a set of trajectories and per-swebench-instance results. This means that you can compare the performance of your agent not just on the full dataset, but also on subsets of that dataset. This is often useful when trying to run smaller, cheaper experiments. To make this comparison easy, we provide a scorer that will calculate the average performance of an existing baseline on that set of trajectories:

```python
from inspect_ai import eval
from swe_bench import swe_bench, swebench_scorer, swebench_baseline_scorer

# Select only the samples where the patch is more than 20 lines long, using the default plan.
swebench_verified_subset = swe_bench(
    dataset="princeton-nlp/SWE-bench_Verified",
    split="test",
    filter=lambda x: len(x["patch"].split("\n")) > 20,
)

# Report the Claude 3.5 Sonnet + SWE-agent baseline, as well as the performance of the agent.
# To download the baseline, run ./resources/baselines/download_swebench_baseline.sh
swebench_verified_subset.scorer = [
    swebench_baseline_scorer(
        "./resources/baselines/swebench_verified_20240620_sweagent_claude3.5sonnet",
        name="sweagent_baseline",
    ),
    swebench_scorer(),
]

eval(swebench_verified_subset)
```

This will lead to both numbers being reported in the final output, allowing you to compare baselines:

![SWE-bench baseline comparison](./resources/docs/swebench_comparison.jpeg)
111 changes: 111 additions & 0 deletions benchmarks/swe_bench/build_images.py
@@ -0,0 +1,111 @@
"""This is a utility script which installs all of the environment images for the SWE-bench dataset. These images contain all of the dependencies required to run the tests in the dataset."""

import argparse
import json
import os

from datasets import load_dataset
from docker.client import DockerClient
from swe_bench import SAMPLE_TO_IMAGE_PATH
from swebench.harness.docker_build import build_env_images
from swebench.harness.test_spec import get_test_specs_from_dataset
from swebench.harness.utils import load_swebench_dataset


def build_images(
dataset_name: str,
split: str | None = None,
max_workers: int = 4,
force_rebuild=False,
) -> None:
"""This function uses the swe_bench library to build docker images for the environment of the SWE-bench dataset. It also creates a mapping from the information contained in the dataset itself ( in particular, the "instance_id" and env_image_key) to the name of the docker images. This mapping lets us find the images directly from the dataset entries, rather than relying on objects created in the swe_bench code."""
# Code copied from the swe_bench repository
docker_client = DockerClient.from_env()

swebench_dataset = load_swebench_dataset(dataset_name, split)
if not isinstance(swebench_dataset[0], dict):
raise ValueError(
f"After loading the dataset, the elements should be a dictonary. Got {type(swebench_dataset[0])} instead. Did you pass the correct dataset name and split? \n\nOutput of huggingface load_dataset: {load_dataset(dataset_name, split)}"
)

# We then use a couple of internal functions from swebench to build the environment images
test_specs = get_test_specs_from_dataset(swebench_dataset)
build_env_images(docker_client, swebench_dataset, force_rebuild, max_workers)

# We build a mapping of insance_ids and envirnoment_commits to the name of the docker images. This is used to find the images in our main swe_bench code.
environment_name_mapping = (
{}
if not os.path.exists(SAMPLE_TO_IMAGE_PATH)
else json.load(open(SAMPLE_TO_IMAGE_PATH))
)

for spec, swebench_instance in zip(test_specs, swebench_dataset):
assert spec.env_image_key in [
image_tag
for image in docker_client.images.list()
for image_tag in image.tags
], f"Image {spec.env_image_key} not found in docker images"
# Check if entry already exists
if swebench_instance["instance_id"] not in environment_name_mapping:
environment_name_mapping[swebench_instance["instance_id"]] = {}
if (
swebench_instance["environment_setup_commit"]
in environment_name_mapping[swebench_instance["instance_id"]]
):
assert (
environment_name_mapping[swebench_instance["instance_id"]][
swebench_instance["environment_setup_commit"]
]["image_key"]
== spec.env_image_key
), f"Image {spec.env_image_key} already mapped to a different image"
else:
environment_name_mapping[swebench_instance["instance_id"]][
swebench_instance["environment_setup_commit"]
] = {
"image_key": spec.env_image_key,
"env_setup_script": "\n".join(spec.env_script_list),
}

# Add the mappings to a file
json.dump(environment_name_mapping, open(SAMPLE_TO_IMAGE_PATH, "w+"), indent=4)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Build environment images for SWE-bench"
    )

    parser.add_argument(
        "--dataset_path",
        type=str,
        help="Name of the dataset to build images for",
        required=False,
        default="princeton-nlp/SWE-bench_Verified",
    )

    parser.add_argument(
        "--split",
        type=str,
        help="Split of the dataset to build images for",
        required=False,
        default="test",
    )

    parser.add_argument(
        "--max_workers",
        type=int,
        help="Maximum number of workers to use for building images",
        required=False,
        default=4,
    )

    parser.add_argument(
        "--force_rebuild",
        action="store_true",
        help="Force rebuild of images",
        required=False,
        default=False,
    )

    args = parser.parse_args()
    build_images(args.dataset_path, args.split, args.max_workers, args.force_rebuild)
2 changes: 2 additions & 0 deletions benchmarks/swe_bench/requirements.txt
@@ -0,0 +1,2 @@
swebench
docker
@@ -0,0 +1,5 @@
# Download the swe-bench baselines. NOTE: THEY ARE QUITE LARGE.
git clone https://github.com/swe-bench/experiments /tmp/swebench_baselines
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
# Copy to current directory
cp -r /tmp/swebench_baselines/evaluation/verified/20240620_sweagent_claude3.5sonnet $SCRIPT_DIR
96 changes: 96 additions & 0 deletions benchmarks/swe_bench/resources/tests/create_test_repos.py
@@ -0,0 +1,96 @@
import json
import os
from pathlib import Path

import pandas as pd
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Verified")["test"]

# We create a subset of SWE-bench-verified which:
# 1) Contains all of the repositories in SWE-bench verified
# 2) For each repository, contains an example where swe_agent + Claude 3.5 Sonnet resolved the issue, and an example where it did not

results_per_repo = {}
baseline_dir = Path(__file__).parent.parent / "baselines"
logs_dir = baseline_dir / "20240620_sweagent_claude3.5sonnet" / "logs"

if not logs_dir.exists():
    print(
        f"Please run the baseline download script at {baseline_dir / 'download_sweagent_baselines.sh'} to make the baselines to compare your agents against."
    )
    exit()


results = []
missing_results = []

# Load results from the log directory
for result in os.listdir(logs_dir):
    results_path = os.path.join(logs_dir, result, "report.json")
    if os.path.exists(results_path):
        with open(results_path, "r") as f:
            result_dict = json.load(f)
        result_name, results_value = next(iter(result_dict.items()))
        output_dict = dict(instance_id=result_name, **results_value)
        patch_path = os.path.join(logs_dir, result, "patch.diff")
        with open(patch_path, "r") as f:
            output_dict["swe_agent_patch"] = f.read()
        results.append(output_dict)
    else:
        missing_results.append(result)

# Get repository name from the results
results = pd.DataFrame.from_records(results)
results["repo"] = results["instance_id"].apply(lambda x: x.split("__")[0])

# Github patches which change binary files cannot actually be applied. We will remove these entries from the dataset.
results = results[~results["swe_agent_patch"].str.contains("Binary files")]

# Group by repository, and success. Then pick one from each group.
results_per_repo = results.groupby(["repo", "resolved"])
results_per_repo = results_per_repo.apply(lambda x: x.sample(1)).reset_index(drop=True)


# Filter dataset by those instance ids, and add a "resolved_by_swe_agent" column.
instance_ids = results_per_repo["instance_id"].values
resolved = results_per_repo["resolved"].values

dataset = dataset.filter(lambda x: x["instance_id"] in instance_ids)
dataset = dataset.map(
    lambda x: dict(
        x, resolved_by_swe_agent=resolved[instance_ids == x["instance_id"]][0]
    ),
    num_proc=4,
)
# Add swe-agent-patch
dataset = dataset.map(
    lambda x: dict(
        x,
        swe_agent_patch=results_per_repo[
            results_per_repo["instance_id"] == x["instance_id"]
        ]["swe_agent_patch"].values[0],
    ),
    num_proc=4,
)
# Add resolved column
dataset = dataset.map(
    lambda x: dict(x, resolved=resolved[instance_ids == x["instance_id"]][0]),
    num_proc=4,
)

# This repo is bugged for testing, as the setup script edits files, breaking the patch from swe-agent.
dataset = dataset.filter(lambda x: "sphinx" not in x["instance_id"])

# Calculate the accuracy. Should be 0.42857142857142855
accuracy = sum(resolved) / len(resolved)

# Save the dataset
dataset_dir = Path(__file__).parent / "all_repos_swe_agent_50_percent.hf"
os.makedirs(str(dataset_dir), exist_ok=True)
dataset.to_parquet(dataset_dir / "dataset.parquet")

print(f"Saved dataset to {dataset_dir}, accuracy {accuracy}")