Adding SWE-bench #268

max-kaufmann · 2024-08-22T00:19:00Z

This PR contains:

What is the current behavior? (You can also link to an open issue here)

This is a draft PR that adds SWE-bench under benchmarks/. The implementation is in benchmarks/swe_bench/swebench.py. Start by reading the README.

What is the new behavior?

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

Other information:

* Add DROP benchmark
* Create separate constants for prompts
* Create function for building task plan
* Transfer parsing logic to scorer
* Remove version
* Update with DROP
* Remove inspect-ai and treat it as implied

Co-authored-by: xeon27 <[email protected]>

* Add WinoGrande benchmark
* Create separate prompt templates
* Use scorer and config inline
* Add function to build plan
* Mypy fix
* Update with WinoGrande
* Remove minor bug

Co-authored-by: xeon27 <[email protected]>

* Proposal to Store Intermediate Results
* Improved the legibility of the intermediate reduced scores
  - Add an explanation
  - Remove the explanation from each subscore stored in metadata
* Refine naming, captured data, schema

Co-authored-by: jjallaire-aisi <[email protected]>
Co-authored-by: aisi-inspect <[email protected]>

Co-authored-by: jjallaire-aisi <[email protected]>

Fix environment variable name

* Don’t include override css in bundled version
* update changelog
* update version

Co-authored-by: jjallaire-aisi <[email protected]>

…ck API (#252)

* First draft: Refactor Bedrock API to enable tool calling for Llama 3.1 models
  - Introduced `chat_api_input` function to format input messages with tools, in line with what is done for Azure.
  - Replaced `ChatMessage` with `ChatAPIMessage` in relevant methods to support the new input format.
  - Updated `completion_choice` methods to use `ChatAPIHandler` for parsing the assistant response, in line with the Azure code.
  - Refactored a number of other functions to be in line with the above changes.
* fix: remove old Llama3ChatHandler

Co-authored-by: jjallaire-aisi <[email protected]>

* support for strict mode in openai tool calling
* fix typing error in tests

Co-authored-by: jjallaire-aisi <[email protected]>
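For context on the strict-mode item above, here is a minimal sketch of what a strict tool definition looks like in the OpenAI Chat Completions tool format. The helper name `strict_tool` and the example `bash` tool are hypothetical and used only for illustration; this is not the code from the referenced commit.

```python
from typing import Any


def strict_tool(
    name: str, description: str, properties: dict[str, Any]
) -> dict[str, Any]:
    """Build an OpenAI-style function tool definition with strict mode on.

    Hypothetical helper for illustration only. Strict mode expects every
    property to be listed in `required` and `additionalProperties` to be False.
    """
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": properties,
                # Strict mode: all declared properties must be required.
                "required": list(properties),
                # Strict mode: no undeclared arguments are allowed.
                "additionalProperties": False,
            },
        },
    }


# Example tool definition (the tool itself is an assumption).
tool = strict_tool(
    "bash",
    "Run a shell command.",
    {"cmd": {"type": "string", "description": "Command to run."}},
)
```

The resulting dictionary would be passed in the `tools` list of a chat completion request; with `strict` set, the model's tool arguments are constrained to match the declared schema.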

Co-authored-by: jjallaire <[email protected]>

* use init to speed shutdown
  this enables containers to correctly receive exit signals and exit immediately.
* update changelog

Co-authored-by: aisi-inspect <[email protected]>

* initial work on subtasks * update readme * remove readme (it's now in the PR writeup) * skip tool store test if no openai * fix typo * correct message reading * correct type * Proof of concept for JSDoc Styles Baseline configuration with baseline implementation in a few places * Use yarn to manage preact / htm This allows the types to flow from package.json * Fully type util as proof of concept * Cleanup in utils * Proof of concept using types from log.d.ts * Rough in of transcript dump * Try to reign in type checking for prism here * update to latest prettier * Conditionally show transcript tab * Another solver rendering * Move transcript views * Including TZ information in timestamp * Tweaked step implementation * Add tools to model dump * Revise model event view (still WIP) * More structured transcript * A little more air * A little more tweaking * fix prettier complaint * trying updating yarn.lock * Attempt to force resolution * Remove `json-schema-to-typescript` This is causing a failure in resolving dependencies which prevents yarn install from working. 
* Improved state change view * Further fine tuning of appearance * Ensure store support too * More standard appearance * Fix content size * Improve appearance * Properly render objects and arrays in state changes * Improve grid appearance * Remove unused imports * Correct subtask inline display * Simplify state change event rendering * Fix prettier * Share event layout * Improve logger event view * add ScoreEvent * remove unused var * track state changes in transcript steps * remove subtask stuff from web_search for now * Improve state changes rendering * Remove logger event title for more compactness (also includes improvements to the transcript event view itself) * Add a scorer event handler * Improve subtask rendering * fix heading cursor * Improve Score Event View * merge from main * turn event into a base class * don't export event types * regen schema * fixup imports * revert event type changes * write schema to types dir * transcript events: no export and standard 'data' field * regen schema * fix transcript test * use pydantic 2.0 field_serialiser * Revert "transcript events: no export and standard 'data' field" This reverts commit 5f2b654. * use pydantic 2.0 field_serialiser * don't export events * remove unused import * log the full logging message * rename log method * drive transcript events into recorder * write through to trsc file * cleaner interface for transcript event forwarding * initial write to sqlite * Standardize font-size on rem * decorate the html tag for the logview so it can detect vscode immediately * Improve column allocation * Create Shared Fonts Object Move all things to a shared notion of fonts and styles that can be re-used easily. Use font scaling in vscode to achieve the correct appearance (now that we’re rem based we can just change the base font size). 
* Move summary stats into navbar * Restructure navbar into workspace * Improve progress bar appearance * Improve column sizing * Refactor tab appearance into navbar * Adjust correct/incorrect appearance * Baseline pill improvements * fix heading height in vscode * correct sidebar * Improve sidebar appearance (+prettier) * widen sidebar slightly * Sample Display Tweaks * Tweaks to config * initial work on content db * more comprehensive event walking * de-duplicate content in transcript events * Remove timestamps, correct prop name * Baseline implementation of evalevents resolving plus some prettier formatting * Correct content resolution * remove logging section of evallog (now covered in sample transcript) * Improve density when hosted in vscode at narrow short sizes * Revised appearance to grouped cards * formatting * A little more tweakage * generate_loop and tool_call steps * Fix lint issues * no srsly fix lint * resolve circular import * run prettier on event panel * Fix error w/specific logs * update test * Improve find band appearance * sample init event * Proof of concept state rendering * Relocate state code since it will grow * correct resolution of objects * lint and formatting * sample_init event * Add collapsible state to event panel, collapse certain panels * Subtask rough in * ensure we have vite * Correct merge + regen dist * add a watch command * correct formatting * correct build (investigating why my local build wasn’t flagging this) * include source maps * Add Sample Init, track state across transcript * fix lint * update dist * ensure nav-pills align bottom * correct lint * Add chat view to model * prettier * ran commands in wrong order * Improve sample init event (still mostly a dump of data) * Add all messages view * Simplify transcript view * Improvements to display * Chatview should show tool call even if no message * Improve state event display * Display choices in sampleinit event * Improve Score Event appearance * Tweak title results * 
More appearance tweakage * Improve tab appearance * Fix tab selection issue in subtask transcripts * Improved spacing * Fix scoring metadata layout * toolcall event * initial work on raw request/response for model event * Add placeholder tool event * initial work on raw model calls * log raw tool argument not converted * log raw model call for anthropic * format attribs notebook * raw model request for mistral * Add depth to cards (with basic impl) * remove map * ignore source map * Add baseline model raw view * Improve state appearance * Improve log display * fix formatting * properly default to messages if no transcript * add one last debug message * Disable build checking with note * Appearance refinement - only start indenting at second level step - create section component * raw capture for google * Don’t capture args when logging This is doing a lot of work which shouldn’t be happening in a log handler (and the value of the args is suspect anyhow). Causing an exception in certain environments. 
* Remove disused imports * record raw api calls for groq * Improve root solver display - break up root cards - add sample init step (synthetic) * raw api call recording * raw model access for cloudflare * raw model output for azureai * Improve subtask display * raw model capture for vertex * eliminate qualifying note on tool descriptions * improve setup display * Add ToolView * improved agents api docs * Tweaks * eliminate tool steps * hide agents api for now * agents api docs * Resolve the model call contents * Special case handling for sample init event title (no dupe title) * Improve logging event appearance * more tool docs * rename to agents api * remove bash prompt * Correct transcript icons * improve tab selection behavior * Improved model display * Correct font size in metadatagrid * initial work on tool transcript * more tool event work * schema updates * Refactor content resolution to the very top level (move it out of transcript view - it expects to receive events with resolve content) * Resolve the whole state object for events * remove generate_loop step type * Fix ruff errors * don’t force wrap in weird ways * Correct tool rendering for state events * Baseline visual diff implementation * Move tools to single transcript tab * Improve tool event output rendering * Don’t intend tool input / output * enable docs + add parallel execution * Fix prism coloring for js, python, json * show no output message if there is not tool output * allow event titles to wrap * Improve wrapping at small sizes (model event) * crossref to agents api article --------- Co-authored-by: aisi-inspect <[email protected]> Co-authored-by: Charles Teague <[email protected]> Co-authored-by: jjallaire-aisi <[email protected]>

max-kaufmann · 2024-09-11T11:56:45Z

@jjallaire I made the changes! (and removed all the files which were changed as part of the failed merge). I have one more bug to hunt down, and then everything should be good!

jjallaire · 2024-09-11T18:25:26Z

Thanks! Have you nailed the bug yet? If so I'd say just run ruff format and we should be good to merge.

xeon27 and others added 30 commits August 14, 2024 13:32

mypy cleanup for drop benchmark

8997eb7

WinoGrande benchmark (#236)

ecc0a7e

* Add WinoGrande benchmark
* Create separate prompt templates
* Use scorer and config inline
* Add function to build plan
* Mypy fix
* Update with WinoGrande
* Remove minor bug

Co-authored-by: xeon27 <[email protected]>

Update CHANGELOG.md

245eb90

Update CHANGELOG.md

effbea5

vscode: correct env var for model base url

a3b8018

Update inspect-constants.ts

a185819

default image logging to false

43d55ff

reformat attribs

fc542b2

Update _model_output.py (#248)

f48f6fd

Co-authored-by: jjallaire-aisi <[email protected]>

Merge branch 'main' into patch-1

de4c1ef

Merge pull request #247 from hanbyul-kim/patch-1

d82ce3e

Fix environment variable name

Don’t include override css in bundled version (#249)

12c6511

* Don’t include override css in bundled version
* update changelog
* update version

Co-authored-by: jjallaire-aisi <[email protected]>

task panel: don't print model, remove unused config arg (#250)

86f2c04

Update CHANGELOG.md

5fc8c7c

update changelog

2227099

Update CHANGELOG.md

95e95f2

use strict mode for openai tool calls (#225)

c91be38

* support for strict mode in openai tool calling
* fix typing error in tests

Co-authored-by: jjallaire-aisi <[email protected]>

report json schema validation errors to model in tool response.

b4b6e7c

fixed small type (#254)

9dfe852

Co-authored-by: jjallaire <[email protected]>

update changelog

f0d99b6

update comment to not reference eval_gather

06e41bf

adding swebench

be1f70c

intermediate changing

61d4feb

moving a name

11e28ed

docker init for faster shutdown (#257)

e1f591f

* use init to speed shutdown
  this enables containers to correctly receive exit signals and exit immediately.
* update changelog

Co-authored-by: aisi-inspect <[email protected]>

fix broken anchor (#258)

e824d8a

max-kaufmann added 5 commits September 11, 2024 12:35

adding the trajectories

e6c1e29

gitignore

1efd15e

ruff

098aa34

ruff

4a48422

delete baselines

19d5632

max-kaufmann added 3 commits September 11, 2024 13:00

delete bottom bit of PR

5c91048

dataset single instance

c4efa66

run swe-bench

420df88

max-kaufmann added 3 commits September 12, 2024 13:59

change README

89488e6

fixing json

372ad2a

some edits

fba66c0

jjallaire self-requested a review September 12, 2024 18:15

Merge remote-tracking branch 'origin/main' into max/swe_bench

409de1a

jjallaire-aisi self-requested a review September 12, 2024 19:25

jjallaire-aisi approved these changes Sep 12, 2024


max-kaufmann added 11 commits September 12, 2024 23:09

repo specific scikit thing

86ffe2e

add baselines to gitignore

e02a2e0

fixing up the edit

65e4a89

removing ERROR

d332ce2

create_test_repos

6cc1517

ruff

f4f7cb9

ruff

dbb00a3

vscode revert

953eb1a

create test repos

f829289

Merge remote-tracking branch 'origin/main' into max/swe_bench

1c61bfe

requests

d771f45

max-kaufmann merged commit 50b2f36 into main Sep 12, 2024
9 checks passed

jjallaire-aisi deleted the max/swe_bench branch September 13, 2024 00:35
