
Conversation

@mikecann (Contributor) commented Aug 27, 2025

This one builds upon the work in this PR:
#76

Once I was able to run evals locally and view their run logs, I noticed that we were often failing evals for reasons that I don't think were actually valid.

The grader was checking the literal schema and function shape, but the task did not specify the expected shape. That left wiggle room in how the task could be interpreted, so different models would output different code and fail even though the answer is correct when compared against the task.

So the bulk of the work in this PR is to "make the grading fair".

That means going through every single eval and comparing the task to what we grade it on in the grader.test.ts files. If the tests expect certain functions with certain arguments, then the task should be explicit that the AI should generate functions with those exact names.

Once I did that, I went through them all again and looked for additional tests we could add to cover parts of the task that hadn't been tested yet. This was important because I removed the too-strict compareFunctionSpec and compareSchema from all the evals, so there was now API surface that we needed to grade against with ordinary tests.
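To make that concrete, here is a rough sketch of the kind of behaviour-focused grader test this moves towards. The helper (createTestClient) and the example functions (messages.insertMessage, messages.listMessages) are hypothetical and not taken from this repo; the point is just that the grader calls the functions the task explicitly names and asserts on their behaviour rather than comparing literal schemas or function specs.

```ts
// grader.test.ts: hypothetical sketch, not this repo's actual helpers.
import { expect, test } from "vitest";
import { api } from "./_generated/api"; // generated types, hence the _generated step mentioned below
import { createTestClient } from "./testHarness"; // hypothetical helper

test("insertMessage stores a message that listMessages returns", async () => {
  const client = await createTestClient();

  // The task now names the functions and arguments explicitly,
  // so every model is graded against the same contract.
  const id = await client.mutation(api.messages.insertMessage, {
    author: "alice",
    body: "hello",
  });

  const messages = await client.query(api.messages.listMessages, {});
  expect(messages).toEqual([
    expect.objectContaining({ _id: id, author: "alice", body: "hello" }),
  ]);
});
```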

There are, however, circumstances where the task specifies something that we are unable to test for using unit tests. For example, 001/008 asks the AI to use "helper functions" that it then calls from its queries. This is not testable via unit testing, so that's what the next PR is about: it allows us to test those tasks by introducing "AI Grading".

You will note that there are some other changes in here in addition to the above that kind of slipped in:

  • added gpt-5-mini and nano as models for use while I was testing this
  • added a couple more "helper" functions for graders
  • scoring changed to be binary pass/fail instead of graduated based on the number of tests in a grader (see the sketch after this list). Feel free to tell me to change that back if you still feel strongly about it though.
  • added a helper script to run the grader against the actual answers to make sure that the answer and the grader agree
  • added a helper script to run through every eval and generate the _generated directory, which is needed in the graders for type-checking
  • added some more context to the local results reporting for use in visualiseResults.ts (next item)
  • visualiseResults.ts is a local dashboard that I vibed in an hour or so to visualise the results. It's all in one file, which is ugly as sin. This is very much a "nice to have" thing that I added though, and I don't expect it to ever be maintained. Feel free to tell me to remove it.
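For the scoring change mentioned in the list above, this is roughly the idea; a minimal sketch only, and the actual report and score types in the repo may look different.

```ts
// Hypothetical sketch of the scoring change; the real report shape may differ.
interface GraderReport {
  passed: number;
  failed: number;
}

// Old: graduated score based on how many grader tests passed.
function graduatedScore({ passed, failed }: GraderReport): number {
  const total = passed + failed;
  return total === 0 ? 0 : passed / total;
}

// New: binary pass/fail, so the eval only scores 1 when every test passes.
function binaryScore({ failed }: GraderReport): number {
  return failed === 0 ? 1 : 0;
}
```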

@mikecann mikecann requested a review from jordanhunt22 August 27, 2025 01:30
@mikecann mikecann mentioned this pull request Aug 27, 2025
@mikecann mikecann changed the base branch from mikec/run_locally to main August 29, 2025 02:14