
Conversation

@mikecann (Contributor) commented Aug 27, 2025

This one builds upon the work in this PR:
#76

Once I was able to run evals locally and view their run logs, I noticed that we were often failing evals for reasons that I don't think were actually valid.

The grader was checking the literal schema and function shape, but the task did not specify the expected shape. That left wiggle room in how the task could be interpreted, so different models would output different code and fail even though the answer is correct when compared against the task.

So the bulk of the work in this PR is to "make the grading fair".

That means going through every single eval and comparing the task to what we grade it on in the grader.test.ts files. If the tests expect certain functions with certain arguments, then the task should be explicit that the AI should generate functions with those exact names.

Once I did that, I went through them all again and looked for additional tests we could add to cover parts of the task that hadn't been tested yet. This was important because I removed the too-strict compareFunctionSpec and compareSchema from all the evals, so there was now API surface that we needed to grade against with ordinary tests.
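To make that concrete, here is a rough sketch of the kind of behaviour-focused grader test this moves towards. The helper (createTestClient) and the example functions (messages.insertMessage, messages.listMessages) are hypothetical and not taken from this repo; the point is just that the grader calls the functions the task explicitly names and asserts on their behaviour rather than comparing literal schemas or function specs.

```ts
// grader.test.ts: hypothetical sketch, not this repo's actual helpers.
import { expect, test } from "vitest";
import { api } from "./_generated/api"; // generated types, hence the _generated step mentioned below
import { createTestClient } from "./testHarness"; // hypothetical helper

test("insertMessage stores a message that listMessages returns", async () => {
  const client = await createTestClient();

  // The task now names the functions and arguments explicitly,
  // so every model is graded against the same contract.
  const id = await client.mutation(api.messages.insertMessage, {
    author: "alice",
    body: "hello",
  });

  const messages = await client.query(api.messages.listMessages, {});
  expect(messages).toEqual([
    expect.objectContaining({ _id: id, author: "alice", body: "hello" }),
  ]);
});
```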

There are, however, circumstances where the task specifies something that we are unable to test for using unit tests. For example, 001/008 asks the AI to use "helper functions" that it then calls from its queries. This is not testable via unit testing, so that's what the next PR is about: it allows us to test those tasks by introducing "AI Grading".

You will note that there are some other changes in here in addition to the above that kind of slipped in:

  • added gpt-5-mini and nano as models for use while I was testing this
  • added a couple more "helper" functions for graders
  • scoring changed to be binary pass/fail instead of graduated based on the number of tests in a grader (see the sketch after this list). Feel free to tell me to change that back if you still feel strongly about it though.
  • added a helper script to run the grader against the actual answers to make sure that the answer and the grader agree
  • added a helper script to run through every eval and generate the _generated directory, which is needed in the graders for type-checking
  • added some more context to the local results reporting for use in visualiseResults.ts (next item)
  • visualiseResults.ts is a local dashboard that I vibed in an hour or so to visualise the results. It's all in one file, which is ugly as sin. This is very much a "nice to have" thing that I added though, and I don't expect it to ever be maintained. Feel free to tell me to remove it.
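For the scoring change mentioned in the list above, this is roughly the idea; a minimal sketch only, and the actual report and score types in the repo may look different.

```ts
// Hypothetical sketch of the scoring change; the real report shape may differ.
interface GraderReport {
  passed: number;
  failed: number;
}

// Old: graduated score based on how many grader tests passed.
function graduatedScore({ passed, failed }: GraderReport): number {
  const total = passed + failed;
  return total === 0 ? 0 : passed / total;
}

// New: binary pass/fail, so the eval only scores 1 when every test passes.
function binaryScore({ failed }: GraderReport): number {
  return failed === 0 ? 1 : 0;
}
```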

@mikecann mikecann requested a review from jordanhunt22 August 27, 2025 01:30
@mikecann mikecann mentioned this pull request Aug 27, 2025
@mikecann mikecann changed the base branch from mikec/run_locally to main August 29, 2025 02:14