Mikec/upgrade graders #77
Open
mikecann wants to merge 71 commits into main from mikec/upgrade_graders
+6,274
−1,349
This one builds upon the work in this PR:
#76
Once I was able to run evals locally and view their run logs, I noticed that we were often failing evals for reasons that I don't think were valid grounds for failure.
The grader was checking against a literal schema and function shape, but the task did not specify the expected shape. That left wiggle room in how the task could be interpreted, so different models would output different code and fail even though the answer was correct when compared against the task.
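To make the problem concrete, here is a minimal sketch of how a literal schema comparison can fail a submission that still does what the task asked. Everything in it is hypothetical: the `Schema` type, `sameSchema`, `behaviourPasses`, and the example task ("count messages by author") are illustrative stand-ins, not the repository's actual grader code.

```typescript
// Hypothetical sketch: none of these names come from the repository.
type Schema = Record<string, Record<string, string>>; // table -> field -> type

interface Submission {
  schema: Schema;
  // The behaviour the (hypothetical) task asks for: count messages by author.
  countByAuthor: (args: { author: string }) => number;
}

// Strict grading: the generated schema must match the reference field for field.
function sameSchema(expected: Schema, actual: Schema): boolean {
  const expectedTables = Object.entries(expected);
  if (expectedTables.length !== Object.keys(actual).length) return false;
  return expectedTables.every(([table, wantFields]) => {
    const gotFields = actual[table];
    if (gotFields === undefined) return false;
    const want = Object.entries(wantFields);
    return (
      want.length === Object.keys(gotFields).length &&
      want.every(([field, type]) => gotFields[field] === type)
    );
  });
}

// Behavioural grading: only checks what the task explicitly specified.
function behaviourPasses(s: Submission): boolean {
  return (
    s.countByAuthor({ author: "ada" }) === 2 &&
    s.countByAuthor({ author: "nobody" }) === 0
  );
}

// Builds a submission backed by an in-memory list of rows.
function makeSubmission<Row extends { author: string }>(
  schema: Schema,
  rows: Row[],
): Submission {
  return {
    schema,
    countByAuthor: ({ author }) =>
      rows.filter((row) => row.author === author).length,
  };
}

const referenceSchema: Schema = {
  messages: { author: "string", body: "string" },
};

// Submission A happens to pick the same field names as the reference.
const submissionA = makeSubmission(referenceSchema, [
  { author: "ada", body: "hi" },
  { author: "ada", body: "again" },
  { author: "bob", body: "yo" },
]);

// Submission B solves the same task with a different, equally valid schema.
const submissionB = makeSubmission(
  { messages: { author: "string", text: "string", createdAt: "number" } },
  [
    { author: "ada", text: "hi", createdAt: 1 },
    { author: "ada", text: "again", createdAt: 2 },
    { author: "bob", text: "yo", createdAt: 3 },
  ],
);
```

In this sketch, `sameSchema(referenceSchema, submissionB.schema)` is `false` while `behaviourPasses(submissionB)` is `true`: the strict check rejects a submission whose only "fault" is a schema choice the task never constrained.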
So the bulk of the work in this PR is to "make the grading fair".
That means going through every single eval and comparing the task to what we are grading it on in the grader.test.ts files. If the tests expect certain functions with certain arguments, then the task should explicitly tell the AI to generate them with those exact names.
Once I did that, I went through them all again to see if there were more tests we could add to cover parts of the task that hadn't been tested. This was important because I removed the too-strict compareFunctionSpec and compareSchema from all the evals, so there was now potential API surface that we needed to grade against.
There are, however, circumstances where the task specifies something that we are unable to test for using unit tests. For example, 001/008 asks the AI to use "helper functions" that it then calls from its queries. This is not testable via unit testing, so that's what the next PR is about: it allows us to test those tasks by introducing "AI Grading".
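As a rough illustration of why the "helper functions" requirement escapes unit tests, consider the sketch below. The task, the `User` type, and all function names are hypothetical and are not taken from eval 001/008; the point is only that a test observing inputs and outputs cannot tell the two variants apart.

```typescript
// Hypothetical task: "return the display names of active users,
// using a helper function to format each name".
interface User {
  first: string;
  last: string;
  active: boolean;
}

// Variant A follows the instruction and delegates formatting to a helper.
function formatName(user: User): string {
  return `${user.first} ${user.last}`;
}

function activeNamesWithHelper(users: User[]): string[] {
  return users.filter((u) => u.active).map((u) => formatName(u));
}

// Variant B ignores the instruction and inlines the formatting.
function activeNamesInlined(users: User[]): string[] {
  return users.filter((u) => u.active).map((u) => `${u.first} ${u.last}`);
}

// A unit test only sees inputs and outputs, not how the code is structured.
function unitTestPasses(impl: (users: User[]) => string[]): boolean {
  const users: User[] = [
    { first: "Ada", last: "Lovelace", active: true },
    { first: "Charles", last: "Babbage", active: false },
  ];
  const result = impl(users);
  return result.length === 1 && result[0] === "Ada Lovelace";
}

const helperVariantPasses = unitTestPasses(activeNamesWithHelper);
const inlinedVariantPasses = unitTestPasses(activeNamesInlined);
```

Both `helperVariantPasses` and `inlinedVariantPasses` are `true`, even though only the first variant satisfies the structural requirement. Checking that kind of requirement needs something that reads the source itself rather than its outputs.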
You will note that there are some other changes in here, in addition to the above, that kind of slipped in:
- _generated directory, which is needed in the graders for type-checking
- visualiseResults.ts: this is a local dashboard that I vibed in an hour or so to visualise the results. It's all in one file, which is ugly as sin. This is very much a "nice to have" thing that I added though, and I don't expect it to ever be maintained. Feel free to tell me to remove this.