Add proposed `exact` and `f1` scorers #306
Conversation
`exact()` scorer, which normalizes the text of the answer and target(s) and performs an exact-match comparison of the text. This scorer returns `CORRECT` when the answer is an exact match to one or more targets. `f1()` scorer, which computes the `F1` score for the answer (balancing precision and recall by taking their harmonic mean).
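The two scorers described above can be sketched roughly as follows. This is a simplified illustration, not the actual inspect_ai implementation: the `normalize`, `exact_match`, and `f1_score` helpers are hypothetical, and the exact normalization rules (article removal, punctuation stripping) are assumptions modeled on common QA-metric conventions.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, drop English articles and punctuation, collapse
    # whitespace (assumed normalization; the real scorer may differ).
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(answer: str, targets: list[str]) -> bool:
    # CORRECT when the normalized answer equals any normalized target.
    norm = normalize(answer)
    return any(norm == normalize(target) for target in targets)

def f1_score(answer: str, target: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over
    # the overlap between answer tokens and target tokens.
    answer_tokens = normalize(answer).split()
    target_tokens = normalize(target).split()
    common = Counter(answer_tokens) & Counter(target_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `f1_score("the quick brown fox", "quick fox")` yields 0.8: after normalization the answer has three tokens, two of which overlap the target's two tokens, giving precision 2/3 and recall 1.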
Looks good! Reminder to update the drop benchmark w/ this + add a changelog entry.
…ature/f1

# Conflicts:
#   CHANGELOG.md
#   src/inspect_ai/_view/www/dist/assets/index.js
Tested against main baseline using 100 samples

### Gpt-4o

| Branch | Accuracy | StdErr |
|--------|----------|--------|
| main   | 0.856    | 0.0257 |
| f1     | 0.879    | 0.0229 |

### Gpt-4o-mini

| Branch | Accuracy | StdErr |
|--------|----------|--------|
| main   | 0.868    | 0.031  |
| f1     | 0.877    | 0.0285 |
@xeon27 I made a change to the DROP benchmark implementation to allow it to use our new built-in `f1` scorer (inspect_ai/benchmarks/drop/drop.py, line 175 in b8d60fa).

Rather than using tuples, I'm just forcing the lower-level answers to strings, which can then be easily handled by the rest of the code (with no special processing required when scoring). If you have a moment to review this, I'd appreciate any thoughts or alternatives if you think there is a better way!
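The string-coercion approach described here could look something like the sketch below. This is a minimal illustration of the idea, not the actual DROP benchmark code: the `flatten_answers` helper and the shape of the raw answer data (tuples of spans or date parts) are assumptions.

```python
def flatten_answers(raw_answers: list) -> list[str]:
    # Coerce each lower-level answer, which may be a tuple or list of
    # parts (e.g. answer spans or date components), into a single
    # string so downstream scoring needs no special-case handling.
    flattened = []
    for answer in raw_answers:
        if isinstance(answer, (tuple, list)):
            parts = [str(part) for part in answer if str(part).strip()]
            flattened.append(" ".join(parts))
        else:
            flattened.append(str(answer))
    return flattened
```

With answers flattened to plain strings up front, the `f1()` scorer can treat every target uniformly.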
* Add proposed `exact` and `f1` scorers. `exact()` scorer, which normalizes the text of the answer and target(s) and performs an exact-match comparison. Returns `CORRECT` when the answer exactly matches one or more targets. `f1()` scorer, which computes the `F1` score for the answer (the harmonic mean of precision and recall).
* Add changelog entry
* Round values to match f1 behavior exactly
* Improve list of targets display
* Convert the drop benchmark to using built-in f1
* Optionally show error card
* Cleanup drop imports
* Correct test for rounding
* Improve F1 scorer to accept an answer extraction function
* Update drop benchmark to simplify data reading