Add proposed `exact` and `f1` scorers #306
Conversation
`exact()` scorer, which normalizes the text of the answer and target(s) and performs an exact-match comparison of the text. This scorer returns `CORRECT` when the answer is an exact match to one or more targets. `f1()` scorer, which computes the `F1` score for the answer (balancing precision and recall by taking their harmonic mean).
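The two scorers described above can be sketched roughly as follows. This is a simplified illustration, not the actual inspect_ai implementation: the `normalize`, `exact_match`, and `f1_score` helpers are hypothetical, and the exact normalization rules (article removal, punctuation stripping) are assumptions modeled on common QA-metric conventions.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, drop English articles and punctuation, collapse
    # whitespace (assumed normalization; the real scorer may differ).
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(answer: str, targets: list[str]) -> bool:
    # CORRECT when the normalized answer equals any normalized target.
    norm = normalize(answer)
    return any(norm == normalize(target) for target in targets)

def f1_score(answer: str, target: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over
    # the overlap between answer tokens and target tokens.
    answer_tokens = normalize(answer).split()
    target_tokens = normalize(target).split()
    common = Counter(answer_tokens) & Counter(target_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `f1_score("the quick brown fox", "quick fox")` yields 0.8: after normalization the answer has three tokens, two of which overlap the target's two tokens, giving precision 2/3 and recall 1.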
Looks good! Reminder to update the drop benchmark w/ this + add a changelog entry.
…ature/f1

# Conflicts:
#   CHANGELOG.md
#   src/inspect_ai/_view/www/dist/assets/index.js
Tested against main baseline using 100 samples

### Gpt-4o

| Branch | Accuracy | StdErr |
|--------|----------|--------|
| main   | 0.856    | 0.0257 |
| f1     | 0.879    | 0.0229 |

### Gpt-4o-mini

| Branch | Accuracy | StdErr |
|--------|----------|--------|
| main   | 0.868    | 0.031  |
| f1     | 0.877    | 0.0285 |
@xeon27 I made a change to the DROP benchmark implementation to allow it to use our new built-in `f1` scorer (inspect_ai/benchmarks/drop/drop.py, line 175 in b8d60fa).

Rather than using tuples, I'm just forcing the lower-level answers to strings, which can then be easily handled by the rest of the code (with no special processing required when scoring). If you have a moment to review this, I'd appreciate any thoughts or alternatives if you think there is a better way!
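The string-coercion approach described here could look something like the sketch below. This is a minimal illustration of the idea, not the actual DROP benchmark code: the `flatten_answers` helper and the shape of the raw answer data (tuples of spans or date parts) are assumptions.

```python
def flatten_answers(raw_answers: list) -> list[str]:
    # Coerce each lower-level answer, which may be a tuple or list of
    # parts (e.g. answer spans or date components), into a single
    # string so downstream scoring needs no special-case handling.
    flattened = []
    for answer in raw_answers:
        if isinstance(answer, (tuple, list)):
            parts = [str(part) for part in answer if str(part).strip()]
            flattened.append(" ".join(parts))
        else:
            flattened.append(str(answer))
    return flattened
```

With answers flattened to plain strings up front, the `f1()` scorer can treat every target uniformly.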
* Add proposed `exact` and `f1` scorers. `exact()` scorer, which normalizes the text of the answer and target(s) and performs an exact-match comparison. Returns `CORRECT` when the answer exactly matches one or more targets. `f1()` scorer, which computes the `F1` score for the answer (the harmonic mean of precision and recall).
* Add changelog entry
* Round values to match f1 behavior exactly
* Improve list of targets display
* Convert the drop benchmark to using built-in f1
* Optionally show error card
* Cleanup drop imports
* Correct test for rounding
* Improve F1 scorer to accept an answer extraction function
* Update drop benchmark to simplify data reading