Add proposed exact and f1 scorers. #306

Merged
merged 13 commits into from
Sep 4, 2024

dragonstyle
Collaborator

`exact()`
Scorer that normalizes the text of the answer and target(s) and then compares them for an exact match. It returns `CORRECT` when the answer exactly matches one or more targets.
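
To make the normalize-then-compare idea concrete, here is a minimal sketch. The specific normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) are an assumption borrowed from common QA evaluation practice; this thread does not spell them out. The names `normalize` and `exact_match` are hypothetical and are not the scorer's actual API.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    These steps are an assumed normalization, not the PR's implementation.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(answer: str, targets: list[str]) -> bool:
    """Return True when the normalized answer equals at least one normalized target."""
    normalized_answer = normalize(answer)
    return any(normalized_answer == normalize(target) for target in targets)
```

Under these assumptions, `exact_match("The Eiffel Tower.", ["eiffel tower"])` matches, while `exact_match("Paris, France", ["Paris"])` does not, because the extra token survives normalization.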

`f1()`
Scorer which computes the `F1` score for the answer (which balances precision and recall by taking their harmonic mean).
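
As an illustration of the harmonic mean described above, here is a rough sketch of a token-level F1. Treating whitespace-separated, lowercased tokens as the unit of comparison is an assumption (it follows the usual reading-comprehension convention); the thread only states that precision and recall are combined. `token_f1` and `best_f1` are hypothetical names, not the scorer's API.

```python
from collections import Counter


def token_f1(answer: str, target: str) -> float:
    """F1 = 2 * P * R / (P + R) over lowercased whitespace tokens (assumed tokenization)."""
    answer_tokens = answer.lower().split()
    target_tokens = target.lower().split()
    if not answer_tokens or not target_tokens:
        # Two empty strings count as a perfect match; otherwise there is no overlap.
        return float(answer_tokens == target_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(answer_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(answer: str, targets: list[str]) -> float:
    """Score against every target and keep the best value (0.0 when there are no targets)."""
    return max((token_f1(answer, target) for target in targets), default=0.0)
```

For example, the answer "the cat sat" against the target "the cat" has precision 2/3 and recall 1, giving an F1 of 0.8.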

dragonstyle requested a review from jjallaire August 31, 2024 21:37
Collaborator

jjallaire-aisi left a comment

Looks good! Reminder to update the drop benchmark w/ this + add a changelog entry.

dragonstyle marked this pull request as ready for review September 4, 2024 14:44

Tested against main baseline using 100 samples

### GPT-4o
| Branch | Accuracy | StdErr |
|--------|----------|--------|
| main | 0.856 | 0.0257 |
| f1 | 0.879 | 0.0229 |

### GPT-4o-mini
| Branch | Accuracy | StdErr |
|--------|----------|--------|
| main | 0.868 | 0.031 |
| f1 | 0.877 | 0.0285 |

jjallaire merged commit b8d60fa into main Sep 4, 2024
9 checks passed
jjallaire deleted the feature/f1 branch September 4, 2024 21:08
@dragonstyle
Collaborator Author

@xeon27 I made a change to the DROP benchmark implementation to allow it to use our new built-in f1 scorer (which takes a function to extract the answer). The biggest area of change is how we form the targets when reading the sample data here:

def parse_answer(answer: Dict) -> str:

Rather than using Tuples, I'm just forcing the lower-level answers to strings, which can then be easily handled by the rest of the code (with no special processing required when scoring). If you have a moment to review this, I'd appreciate any thoughts or alternatives if you think there is a better way!
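
For readers following along, here is one way "forcing the lower-level answers to strings" could look. This is only a sketch: the body of `parse_answer` is not shown in this thread, and the `number`, `spans`, and `date` fields are assumed from the public DROP dataset layout rather than taken from this PR.

```python
from typing import Any, Dict


def parse_answer(answer: Dict[str, Any]) -> str:
    """Flatten one DROP-style answer record into a single target string.

    Field names are assumptions about the dataset, not the PR's code.
    """
    # Numeric answers: use the number as-is.
    number = str(answer.get("number", "")).strip()
    if number:
        return number
    # Span answers: join all spans into one string instead of keeping a tuple.
    spans = answer.get("spans") or []
    if spans:
        return ", ".join(str(span) for span in spans)
    # Date answers: join whichever of day, month, and year are present.
    date = answer.get("date") or {}
    parts = [str(date.get(key, "")).strip() for key in ("day", "month", "year")]
    return " ".join(part for part in parts if part)
```

Because every branch returns a plain `str`, the resulting targets can be passed to a string-based scorer without any tuple handling at scoring time.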

max-kaufmann pushed a commit that referenced this pull request Sep 9, 2024
* Add proposed `exact` and `f1` scorers.

* Add changelog entry

* Round values to match f1 behavior exactly

* Improve list of targets display

* Convert the DROP benchmark to use the built-in f1

* Optionally show error card

* cleanup drop imports

* correct test for rounding

* Improve F1 scorer to accept an answer extraction function

* Update drop benchmark to simplify data reading
