Extractive Match metric #495
Conversation
Super nice and in-depth work - I think if you have the time, a small schematic of the different steps of the logic would be great (and we could use it in a blog).
def safe_sympy_doit(a: Basic | MatrixBase):
    """Safely execute doit() on a sympy expression, catching exceptions.
You can add a line to explain that doit evaluates complex expressions if possible. (It's not a common fn so some doc will be useful).
Do you know if we might need deep = False?
No, otherwise 1+1+1 != 3. It would only call evaluate on the first expression node, resulting in 2+1 or 1+2 (depending on the expression tree).
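A quick illustration of the difference (a sketch; exact doit(deep=...) semantics can vary with the sympy version):

```python
from sympy import Add, Integer

# Build 1 + 1 + 1 as a nested, unevaluated expression tree
expr = Add(Integer(1), Add(Integer(1), Integer(1), evaluate=False), evaluate=False)

print(expr.doit())            # 3 -- the whole tree is evaluated
print(expr.doit(deep=False))  # nested nodes may stay unevaluated, so this need not reduce to 3
```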
One issue with this is that it will also evaluate integrals, and you might want to prevent that: if the model answers \int_1^3 x to the question "what is the integral from 1 to 3 of x", it should get 0 points, but right now it would get 1, because the integral gets expanded.
I think there is a way to prevent this by a) only evaluating certain types of nodes (sub/div etc.), b) only doing the numeric comparison if the expression doesn't contain an integral or similar (because calling evalf will again expand the integral), or c) somehow making sure that the symbolic equivalence check does not evaluate it and score it as 1 (this one seems super, super hard). TL;DR: it's too much work.
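For example, this is standard sympy behavior: doit() evaluates an unevaluated Integral, so a model that merely restates the question would compare equal to the gold answer:

```python
from sympy import Integral, Integer, Symbol

x = Symbol("x")
restated = Integral(x, (x, 1, 3))      # the model just echoes the integral from the question
print(restated.doit())                 # 4 -- doit() evaluates the integral
print(restated.doit() == Integer(4))   # True, so the restated integral would be scored as correct
```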
overall lgtm
What is this PR about
This PR adds a new metric called extractive match, which is useful for evaluating models in generative mode. Precisely, it can evaluate the following types of outputs:
It partially supports multiple languages, but compared to English the support is limited and should be further extended in the future.
How it works
Each output type has its own regex-based answer search; the extractor cycles over these regexes in priority order and tries to extract and parse an answer from the model generation, stopping at the first successful extraction and parse (see the sketch below). The parsing is done with my extension of the latex2sympy2 library.
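A rough sketch of the priority-driven extraction loop described above (the rule names, regexes, and parsers here are illustrative, not the PR's actual ones):

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ExtractionRule:
    priority: int                    # lower value = tried first
    pattern: re.Pattern              # regex locating a candidate answer
    parse: Callable[[str], object]   # e.g. latex2sympy2-based parsing, or plain string handling

def extract_answer(generation: str, rules: list[ExtractionRule]) -> Optional[object]:
    """Try rules in priority order; stop at the first successful match + parse."""
    for rule in sorted(rules, key=lambda r: r.priority):
        for match in rule.pattern.finditer(generation):
            try:
                return rule.parse(match.group(1))
            except Exception:
                continue  # parsing failed, try the next candidate / rule
    return None

# Example: prefer a boxed LaTeX answer, fall back to a plain-text "answer is ..." pattern.
rules = [
    ExtractionRule(0, re.compile(r"\\boxed\{([^}]*)\}"), str.strip),
    ExtractionRule(1, re.compile(r"answer is\s*([^\n.]+)", re.IGNORECASE), str.strip),
]
print(extract_answer(r"The answer is \boxed{42}.", rules))  # -> '42'
```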
After extraction, strings are simply compared directly, but for sympy results (LaTeX and expr parsing) there is a lot of logic to make sure the comparison is done correctly:
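A hedged sketch of the general shape of such a sympy comparison for scalar expressions (structural equality, then symbolic equivalence, then a numeric fallback; this is my assumption of the idea, not the PR's exact logic):

```python
from sympy import Basic, simplify

def sympy_expr_eq(gold: Basic, pred: Basic, tolerance: float = 1e-6) -> bool:
    """Compare two parsed sympy expressions, from cheapest to most expensive check."""
    if gold == pred:                        # 1) structural equality
        return True
    try:
        if simplify(gold - pred) == 0:      # 2) symbolic equivalence
            return True
    except Exception:
        pass
    try:                                    # 3) numeric fallback when simplification can't decide
        return abs(float(gold.evalf()) - float(pred.evalf())) < tolerance
    except Exception:
        return False
```

Note that steps 2 and 3 would both end up evaluating integrals, which is exactly the caveat raised in the review discussion above.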
Features
Extraction:
Parsing:
Comparison:
How it's tested
There is a multitude of tests, which I collected during the creation of the metric.
Additionally, I tested the math extraction workflow on MATH-HARD generations from the leaderboard (607 models), with the following results:
The new extraction workflow is the most accurate, based on both manual inspection and the quantified results.