
Extractive Match metric #495

Merged · hynky1999 merged 10 commits into main on Jan 15, 2025
Conversation

@hynky1999 (Collaborator) commented Jan 11, 2025

What is this PR about

This PR adds a new metric called extractive match, which is useful for evaluating models in generative mode. Specifically, it can evaluate the following types of outputs:

  • Math (both latex and plain expressions)
  • Indices (A/B/C)

It partially supports multiple languages, but compared to English the support is limited and should be extended further in the future.

How it works

Each output type has its own answer-search regexes. The extractor cycles over these regexes in priority order and tries to extract and parse an answer from the model generation; it stops at the first successful extraction and parse. The parsing is done with my extension of the latex2sympy2 library.
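Below is a minimal sketch of that loop (my own illustration, not the actual lighteval code); the regex patterns and the try_parse helper are assumptions for the example:

import re

# Illustrative (priority, pattern) pairs for the three answer types;
# a lower priority number means the pattern is tried first.
EXTRACTION_REGEXES = [
    (0, re.compile(r"answer is\s*\$(?P<latex>[^$]+)\$", re.IGNORECASE)),
    (1, re.compile(r"answer is\s*(?P<expr>-?\d+(?:/\d+)?)", re.IGNORECASE)),
    (2, re.compile(r"answer is\s*\(?(?P<index>[A-D])\)?", re.IGNORECASE)),
]

def try_parse(match):
    """Placeholder for the real parsing step (latex2sympy2 for latex groups)."""
    return match.group(match.lastgroup) if match.lastgroup else None

def extract_answer(generation):
    """Cycle over the regexes by priority; stop at the first successful parse."""
    for _, pattern in sorted(EXTRACTION_REGEXES, key=lambda p: p[0]):
        for m in pattern.finditer(generation):
            parsed = try_parse(m)
            if parsed is not None:  # only keep answers that actually parsed
                return parsed
    return None

print(extract_answer("So the answer is $\\frac{1}{2}$."))  # -> \frac{1}{2}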

After extraction, strings are simply compared, but for sympy (latex and expression parsing) there is a lot of logic to make sure the comparison is done correctly:

Features

Extraction:

  • Supports extraction of indices (A/B/C/D), latex expressions ($1$) and numerical expressions (1/2); these can be combined for better retrieval
  • Only extracts answers that have been successfully parsed, which allows for better retrieval
  • Supports most of the common latex delimiters ($ $, $$ $$, \[ \], etc.)
  • Very robust heuristics for answer retrieval despite format misalignment
  • Uses translation literals, so it supports multilinguality with minimal code changes

Parsing:

  • The only parser supporting proper set theory (Intervals, sets [with very lenient parsing], FiniteSets) and set expressions (union, intersection, etc.)
  • Able to parse \text and other non-numerical expressions, converting them to symbols.
  • Support for all kinds of equations, while correctly resolving assignments (k=1 -> 1).
  • Normalizes latex and fixes common latex malformations (removal of \left/\right, \frac and \sqrt fixes, parenthesis fixes, etc.); see the sketch after this list
  • Supports Unicode symbols in place of proper latex commands
  • Correctly handles percentages and units in most cases
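To make the normalization step concrete, here is a minimal sketch, assuming a toy unicode mapping and a couple of regex fixes (not the actual lighteval rules):

import re

# Assumed unicode-to-latex mapping, for illustration only
UNICODE_TO_LATEX = {"π": r"\pi", "≤": r"\le", "≥": r"\ge"}

def normalize_latex(s):
    """Strip \\left/\\right, map unicode symbols to latex commands,
    and patch a common malformation (\\frac12 -> \\frac{1}{2})."""
    s = re.sub(r"\\left|\\right", "", s)
    for uni, cmd in UNICODE_TO_LATEX.items():
        s = s.replace(uni, cmd)
    s = re.sub(r"\\frac(\d)(\d)", r"\\frac{\1}{\2}", s)
    return s

print(normalize_latex("\\left(\\frac12\\right) ≤ π"))  # -> (\frac{1}{2}) \le \pi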

Comparison:

  • Supports both numerical (with epsilon tolerance) and symbolic comparison; see the sketch after this list
  • Stricter numerical comparison (no precision tolerance when both sides are plain numbers [rationals/floats/integers]) plus rounding support, so that 1/3 == 0.333333
  • Full support for matrix expression equivalences
  • Full support for set and interval comparisons
  • Relations are fully supported and correctly evaluated, even considering flipped relations (e.g. a < 2 == 2 > a)
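A minimal sketch of that two-stage comparison, assuming sympy and a hypothetical exprs_equal helper (the real logic handles many more cases):

from sympy import Rational, simplify, sympify

def exprs_equal(gold, pred, precision=6):
    """Numeric comparison first (strict when both sides are rationals,
    rounding-based otherwise), then symbolic comparison as a fallback."""
    gold, pred = sympify(gold), sympify(pred)
    if gold.is_number and pred.is_number:
        if gold.is_Rational and pred.is_Rational:
            return gold == pred  # no tolerance for two plain rationals
        # rounding support: match up to the given number of digits
        return abs(float(gold) - float(pred)) < 10 ** (-precision)
    # symbolic: equal iff the difference simplifies to zero
    return simplify(gold - pred) == 0

print(exprs_equal(Rational(1, 3), 0.333333))  # -> True
print(exprs_equal("x + x", "2*x"))            # -> True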

How it's tested

There is a multitude of tests, which I collected during the creation of the metric.
Additionally, I tested the math extraction workflow on MATH-HARD using leaderboard generations (607 models), with the following results:

  • harness: 0.0808
  • qwen: 0.1271
  • lighteval-0f21c935: 0.1327

After manual inspection and based on the quantified results, the new extraction workflow is the most accurate.

@HuggingFaceDocBuilderDev (Collaborator): The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@clefourrier (Member) left a comment:

Super nice and in-depth work - I think if you have the time, a small schematic of the different steps of the logic would be great (and we could use it in a blog)

Resolved review thread (outdated) on src/lighteval/metrics/utils/math_comparisson.py


def safe_sympy_doit(a: Basic | MatrixBase):
    """Safely execute doit() on a sympy expression, catching exceptions."""
@clefourrier (Member):
You can add a line explaining that doit evaluates complex expressions if possible (it's not a common fn, so some doc will be useful).

@clefourrier (Member):
Do you know if we might need deep = False?

@hynky1999 (Collaborator, Author):
Nn, otherwise 1+1+1 != 3. It would only call evaluate on the first expression node, resulting in 2+1 or 1+2 (depending on the expr tree).

@hynky1999 (Collaborator, Author):
One issue with this is that it will also evaluate integrals, which you might want to prevent: if the model answers \int_1^3 x to the question "what is the integral from 1 to 3 of x", it should get 0 points (now it would get 1, because the integral expands).

I think there is a way to prevent this by a) only evaluating certain types of nodes (sub/div, etc.), b) only doing numeric comparison if the expression doesn't contain an integral or similar constructs (because calling evalf will again expand the integral), or c) somehow making sure that the symbolic equivalence will not evaluate it to 1 (this one seems super, super hard). TLDR: it's too much work.
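For a concrete illustration of the problem (my own example, not from the PR): sympy's doit() happily evaluates an unevaluated Integral, so the literal integral answer compares equal to the evaluated gold value:

from sympy import Integral, symbols

x = symbols("x")
model_answer = Integral(x, (x, 1, 3))  # the model literally answers \int_1^3 x dx
print(model_answer.doit())             # -> 4, the evaluated integral
print(model_answer.doit() == 4)        # -> True, so the match would score as correct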

Additional resolved review threads (outdated) on src/lighteval/metrics/utils/math_comparisson.py and src/lighteval/metrics/dynamic_metrics.py
@hynky1999 requested a review from @clefourrier on January 13, 2025 at 14:23
@clefourrier (Member) left a comment:

overall lgtm

@hynky1999 merged commit 59624c8 into main on Jan 15, 2025
4 checks passed