Extractive Match metric #495
Conversation
Super nice and in-depth work - I think if you have the time, a small schematic of the different steps of the logic would be great (and we could use it in a blog).
def safe_sympy_doit(a: Basic | MatrixBase):
    """Safely execute doit() on a sympy expression, catching exceptions.
You can add a line to explain that doit evaluates complex expressions if possible. (It's not a common fn so some doc will be useful).
Do you know if we might need deep = False?
No, otherwise 1+1+1 != 3. It would only call evaluate on the first expression node, resulting in 2+1 or 1+2 (depending on the expression tree).
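A quick illustration of the difference (a sketch; exact doit(deep=...) semantics can vary with the sympy version):

```python
from sympy import Add, Integer

# Build 1 + 1 + 1 as a nested, unevaluated expression tree
expr = Add(Integer(1), Add(Integer(1), Integer(1), evaluate=False), evaluate=False)

print(expr.doit())            # 3 -- the whole tree is evaluated
print(expr.doit(deep=False))  # nested nodes may stay unevaluated, so this need not reduce to 3
```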
One issue with this is that it will also evaluate integrals, and you might want to prevent that: if the model answers \int_1^3 x to the question "what is the integral from 1 to 3 of x", it should get 0 points, but right now it would get 1, because the integral gets expanded.
I think there is a way to prevent this by a) only evaluating certain types of nodes (sub/div etc.), b) only doing the numeric comparison if the expression doesn't contain an integral or similar (because calling evalf will again expand the integral), or c) somehow making sure that the symbolic equivalence check does not evaluate it and score it as 1 (this one seems super, super hard). TL;DR: it's too much work.
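For example, this is standard sympy behavior: doit() evaluates an unevaluated Integral, so a model that merely restates the question would compare equal to the gold answer:

```python
from sympy import Integral, Integer, Symbol

x = Symbol("x")
restated = Integral(x, (x, 1, 3))      # the model just echoes the integral from the question
print(restated.doit())                 # 4 -- doit() evaluates the integral
print(restated.doit() == Integer(4))   # True, so the restated integral would be scored as correct
```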
overall lgtm
What is this PR about
This PR adds a new metric called extractive match, which is useful for evaluating models in generative mode. Precisely, it can evaluate the following types of outputs:
It partially supports multiple languages, but compared to English the support is limited and should be further extended in the future.
How it works
Each output type has its own regex-based answer search; the extractor cycles over these regexes in priority order and tries to extract and parse an answer from the model generation, stopping at the first successful extraction and parse (see the sketch below). The parsing is done with my extension of the latex2sympy2 library.
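A rough sketch of the priority-driven extraction loop described above (the rule names, regexes, and parsers here are illustrative, not the PR's actual ones):

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ExtractionRule:
    priority: int                    # lower value = tried first
    pattern: re.Pattern              # regex locating a candidate answer
    parse: Callable[[str], object]   # e.g. latex2sympy2-based parsing, or plain string handling

def extract_answer(generation: str, rules: list[ExtractionRule]) -> Optional[object]:
    """Try rules in priority order; stop at the first successful match + parse."""
    for rule in sorted(rules, key=lambda r: r.priority):
        for match in rule.pattern.finditer(generation):
            try:
                return rule.parse(match.group(1))
            except Exception:
                continue  # parsing failed, try the next candidate / rule
    return None

# Example: prefer a boxed LaTeX answer, fall back to a plain-text "answer is ..." pattern.
rules = [
    ExtractionRule(0, re.compile(r"\\boxed\{([^}]*)\}"), str.strip),
    ExtractionRule(1, re.compile(r"answer is\s*([^\n.]+)", re.IGNORECASE), str.strip),
]
print(extract_answer(r"The answer is \boxed{42}.", rules))  # -> '42'
```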
After extraction, strings are simply compared directly, but for sympy results (LaTeX and expr parsing) there is a lot of logic to make sure the comparison is done correctly:
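A hedged sketch of the general shape of such a sympy comparison for scalar expressions (structural equality, then symbolic equivalence, then a numeric fallback; this is my assumption of the idea, not the PR's exact logic):

```python
from sympy import Basic, simplify

def sympy_expr_eq(gold: Basic, pred: Basic, tolerance: float = 1e-6) -> bool:
    """Compare two parsed sympy expressions, from cheapest to most expensive check."""
    if gold == pred:                        # 1) structural equality
        return True
    try:
        if simplify(gold - pred) == 0:      # 2) symbolic equivalence
            return True
    except Exception:
        pass
    try:                                    # 3) numeric fallback when simplification can't decide
        return abs(float(gold.evalf()) - float(pred.evalf())) < tolerance
    except Exception:
        return False
```

Note that steps 2 and 3 would both end up evaluating integrals, which is exactly the caveat raised in the review discussion above.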
Features
Extraction:
Parsing:
Comparison:
How it's tested
There is a multitude of tests, which I collected during the creation of the metric.
Additionally, I tested the math extraction workflow on MATH-HARD generations from the leaderboard (607 models), with the following results:
The new extraction workflow is the most accurate, based on both manual inspection and the quantified results.