
add agieval #1259

Closed · wants to merge 2 commits

Conversation

@Sparkier (Contributor) commented Jan 9, 2024

Adds the AGIEval benchmark.

@haileyschoelkopf (Collaborator)

Thanks very much for the PR! Would you be able to compare numbers from this implementation against the paper, and against this fork of master that people have been using for AGIEval?

@Sparkier (Contributor, Author) commented Jan 9, 2024

I basically converted the code in the fork you mentioned to the new structure of .yaml files. I would be surprised if the results were any different from that (barring the metric code, which I did not check).
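For reference, the conversion means each subtask is described by a harness-style YAML config roughly like the sketch below; the task name, dataset path, and prompt fields here are illustrative placeholders rather than the PR's exact contents.

```yaml
# Illustrative sketch of one AGIEval subtask in the new YAML format.
# Dataset path and prompt fields are placeholders, not the PR's actual values.
task: agieval_aqua_rat
dataset_path: <HF dataset repo for the aqua-rat split>  # placeholder
output_type: multiple_choice
test_split: test
doc_to_text: "{{query}}"       # question plus options rendered as the prompt
doc_to_choice: "{{choices}}"   # list of answer options
doc_to_target: "{{gold}}"      # index of the correct option
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
```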

That being said, what would be needed to get meaningful numbers? I don't really have the capacity to run a ton of models right now.

@haileyschoelkopf (Collaborator)

Even just comparing on gpt2 and on something else small like Qwen/Qwen-1_8B or tinyllama would be useful!

If you don't have the bandwidth, though, no worries--we can pick it up from here. Appreciate the contribution!

@haileyschoelkopf (Collaborator)

Testing this now.

@haileyschoelkopf (Collaborator)

For TinyLlama, comparing against https://github.com/teknium1/LLM-Benchmark-Logs/blob/main/benchmark-logs/TinyLlama-1.1B-intermediate-step-1431k-3T.md :

|             Task             |Version| Metric |Value |   |Stderr|
|------------------------------|------:|--------|-----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |0.1575|±  |0.0229|
|                              |       |acc_norm|0.1693|±  |0.0236|
|agieval_logiqa_en             |      0|acc     |0.2488|±  |0.0170|
|                              |       |acc_norm|0.2934|±  |0.0179|
|agieval_lsat_ar               |      0|acc     |0.2304|±  |0.0278|
|                              |       |acc_norm|0.2043|±  |0.0266|
|agieval_lsat_lr               |      0|acc     |0.2059|±  |0.0179|
|                              |       |acc_norm|0.2353|±  |0.0188|
|agieval_lsat_rc               |      0|acc     |0.1970|±  |0.0243|
|                              |       |acc_norm|0.1710|±  |0.0230|
|agieval_sat_en                |      0|acc     |0.2427|±  |0.0299|
|                              |       |acc_norm|0.1893|±  |0.0274|
|agieval_sat_en_without_passage|      0|acc     |0.2136|±  |0.0286|
|                              |       |acc_norm|0.1942|±  |0.0276|
|agieval_sat_math              |      0|acc     |0.3045|±  |0.0311|
|                              |       |acc_norm|0.2273|±  |0.0283|

And in this PR:

hf (pretrained=TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (2)
|          Tasks          |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------------------|-------|------|-----:|--------|-----:|---|-----:|
|agieval                  |N/A    |none  |     0|acc     |0.2275|±  |0.0341|
|                         |       |none  |     0|acc_norm|0.2407|±  |0.0464|
| - aqua-rat              |Yaml   |none  |     0|acc     |0.1575|±  |0.0229|
|                         |       |none  |     0|acc_norm|0.1732|±  |0.0238|
| - gaokao-biology        |Yaml   |none  |     0|acc     |0.1952|±  |0.0274|
|                         |       |none  |     0|acc_norm|0.2381|±  |0.0295|
| - gaokao-chemistry      |Yaml   |none  |     0|acc     |0.2464|±  |0.0300|
|                         |       |none  |     0|acc_norm|0.3043|±  |0.0321|
| - gaokao-chinese        |Yaml   |none  |     0|acc     |0.2236|±  |0.0266|
|                         |       |none  |     0|acc_norm|0.2073|±  |0.0259|
| - gaokao-english        |Yaml   |none  |     0|acc     |0.2582|±  |0.0251|
|                         |       |none  |     0|acc_norm|0.2288|±  |0.0241|
| - gaokao-geography      |Yaml   |none  |     0|acc     |0.2161|±  |0.0292|
|                         |       |none  |     0|acc_norm|0.2111|±  |0.0290|
| - gaokao-history        |Yaml   |none  |     0|acc     |0.2170|±  |0.0269|
|                         |       |none  |     0|acc_norm|0.2340|±  |0.0277|
| - gaokao-mathqa         |Yaml   |none  |     0|acc     |0.2564|±  |0.0233|
|                         |       |none  |     0|acc_norm|0.2821|±  |0.0241|
| - gaokao-physics        |Yaml   |none  |     0|acc     |0.2350|±  |0.0301|
|                         |       |none  |     0|acc_norm|0.1850|±  |0.0275|
| - logiqa-en             |Yaml   |none  |     0|acc     |0.2488|±  |0.0170|
|                         |       |none  |     0|acc_norm|0.2919|±  |0.0178|
| - logiqa-zh             |Yaml   |none  |     0|acc     |0.2212|±  |0.0163|
|                         |       |none  |     0|acc_norm|0.3057|±  |0.0181|
| - lsat-ar               |Yaml   |none  |     0|acc     |0.2261|±  |0.0276|
|                         |       |none  |     0|acc_norm|0.2000|±  |0.0264|
| - lsat-lr               |Yaml   |none  |     0|acc     |0.2039|±  |0.0179|
|                         |       |none  |     0|acc_norm|0.2333|±  |0.0187|
| - lsat-rc               |Yaml   |none  |     0|acc     |0.1970|±  |0.0243|
|                         |       |none  |     0|acc_norm|0.1710|±  |0.0230|
| - sat-en                |Yaml   |none  |     0|acc     |0.2427|±  |0.0299|
|                         |       |none  |     0|acc_norm|0.1893|±  |0.0274|
| - sat-en-without-passage|Yaml   |none  |     0|acc     |0.2136|±  |0.0286|
|                         |       |none  |     0|acc_norm|0.1942|±  |0.0276|
| - sat-math              |Yaml   |none  |     0|acc     |0.3000|±  |0.0310|
|                         |       |none  |     0|acc_norm|0.2273|±  |0.0283|

|Groups |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------|-------|------|-----:|--------|-----:|---|-----:|
|agieval|N/A    |none  |     0|acc     |0.2275|±  |0.0341|
|       |       |none  |     0|acc_norm|0.2407|±  |0.0464|

Score deviations could potentially be due to differences in batch size / floating-point error; I want to look slightly harder, though.

I also realized that some examples in the AGIEval dataset have since been cleaned up (e.g., for a few questions answer D is marked correct but no option D exists)--see the history in https://github.com/ruixiangcui/AGIEval/commits/main/data/v1 . I therefore want to reupload the data to HF with these errant docs fixed.

@baberabb (Contributor) commented Jan 10, 2024

We should also make a group just for the English subset, as most papers only report that!
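As a rough sketch (the group name and task names are illustrative and may not match what the PR ends up registering), such a group config could look like:

```yaml
# Hypothetical group config collecting only the English AGIEval subtasks;
# names below are assumptions, not the PR's final naming.
group: agieval_en
task:
  - agieval_aqua_rat
  - agieval_logiqa_en
  - agieval_lsat_ar
  - agieval_lsat_lr
  - agieval_lsat_rc
  - agieval_sat_en
  - agieval_sat_en_without_passage
  - agieval_sat_math
```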

@StellaAthena (Member)

@teknium1 Alerting you to incoming official support for AGI Eval. We would very much appreciate it if you could run a couple of models through it and tell us whether you're content with the new implementation.

@haileyschoelkopf Since some numbers match exactly and some don't, it's probably a good idea to run a diff on which questions get a different grade between the two implementations.

@gblazex commented Jan 11, 2024

This is a great PR!

One thing I noticed is that the Nous / @teknium1 flavor of AGI Eval, as well as the fork of this repo by @dmahan93, both seem to be missing the Math test with 1,000 questions (AMC/AIME, English).

Fork's list of tests: https://github.com/dmahan93/lm-evaluation-harness/blob/add-agieval/lm_eval/tasks/agieval.py#L193

Missing 1,000 questions (from original repo):
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/math.jsonl

Original table from the paper (the bottom row is the missing Math test with 1,000 questions):

[image: results table from the AGIEval paper]

Now, this might have been left out on purpose because of its size of 1,000 questions (to cut down on running time).

But it makes comparing AGI Eval scores here with AGI Eval scores reported elsewhere problematic.

(Please note that there are two English math tests: one is SAT Math and the other is the missing AMC/AIME set.)

@haileyschoelkopf (Collaborator)

Thanks for pointing this out! I also noticed this when I was looking at @dmahan93's AGIEval upload-to-HF utility.

I believe the reason this was left out originally is that the MATH eval is not multiple-choice, but I agree that it should be included in our upstreaming.

@dmahan93

Yup, I intentionally skipped it since it isn't multiple-choice and ends up being a direct string comparison, which I have bad vibes about.
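As a toy illustration of the concern (not the harness's or AGIEval's actual postprocessing), plain string equality is sensitive to formatting, and even a light normalization pass can't recognize mathematically equivalent answers written differently:

```python
def naive_exact_match(pred: str, gold: str) -> bool:
    # Direct string comparison: brittle to whitespace, casing,
    # and LaTeX wrappers around otherwise identical answers.
    return pred == gold


def normalized_exact_match(pred: str, gold: str) -> bool:
    # A light normalization pass (strip whitespace and "$", lowercase)
    # catches formatting noise, but "1.5" and "3/2" still don't match.
    def norm(s: str) -> str:
        return s.strip().strip("$").replace(" ", "").lower()
    return norm(pred) == norm(gold)


print(naive_exact_match("3/2 ", "3/2"))       # False: trailing space
print(normalized_exact_match("3/2 ", "3/2"))  # True after normalization
print(normalized_exact_match("1.5", "3/2"))   # False: same value, different form
```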

@gblazex commented Jan 12, 2024

@dmahan93 Totally makes sense.

@haileyschoelkopf Yes, it would make the benchmark complete, and completeness is the best part of this awesome project.

Then people can decide to run it or not.

Maybe a --multi-choice-only flag could help Nous (and others) run their version easily. But a custom task list also covers this use case.

@StellaAthena (Member)

> Maybe a --multi-choice-only flag could help Nous (and others) run their version easily. But a custom task list also covers this use case.

Let's make an agi-eval-nous group to run it easily.

@gblazex commented Jan 20, 2024

I can live with that. And LDJ can too. :)

[screenshot]

@haileyschoelkopf (Collaborator)

Working on this in https://github.com/EleutherAI/lm-evaluation-harness/tree/agieval ! I've added the missing exact-match tasks and their postprocessing logic from the original AGIEval repo; I still need to test a bit more.

@teknium1

Hey all, is there still a need for me? I just saw this.

@haileyschoelkopf (Collaborator)

We just merged this in #1359! Thanks, @Sparkier, for this.

@teknium1

Thanks!
