
add agieval #1259

Closed · wants to merge 2 commits

Conversation

@Sparkier (Contributor) commented Jan 9, 2024

Adds the AGIEval benchmark.

@haileyschoelkopf (Collaborator)

Thanks very much for the PR! Would you be able to compare numbers from this implementation against the paper, and against this fork of master that people have been using for AGIEval?

@Sparkier (Contributor, Author) commented Jan 9, 2024

I basically converted the code in the fork you mentioned to the new structure of .yaml files. I would be surprised if the results were any different from that (barring the metric code, which I did not check).
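For reference, the conversion means each subtask is described by a harness-style YAML config roughly like the sketch below; the task name, dataset path, and prompt fields here are illustrative placeholders rather than the PR's exact contents.

```yaml
# Illustrative sketch of one AGIEval subtask in the new YAML format.
# Dataset path and prompt fields are placeholders, not the PR's actual values.
task: agieval_aqua_rat
dataset_path: <HF dataset repo for the aqua-rat split>  # placeholder
output_type: multiple_choice
test_split: test
doc_to_text: "{{query}}"       # question plus options rendered as the prompt
doc_to_choice: "{{choices}}"   # list of answer options
doc_to_target: "{{gold}}"      # index of the correct option
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
```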

That being said, what would be needed to get meaningful numbers? I don't really have the capacity to run a ton of models right now.

@haileyschoelkopf (Collaborator)

Even just comparing on gpt2 and on something else small like Qwen/Qwen-1_8B or tinyllama would be useful!

If you don't have the bandwidth, though, no worries--we can pick it up from here. Appreciate the contribution!

@haileyschoelkopf (Collaborator)

Testing this now.

@haileyschoelkopf (Collaborator)

For TinyLlama, comparing against https://github.com/teknium1/LLM-Benchmark-Logs/blob/main/benchmark-logs/TinyLlama-1.1B-intermediate-step-1431k-3T.md :

|             Task             |Version| Metric |Value |   |Stderr|
|------------------------------|------:|--------|-----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |0.1575|±  |0.0229|
|                              |       |acc_norm|0.1693|±  |0.0236|
|agieval_logiqa_en             |      0|acc     |0.2488|±  |0.0170|
|                              |       |acc_norm|0.2934|±  |0.0179|
|agieval_lsat_ar               |      0|acc     |0.2304|±  |0.0278|
|                              |       |acc_norm|0.2043|±  |0.0266|
|agieval_lsat_lr               |      0|acc     |0.2059|±  |0.0179|
|                              |       |acc_norm|0.2353|±  |0.0188|
|agieval_lsat_rc               |      0|acc     |0.1970|±  |0.0243|
|                              |       |acc_norm|0.1710|±  |0.0230|
|agieval_sat_en                |      0|acc     |0.2427|±  |0.0299|
|                              |       |acc_norm|0.1893|±  |0.0274|
|agieval_sat_en_without_passage|      0|acc     |0.2136|±  |0.0286|
|                              |       |acc_norm|0.1942|±  |0.0276|
|agieval_sat_math              |      0|acc     |0.3045|±  |0.0311|
|                              |       |acc_norm|0.2273|±  |0.0283|

And in this PR:

hf (pretrained=TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (2)
|          Tasks          |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------------------|-------|------|-----:|--------|-----:|---|-----:|
|agieval                  |N/A    |none  |     0|acc     |0.2275|±  |0.0341|
|                         |       |none  |     0|acc_norm|0.2407|±  |0.0464|
| - aqua-rat              |Yaml   |none  |     0|acc     |0.1575|±  |0.0229|
|                         |       |none  |     0|acc_norm|0.1732|±  |0.0238|
| - gaokao-biology        |Yaml   |none  |     0|acc     |0.1952|±  |0.0274|
|                         |       |none  |     0|acc_norm|0.2381|±  |0.0295|
| - gaokao-chemistry      |Yaml   |none  |     0|acc     |0.2464|±  |0.0300|
|                         |       |none  |     0|acc_norm|0.3043|±  |0.0321|
| - gaokao-chinese        |Yaml   |none  |     0|acc     |0.2236|±  |0.0266|
|                         |       |none  |     0|acc_norm|0.2073|±  |0.0259|
| - gaokao-english        |Yaml   |none  |     0|acc     |0.2582|±  |0.0251|
|                         |       |none  |     0|acc_norm|0.2288|±  |0.0241|
| - gaokao-geography      |Yaml   |none  |     0|acc     |0.2161|±  |0.0292|
|                         |       |none  |     0|acc_norm|0.2111|±  |0.0290|
| - gaokao-history        |Yaml   |none  |     0|acc     |0.2170|±  |0.0269|
|                         |       |none  |     0|acc_norm|0.2340|±  |0.0277|
| - gaokao-mathqa         |Yaml   |none  |     0|acc     |0.2564|±  |0.0233|
|                         |       |none  |     0|acc_norm|0.2821|±  |0.0241|
| - gaokao-physics        |Yaml   |none  |     0|acc     |0.2350|±  |0.0301|
|                         |       |none  |     0|acc_norm|0.1850|±  |0.0275|
| - logiqa-en             |Yaml   |none  |     0|acc     |0.2488|±  |0.0170|
|                         |       |none  |     0|acc_norm|0.2919|±  |0.0178|
| - logiqa-zh             |Yaml   |none  |     0|acc     |0.2212|±  |0.0163|
|                         |       |none  |     0|acc_norm|0.3057|±  |0.0181|
| - lsat-ar               |Yaml   |none  |     0|acc     |0.2261|±  |0.0276|
|                         |       |none  |     0|acc_norm|0.2000|±  |0.0264|
| - lsat-lr               |Yaml   |none  |     0|acc     |0.2039|±  |0.0179|
|                         |       |none  |     0|acc_norm|0.2333|±  |0.0187|
| - lsat-rc               |Yaml   |none  |     0|acc     |0.1970|±  |0.0243|
|                         |       |none  |     0|acc_norm|0.1710|±  |0.0230|
| - sat-en                |Yaml   |none  |     0|acc     |0.2427|±  |0.0299|
|                         |       |none  |     0|acc_norm|0.1893|±  |0.0274|
| - sat-en-without-passage|Yaml   |none  |     0|acc     |0.2136|±  |0.0286|
|                         |       |none  |     0|acc_norm|0.1942|±  |0.0276|
| - sat-math              |Yaml   |none  |     0|acc     |0.3000|±  |0.0310|
|                         |       |none  |     0|acc_norm|0.2273|±  |0.0283|

|Groups |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------|-------|------|-----:|--------|-----:|---|-----:|
|agieval|N/A    |none  |     0|acc     |0.2275|±  |0.0341|
|       |       |none  |     0|acc_norm|0.2407|±  |0.0464|

Score deviations could potentially be due to differences in batch size / floating-point error; I want to look slightly harder, though.

I also realized that some examples in the AGIEval dataset have since been cleaned up (e.g., for a few questions answer D is marked correct but no option D exists)--see the history in https://github.com/ruixiangcui/AGIEval/commits/main/data/v1 . I therefore want to reupload the data to HF with these errant docs fixed.

@baberabb (Contributor) commented Jan 10, 2024

We should also make a group just for the English subset, as most papers only report that!
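As a rough sketch (the group name and task names are illustrative and may not match what the PR ends up registering), such a group config could look like:

```yaml
# Hypothetical group config collecting only the English AGIEval subtasks;
# names below are assumptions, not the PR's final naming.
group: agieval_en
task:
  - agieval_aqua_rat
  - agieval_logiqa_en
  - agieval_lsat_ar
  - agieval_lsat_lr
  - agieval_lsat_rc
  - agieval_sat_en
  - agieval_sat_en_without_passage
  - agieval_sat_math
```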

@StellaAthena (Member)

@teknium1 Alerting you to incoming official support for AGI Eval. We would very much appreciate it if you could run a couple of models through it and tell us whether you're content with the new implementation.

@haileyschoelkopf Since some numbers match exactly and some don't, it's probably a good idea to run a diff on which questions get a different grade between the two implementations.

@gblazex commented Jan 11, 2024

This is a great PR!

One thing I noticed is that the Nous / @teknium1 flavor of AGI Eval, as well as the fork of this repo by @dmahan93, both seem to be missing the Math test with 1,000 questions (AMC/AIME, English).

Fork's list of tests: https://github.com/dmahan93/lm-evaluation-harness/blob/add-agieval/lm_eval/tasks/agieval.py#L193

Missing 1,000 questions (from original repo):
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/math.jsonl

Original table from the paper (the bottom row is the missing Math test with 1,000 questions):

[image: results table from the AGIEval paper]

Now, this might have been left out on purpose because of its size of 1,000 questions (to cut down on running time).

But it makes comparing AGI Eval scores here with AGI Eval scores reported elsewhere problematic.

(Please note that there are two English math tests: one is SAT Math and the other is the missing AMC/AIME set.)

@haileyschoelkopf (Collaborator)

Thanks for pointing this out! I also noticed this when I was looking at @dmahan93's AGIEval upload-to-HF utility.

I believe the reason this was left out originally is that the MATH eval is not multiple-choice, but I agree that it should be included in our upstreaming.

@dmahan93

Yup, I intentionally skipped it since it isn't multiple-choice and ends up being a direct string comparison, which I have bad vibes about.
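As a toy illustration of the concern (not the harness's or AGIEval's actual postprocessing), plain string equality is sensitive to formatting, and even a light normalization pass can't recognize mathematically equivalent answers written differently:

```python
def naive_exact_match(pred: str, gold: str) -> bool:
    # Direct string comparison: brittle to whitespace, casing,
    # and LaTeX wrappers around otherwise identical answers.
    return pred == gold


def normalized_exact_match(pred: str, gold: str) -> bool:
    # A light normalization pass (strip whitespace and "$", lowercase)
    # catches formatting noise, but "1.5" and "3/2" still don't match.
    def norm(s: str) -> str:
        return s.strip().strip("$").replace(" ", "").lower()
    return norm(pred) == norm(gold)


print(naive_exact_match("3/2 ", "3/2"))       # False: trailing space
print(normalized_exact_match("3/2 ", "3/2"))  # True after normalization
print(normalized_exact_match("1.5", "3/2"))   # False: same value, different form
```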

@gblazex commented Jan 12, 2024

@dmahan93 Totally makes sense.

@haileyschoelkopf Yes, it would make the benchmark complete, and completeness is the best part of this awesome project.

Then people can decide to run it or not.

Maybe a --multi-choice-only flag could help Nous (and others) run their version easily. But a custom task list also covers this use case.

@StellaAthena (Member)

> Maybe a --multi-choice-only flag could help Nous (and others) run their version easily. But a custom task list also covers this use case.

Let's make an agi-eval-nous group to run it easily.

@gblazex commented Jan 20, 2024

I can live with that. And LDJ can too. :)

[screenshot]

@haileyschoelkopf (Collaborator)

Working on this in https://github.com/EleutherAI/lm-evaluation-harness/tree/agieval ! I've added the missing exact-match tasks and their postprocessing logic from the original AGIEval repo; I still need to test a bit more.

@teknium1

Hey all, is there still a need for me? I just saw this.

@haileyschoelkopf (Collaborator)

We just merged this in #1359! Thanks, @Sparkier, for this.

@teknium1

Thanks!
