add agieval #1259
Conversation
Thanks very much for the PR! Would you be able to compare any numbers between this implementation and the paper, and against this fork of master that people have been using for AGIEval?
I basically converted the code in the fork you mentioned to the new structure of the harness. That being said, what would be needed to get meaningful numbers? I don't really have the capacity to run a ton of models right now.
Even just comparing on a single small model would help. If you don't have the bandwidth though, no worries, we can pick it up from here--appreciate the contribution!
Testing this now.
For TinyLlama, comparing against https://github.com/teknium1/LLM-Benchmark-Logs/blob/main/benchmark-logs/TinyLlama-1.1B-intermediate-step-1431k-3T.md :
And in this PR:
Score deviations could potentially be due to differences in batch size / floating point errors; I want to look slightly harder though. I also realized that the AGIEval dataset has had some examples cleaned up (e.g., for a few, answer D is marked correct but no option D exists)--see the history in https://github.com/ruixiangcui/AGIEval/commits/main/data/v1 . I therefore want to reupload the data to HF with these errant docs fixed.
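For anyone who wants to double-check the reupload, here is a minimal sketch of how those errant docs could be located, assuming each record in the v1 JSONL files carries an `options` list and a single-letter `label` as in the original AGIEval repo (field names can differ per task, so treat this as illustrative):

```python
# Sketch: flag records whose gold label points past the end of the option list.
# Assumes "options" (list of strings) and "label" (letter) fields per record.
import json
import string

def find_errant_docs(path):
    """Yield (line_no, record) where the gold label has no matching option."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            rec = json.loads(line)
            options = rec.get("options") or []
            label = rec.get("label")
            if not isinstance(label, str) or label not in string.ascii_uppercase:
                continue
            gold_idx = string.ascii_uppercase.index(label)
            if gold_idx >= len(options):
                yield line_no, rec

# Example usage (the file path is illustrative):
for line_no, rec in find_errant_docs("data/v1/lsat-ar.jsonl"):
    print(line_no, rec.get("label"), len(rec.get("options", [])))
```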
Should also make a group just for the English subset, as most papers only report that!
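For reference, a rough sketch of what an English-only group might cover, using the task names from the Nous fork's English subset as an assumption (the names ultimately merged into the harness may differ):

```python
# Assumed task names for an English-only AGIEval group, mirroring the
# Nous fork's English subset; these names are placeholders.
AGIEVAL_EN_TASKS = [
    "agieval_aqua_rat",
    "agieval_logiqa_en",
    "agieval_lsat_ar",
    "agieval_lsat_lr",
    "agieval_lsat_rc",
    "agieval_sat_en",
    "agieval_sat_en_without_passage",
    "agieval_sat_math",
]

# e.g. running just this subset from the CLI (the "agieval_en" group name is hypothetical):
#   lm_eval --model hf \
#       --model_args pretrained=TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
#       --tasks agieval_en
```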
@teknium1 Alerting you to incoming official support for AGIEval. We would very much appreciate it if you could run a couple of models through it and tell us if you're content with the new implementation. @haileyschoelkopf Since some numbers match exactly and some don't, it's probably a good idea to run a diff on which questions get a different grade between the two implementations.
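A quick sketch of what that diff could look like, assuming both runs dumped per-sample results to JSONL (e.g. via --log_samples) with a `doc_id` and an `acc` field per record; the file names and field names here are placeholders and depend on the harness version:

```python
# Sketch: compare per-question grades between two sample logs.
import json

def load_grades(path):
    """Map doc_id -> per-sample accuracy from a JSONL sample log."""
    grades = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            grades[rec["doc_id"]] = rec["acc"]
    return grades

# Hypothetical file names for the two runs being compared.
old = load_grades("samples_agieval_sat_en_fork.jsonl")
new = load_grades("samples_agieval_sat_en_this_pr.jsonl")

for doc_id in sorted(set(old) & set(new)):
    if old[doc_id] != new[doc_id]:
        print(f"doc {doc_id}: fork={old[doc_id]} pr={new[doc_id]}")
```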
This is a great PR! One thing I noticed is that the Nous / @teknium1 fork of AGIEval runs this list of tests: https://github.com/dmahan93/lm-evaluation-harness/blob/add-agieval/lm_eval/tasks/agieval.py#L193 . It is missing 1,000 questions from the original repo: in the original table from the paper, the bottom row is the missing Math test with 1,000 questions. Now, this might've been on purpose because of the size of 1,000 questions (to cut down on running time), but it makes comparing AGIEval scores to the AGIEval scores already out there problematic. (Please note that there are two English math tests: one is SAT and the other is the missing AMC/AIME.)
Thanks for pointing this out! I also noticed this when I was looking at @dmahan93's AGIEval upload-to-HF utility. I believe this was probably left out originally because the MATH eval is not multiple-choice, but I agree that it should be included in our upstreaming.
Yup, intentionally skipped it since it wasn't multiple-choice and ends up as a direct string comparison, which I have bad vibes about.
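For context, direct string comparison only behaves sensibly if both sides are normalized first; a minimal illustrative sketch of that kind of cleanup (not the actual postprocessing code in this PR or in the original repo) might look like:

```python
# Sketch: normalize free-form math answers before exact-match comparison.
import re

def normalize_answer(text: str) -> str:
    """Strip whitespace, case, stray LaTeX dollars, and boilerplate prefixes."""
    text = text.strip().lower()
    text = text.replace(" ", "")
    text = text.rstrip(".")                       # trailing period from generation
    text = re.sub(r"^\$|\$$", "", text)           # stray $...$ wrappers
    text = re.sub(r"^(the)?answer(is|:)?", "", text)
    return text

# "$ 3/4 $" and "3/4" now compare equal, while a raw == would call them different.
assert normalize_answer("$ 3/4 $") == normalize_answer("3/4")
```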
@dmahan93 Totally makes sense. @haileyschoelkopf Then people can decide to run it or not; maybe a --multi-choice-only flag could help Nous (and others) run their version easily.
Let's make an
Working on this in https://github.com/EleutherAI/lm-evaluation-harness/tree/agieval ! Added the missing exact-match tasks and their postprocessing logic from the original AGIEval repo; still need to test a bit more.
Hey all, is there still a need for me? I just saw this.
Thanks!
Adds the AGIEval benchmark.