Anomalously small values for gemma-2b-it on GSM8K #82
Comments
FYI: for the 7B model I am seeing a lot of truncated responses.
For GSM8K, we constrained the evaluation to match the harness we are launching on the leaderboard, so changing the prompt would break that match. However, maybe we could add an "instruct" parameter? It's fascinating that instruct models become less good at following few-shot formatting!
Edit: I checked, and the truncation used in the harness has evolved since the version above. I'm going to edit the allowed EOS tokens to fix this.
The truncation used in the harness has evolved quite a lot across versions. The most up-to-date one is: […]
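For concreteness, here is a minimal sketch of the kind of fix described in these comments, not the harness's actual code: allowing Gemma's `<end_of_turn>` chat delimiter as an additional end-of-sequence token, so generations stop cleanly at the turn boundary instead of being truncated or running on. The model id, prompt, and generation settings are assumptions for illustration.

```python
# Sketch only: not the harness's actual implementation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"  # assumed model, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 12 * 7?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")

# Allow both the regular EOS and the chat-template turn delimiter, so the
# completion ends at the end of the model's turn.
eos_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<end_of_turn>"),
]
outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=eos_ids)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```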
I noticed that the instruct version of `gemma-2b` gets anomalously small values on GSM8K. Here's the command I'm running:

with `--use_chat_template`:

without `--use_chat_template`:

For reference, the base model gets ~0.174, which is far better.
I think part of the problem is that GSM8K expects the answer to be formatted as `#### {ANSWER}`, and the instruct models are quite inconsistent in this respect because they haven't been told to do so. Here's an instructive example where the model produces the correct answer but would be scored 0 because it didn't predict `#### {ANSWER}`.
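To make the failure mode concrete, here is a minimal sketch of `#### {ANSWER}`-style extraction; the helper name and regex are assumptions for illustration, not the evaluation's exact parsing code.

```python
import re

def extract_gsm8k_answer(completion: str):
    """Return the number after '#### ', or None if the marker is absent.
    Illustrative sketch; not the evaluation's exact extraction logic."""
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", completion)
    if match is None:
        return None
    return match.group(1).replace(",", "")

# A response with the expected marker extracts cleanly and can match gold...
good = "6 dozen eggs is 6 * 12 = 72 eggs.\n#### 72"
# ...while a correct answer without the marker extracts nothing and scores 0.
bad = "6 dozen eggs is 6 * 12 = 72 eggs, so the answer is 72."

print(extract_gsm8k_answer(good))  # '72'
print(extract_gsm8k_answer(bad))   # None
```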
Perhaps one solution would be to format the input like GPQA does, i.e. tell the model explicitly how to format its final answer (a sketch follows below).
You can see in this example that the 7B instruct model formats the answer correctly: https://hf.co/chat/r/ltNE54h
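As a sketch of the GPQA-style suggestion above, the prompt could spell out the expected answer format so instruct models emit the `#### {ANSWER}` line; the instruction wording and wrapper are assumptions, not the actual GPQA template.

```python
# Hypothetical prompt wrapper; the instruction wording is an assumption,
# not the actual GPQA template.
ANSWER_FORMAT_INSTRUCTION = (
    "Think step by step, then write your final answer on its own line "
    "in the form '#### <answer>'."
)

def format_gsm8k_prompt(question: str) -> str:
    """Append an explicit answer-format instruction to the question."""
    return f"{question}\n\n{ANSWER_FORMAT_INSTRUCTION}"

print(format_gsm8k_prompt(
    "Janet has 3 apples and buys 4 more. How many apples does she have?"
))
```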