
Anomalously small values for gemma-2b-it on GSM8k #82

Closed
lewtun opened this issue Mar 2, 2024 · 4 comments · Fixed by #85
Labels
bug Something isn't working

Comments

lewtun commented Mar 2, 2024

I noticed that the instruct version of gemma-2b gets anomalously small values on GSM8k. Here's the command I'm running:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|gsm8k|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=google/gemma-2b-it" \
    --override_batch_size 1

with --use_chat_template

Task               Version  Metric  Value   Stderr
lighteval:gsm8k:5  0        qem     0.0341  ± 0.005

without --use_chat_template

Task               Version  Metric  Value   Stderr
lighteval:gsm8k:5  0        qem     0.0553  ± 0.0063

For reference, the base model gets ~0.174 which is far better.

I think part of the problem is that GSM8k expects the answer to be formatted as #### {ANSWER}, and the instruct models are quite inconsistent in this respect because they haven't been told to do so.

Here's an instructive example where the model produces the correct answer, but would be scored 0 because it didn't predict #### {ANSWER}:

Prompt: Question: A pet store currently has 5 dogs, 2 cats, and 10 birds. How many legs in total do the pets in the store have? Answer:

Completion: There are 5 dogs * 4 legs/dog + 2 cats * 4 legs/cat + 10 birds * 2 legs/bird = 20 legs/dog + 8 legs/cat + 20 legs/bird. So, the total number of legs in the store is 20 + 8 + 20 = 48 legs.

Ground truth: The dogs have 5 dogs * 4 legs/dog = <<5*4=20>>20 legs. The cats have 2 cats * 4 legs/cat = <<2*4=8>>8 legs. The birds have 10 birds * 2 legs/bird = <<10*2=20>>20 legs. The pets have 20 legs + 8 legs + 20 legs = <<20+8+20=48>>48 legs. #### 48

Perhaps one solution would be to format the input like GPQA does:

Here are some example questions from experts. Format your final response with: "#### {insert answer here}"

Question: {few_shot_q}
Answer: {few_shot_a}
#### {answer}

... N few shot examples

What is the correct answer to this question: A pet store currently has 5 dogs, 2 cats, and 10 birds. How many legs in total do the pets in the store have?
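A rough sketch of how such a prompt could be assembled (the `build_instruct_prompt` helper, its signature, and the exact shot format are all assumptions for illustration, not lighteval code):

```python
def build_instruct_prompt(few_shots, question):
    """Build a GPQA-style few-shot prompt that states the required
    answer format up front. Each shot is a (question, answer, final) tuple."""
    header = (
        "Here are some example questions from experts. "
        'Format your final response with: "#### {insert answer here}"\n\n'
    )
    shots = "\n\n".join(
        f"Question: {q}\nAnswer: {a}\n#### {final}"
        for q, a, final in few_shots
    )
    query = (
        "\n\nWhat is the correct answer to this question: "
        f"{question}\nAnswer:"
    )
    return header + shots + query

prompt = build_instruct_prompt(
    [("What is 2 + 2?", "2 + 2 = 4.", "4")],
    "How many legs do 5 dogs have?",
)
```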

You can see in this example that the 7B instruct model formats the answer correctly: https://hf.co/chat/r/ltNE54h

lewtun commented Mar 2, 2024

FYI, for the 7B model I am seeing a lot of truncated responses like "Sure, here is the solution", which also suggests we are losing candidate answers during parsing.

@clefourrier
Member

For GSM8K, we constrained the evaluation to match the harness version we run on the leaderboard, and changing the prompt would break that parity. However, maybe we could add an "instruct" parameter?

It's fascinating that instruct models become worse at following few-shot formatting!

@clefourrier
Member

Edit: checked, and the truncation used in the harness has evolved since the version above. I'm going to edit the allowed EOS tokens to fix this.

NathanHB commented Mar 4, 2024

The truncation used in the harness has evolved quite a lot across versions. The most up-to-date stop sequences are: ["\n\n", "Question:"]
See here
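For illustration, truncating a raw generation at those stop sequences could be sketched like this (a hypothetical `truncate_at_stop` helper mirroring the behaviour described above, not the harness's actual implementation):

```python
def truncate_at_stop(generation: str, stop_sequences=("\n\n", "Question:")) -> str:
    """Cut a raw generation at the earliest occurrence of any stop
    sequence, so trailing few-shot continuations are discarded."""
    cut = len(generation)
    for stop in stop_sequences:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

# The answer survives; the spurious next 'Question:' block is dropped:
print(truncate_at_stop("#### 48\n\nQuestion: A farmer has..."))  # '#### 48'
```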

@NathanHB NathanHB added the bug Something isn't working label Mar 4, 2024
@NathanHB NathanHB linked a pull request Mar 4, 2024 that will close this issue