
Anomalously small values for gemma-2b-it on GSM8k #82

Closed
lewtun opened this issue Mar 2, 2024 · 4 comments · Fixed by #85
Labels
bug Something isn't working

Comments

lewtun commented Mar 2, 2024

I noticed that the instruct version of gemma-2b gets anomalously small values on GSM8k. Here's the command I'm running:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|gsm8k|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=google/gemma-2b-it" \
    --override_batch_size 1

with --use_chat_template

Task               Version  Metric  Value   Stderr
lighteval:gsm8k:5  0        qem     0.0341  ± 0.005

without --use_chat_template

Task               Version  Metric  Value   Stderr
lighteval:gsm8k:5  0        qem     0.0553  ± 0.0063

For reference, the base model gets ~0.174 which is far better.

I think part of the problem is that GSM8k expects the answer to be formatted as #### {ANSWER}, and the instruct models are quite inconsistent in this respect because they haven't been told to do so.

Here's an instructive example where the model produces the correct answer, but would be scored 0 because it didn't predict #### {ANSWER}:

Prompt: Question: A pet store currently has 5 dogs, 2 cats, and 10 birds. How many legs in total do the pets in the store have? Answer:

Completion: There are 5 dogs * 4 legs/dog + 2 cats * 4 legs/cat + 10 birds * 2 legs/bird = 20 legs/dog + 8 legs/cat + 20 legs/bird. So, the total number of legs in the store is 20 + 8 + 20 = 48 legs.

Ground truth: The dogs have 5 dogs * 4 legs/dog = <<5*4=20>>20 legs. The cats have 2 cats * 4 legs/cat = <<2*4=8>>8 legs. The birds have 10 birds * 2 legs/bird = <<10*2=20>>20 legs. The pets have 20 legs + 8 legs + 20 legs = <<20+8+20=48>>48 legs. #### 48

Perhaps one solution would be to format the input like GPQA does:

Here are some example questions from experts. Format your final response with: "#### {insert answer here}"

Question: {few_shot_q}
Answer: {few_shot_a}
#### {answer}

... N few shot examples

What is the correct answer to this question: A pet store currently has 5 dogs, 2 cats, and 10 birds. How many legs in total do the pets in the store have?
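A rough sketch of how such a prompt could be assembled (the `build_instruct_prompt` helper, its signature, and the exact shot format are all assumptions for illustration, not lighteval code):

```python
def build_instruct_prompt(few_shots, question):
    """Build a GPQA-style few-shot prompt that states the required
    answer format up front. Each shot is a (question, answer, final) tuple."""
    header = (
        "Here are some example questions from experts. "
        'Format your final response with: "#### {insert answer here}"\n\n'
    )
    shots = "\n\n".join(
        f"Question: {q}\nAnswer: {a}\n#### {final}"
        for q, a, final in few_shots
    )
    query = (
        "\n\nWhat is the correct answer to this question: "
        f"{question}\nAnswer:"
    )
    return header + shots + query

prompt = build_instruct_prompt(
    [("What is 2 + 2?", "2 + 2 = 4.", "4")],
    "How many legs do 5 dogs have?",
)
```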

You can see in this example that the 7B instruct model formats the answer correctly: https://hf.co/chat/r/ltNE54h

lewtun commented Mar 2, 2024

FYI, for the 7B model I am seeing a lot of truncated responses like "Sure, here is the solution", which also suggests we are losing candidate answers during parsing.

@clefourrier
Member

For GSM8K, we constrained the evaluation to match the harness version we run on the leaderboard, and changing the prompt would break that parity. However, maybe we could add an "instruct" parameter?

It's fascinating that instruct models become worse at following few-shot formatting!

@clefourrier
Member

Edit: checked, and the truncation used in the harness has evolved since the version above. I'm going to edit the allowed EOS tokens to fix this.

NathanHB commented Mar 4, 2024

The truncation used in the harness has evolved quite a lot across versions. The most up-to-date stop sequences are: ["\n\n", "Question:"]
See here
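For illustration, truncating a raw generation at those stop sequences could be sketched like this (a hypothetical `truncate_at_stop` helper mirroring the behaviour described above, not the harness's actual implementation):

```python
def truncate_at_stop(generation: str, stop_sequences=("\n\n", "Question:")) -> str:
    """Cut a raw generation at the earliest occurrence of any stop
    sequence, so trailing few-shot continuations are discarded."""
    cut = len(generation)
    for stop in stop_sequences:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

# The answer survives; the spurious next 'Question:' block is dropped:
print(truncate_at_stop("#### 48\n\nQuestion: A farmer has..."))  # '#### 48'
```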

@NathanHB NathanHB added the bug Something isn't working label Mar 4, 2024
@NathanHB NathanHB linked a pull request Mar 4, 2024 that will close this issue