
eval_mmlu ? #136

Open
mistletoe1024 opened this issue Dec 16, 2024 · 11 comments

Comments

@mistletoe1024

mistletoe1024 commented Dec 16, 2024

No description provided.

@mistletoe1024
Author

Bug fixed.

@mobicham
Collaborator

Hi, sorry for the delay. What was the problem again?

@mistletoe1024
Author

Hi, I'd like to evaluate the performance of HQQ+ on MMLU, so I wrote an eval_mmlu() after eval_wikitext2() in hqq_plus.py, referring to https://github.com/FranxYao/chain-of-thought-hub/blob/main/MMLU/run_mmlu_llama.py, but I got a bad average score (only 0.0438 :( while FP16 is 0.4583). My quantization setting is "nbits=2, group_size=8, quant_scale=False, quant_zero=False, axis=0"; lora_params and training_args were not changed. Could you offer a better solution or any advice? Thank you!
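For reference, this is roughly how that setting would be expressed in code (a minimal sketch, assuming hqq's BaseQuantizeConfig API as used in the examples; only the values quoted above are intentional here):

```python
# Minimal sketch, assuming hqq's BaseQuantizeConfig API.
from hqq.core.quantize import BaseQuantizeConfig

quant_config = BaseQuantizeConfig(
    nbits=2,           # 2-bit weights
    group_size=8,      # small groups for very low-bit quantization
    quant_scale=False,
    quant_zero=False,
    axis=0,
)
```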


Today I printed the pred_answer of eval_mmlu() after eval_wikitext2() and found lots of '\n' entries in the pred_answer list. I thought I had solved the problem, but I still get a bad MMLU score (0.0342, while the ppl is 5.601, which seems normal). So could you share your solution?
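A tiny sketch of one possible cleanup (the clean_pred helper below is hypothetical, not part of run_mmlu_llama.py): strip whitespace and keep only the first A-D letter, so '\n' or other stray characters never end up in pred_answer.

```python
import re

# Hypothetical helper: extract the first A-D letter from the raw generated text.
def clean_pred(raw: str) -> str:
    m = re.search(r"[ABCD]", raw.strip())
    return m.group(0) if m else ""

# e.g. clean_pred("\n A) ...") -> "A", clean_pred("[") -> ""
```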

@mobicham
Collaborator

Oh I see, which model exactly? Is this an issue with MMLU only, or also with the other metrics?

@mistletoe1024
Author

Oh I see, which model exactly? Is this an issue with MMLU only, or also with the other metrics?

Llama-2-7b-hf. Only MMLU at present; the ppl is normal.

@mobicham
Collaborator

Did you train the model yourself? How did you train it (what dataset, etc.)?

@mistletoe1024
Author

Did you train the model yourself? How did you train it (what dataset, etc.)?

Yes, but I just ran hqq_plus.py and did not change any params, so the dataset is dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train').

@mistletoe1024
Author

mistletoe1024 commented Dec 18, 2024

The key to the question is the eval_mmlu() function: I printed the elements of answers and saw "[", where it should have been "A", "B", "C", or "D" (https://github.com/FranxYao/chain-of-thought-hub/blob/main/MMLU/run_mmlu_llama.py, line 172).
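One alternative that avoids this entirely (a hedged sketch, not the exact logic of run_mmlu_llama.py): score only the four option letters from the next-token logits instead of decoding free-form generated text, so tokens like '\n' or '[' can never be returned as the answer.

```python
import torch

# Hedged sketch: pick the MMLU choice by comparing logits of the four option
# letters at the last prompt position (model/tokenizer are assumed to be a
# Hugging Face causal LM and its tokenizer).
@torch.no_grad()
def predict_choice(model, tokenizer, prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    choices = ["A", "B", "C", "D"]
    # The leading space matters for Llama-style tokenizers.
    choice_ids = [tokenizer.encode(" " + c, add_special_tokens=False)[-1] for c in choices]
    return choices[int(torch.argmax(next_token_logits[choice_ids]))]
```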

@mobicham
Collaborator

Sounds like it's overfitting to wikitext; that might explain why the ppl is good but MMLU is bad. You need proper instruct data (and the instruct model, not the base model) to get good MMLU performance, which is outside the scope of the hqq+ script.

@mistletoe1024
Author

Sounds like it's overfitting to wikitext; that might explain why the ppl is good but MMLU is bad. You need proper instruct data (and the instruct model, not the base model) to get good MMLU performance, which is outside the scope of the hqq+ script.

OK, I am going to try other datasets or an instruct model. Thank you :)

@mobicham
Collaborator

It's normally a mix of datasets, not just one. It's an art to figure out the right datasets :D
Let me know how that works out!
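For reference, one way to build such a mix with the Hugging Face datasets library (a hedged sketch; the instruct set and ratios below are placeholders, not datasets recommended in this thread):

```python
from datasets import load_dataset, interleave_datasets

# Map both sources to a single 'text' column so they can be interleaved.
wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

# Placeholder instruct set; the actual choice of datasets is what needs tuning.
instruct = load_dataset("yahma/alpaca-cleaned", split="train")
instruct = instruct.map(
    lambda x: {"text": x["instruction"] + "\n" + x["input"] + "\n" + x["output"]},
    remove_columns=instruct.column_names,
)

# Sample from both sources with fixed probabilities (placeholder ratios).
mixed = interleave_datasets([wiki, instruct], probabilities=[0.3, 0.7], seed=42)
```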
