
Change the eos condition for GSM8K #85

Merged: 10 commits into main on Mar 6, 2024
Conversation

clefourrier (Member) commented Mar 4, 2024

Will likely require updating the test suite; pending until we figure out the datasets bug in the CI.
Linked to #82

clefourrier (Member, Author) commented:

@NathanHB wdyt of having 6 tasks called `leaderboard|task|...`?
That way we could differentiate the modifications we make for more general setups from the pinned leaderboard versions.
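
For concreteness, here is a hedged sketch of what the two task specifications could look like, following the `suite|task|num_fewshot|truncate_fewshot` pattern that comes up later in this thread. The task name and few-shot count below are illustrative assumptions, not the exact values from this PR:

```python
# Hypothetical task specifications -- the task name and few-shot count are
# illustrative assumptions following the suite|task|num_fewshot|truncate_fewshot pattern.
PINNED_LEADERBOARD_TASK = "leaderboard|gsm8k|5|0"  # pinned: reproduces Open LLM Leaderboard scores
UPDATED_LIGHTEVAL_TASK = "lighteval|gsm8k|5|0"     # updated: follows lighteval's own EOS logic
```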

NathanHB (Member) commented Mar 4, 2024

> @NathanHB wdyt of having 6 tasks called `leaderboard|task|...`? That way we could differentiate the modifications we make for more general setups from the pinned leaderboard versions.

Oh, good idea, though I'm not sure anyone will use the other versions if we have the leaderboard versions. We want to be able to compare results with as many models as possible.

clefourrier (Member, Author) commented:

Well, at the moment, the leaderboard versions use a pinned, very old version of the harness, which led to the problems mentioned by @lewtun (for EOS tokens, for example).
I think we should both address these problems and provide a cool version of our evals, but also allow people to reproduce leaderboard scores, wdyt?

NathanHB (Member) commented Mar 4, 2024

I agree!

NathanHB linked an issue Mar 4, 2024 that may be closed by this pull request
clefourrier requested a review from NathanHB March 4, 2024 15:20
lewtun (Member) commented Mar 4, 2024

> Well, at the moment, the leaderboard versions use a pinned, very old version of the harness, which led to the problems mentioned by @lewtun (for EOS tokens, for example). I think we should both address these problems and provide a cool version of our evals, but also allow people to reproduce leaderboard scores, wdyt?

Just so I understand: in the new format there is `leaderboard|tasks|num_fewshot|0`, but will `lighteval` still be a valid suite for e.g. `gsm8k`?

In other words, the `leaderboard` suite keeps the same logic as the old pinned version of the harness, but `lighteval` will have various improvements, etc.?

clefourrier (Member, Author) commented:

@lewtun You understood perfectly! `leaderboard|task` should allow you to reproduce the current scores of the Open LLM Leaderboard. `lighteval|task` will follow our own logic for the task, in terms of EOS tokens and generation length.
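
To make the EOS discussion concrete, here is a minimal sketch of what a stop condition for GSM8K generations could look like. The stop strings and EOS token below are illustrative assumptions, not lighteval's actual configuration:

```python
# Minimal sketch of an EOS/stop-sequence check for GSM8K generations.
# The stop strings and EOS token are illustrative assumptions, not lighteval's real config.
GSM8K_STOP_SEQUENCES = ["Question:", "\n\n"]  # e.g. stop when the model starts a new problem

def should_stop(generated_text: str, eos_token: str = "</s>") -> bool:
    """Return True once the generation contains the EOS token or any stop sequence."""
    if eos_token in generated_text:
        return True
    return any(stop in generated_text for stop in GSM8K_STOP_SEQUENCES)
```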

clefourrier merged commit 9b3813f into main on Mar 6, 2024; 2 checks passed.
Development

Successfully merging this pull request may close these issues.

Anomalously small values gemma-2b-it on GSM8k