Fix reproducibility issues, save metrics to disk and cleanup scripts #67
Signed-off-by: SumanthRH <[email protected]>
Now, try to solve the following question through the above guidelines:"
Now, try to solve the following question through the above guidelines."
to remove
assuming this was just testing, but just confirming we are going to do just "guidelines:"?
Yes we will retain the original prompt
Ideally all prompt changes should be tested on a validation set, and used as is during evaluation. I was playing around with this but realized we should just use the original prompt.
87ba4ab to b2befb8
I got the following results on AIME and GPQA Diamond at temperature 0:
I'm gonna evaluate at t=0.7, n=8 now to see if I can match our original results.
For t=0.7, n=8, here are the results I got:

AIME: pass@1 is 36.25. Other metrics:

"pass_at_k": {
    "temp=0.7": {
        "k=8": 60.0,
        "k=4": 52.571,
        "k=2": 44.762,
        "k=1": 36.25
    }
},
"accuracy": {
    "temp=0.7": 0.3625
}

GPQA Diamond: pass@1 is 54.92. Other metrics:

"pass_at_k": {
    "temp=0.7": {
        "k=8": 82.828,
        "k=4": 74.993,
        "k=2": 66.216,
        "k=1": 54.924
    }
},
"accuracy": {
    "temp=0.7": 0.5492
}

Note that pass@1 according to HumanEval's formula is expected to match accuracy.
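To illustrate why pass@1 and accuracy coincide, here is a minimal sketch of the unbiased pass@k estimator from the HumanEval paper, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that are correct. The function names and the per-problem correct counts below are hypothetical and are not taken from this PR's code or results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts for 4 problems with n=8 samples each (illustration only).
counts = [8, 3, 0, 1]
n = 8

accuracy = sum(counts) / (n * len(counts))
pass_at_1 = mean_pass_at_k(counts, n, k=1)
print(accuracy, pass_at_1)
```

For k=1 the formula reduces to c/n per problem, so averaging over problems gives the same number as overall accuracy whenever every problem has the same n.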
Overall LGTM! Just left some small comments and questions.
What does this PR do?
This PR does a few things:
half precision. More details to follow. We now use float32 by default.
`tee`. Saving metrics explicitly also eliminates this.
New Metrics File: Example
TODO:
Should fix: #66, #48
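The description above does not spell out how half precision affected reproducibility. One common mechanism, shown here purely as an assumption and not as this PR's diagnosis, is that low-precision addition is sensitive to accumulation order, so a change in reduction order can change the result. The sketch below emulates IEEE half and single precision with the standard library's `struct` module; the helper names are hypothetical.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def rounded_sum(values, round_fn):
    """Sum left to right, rounding after every addition with round_fn."""
    total = 0.0
    for v in values:
        total = round_fn(total + round_fn(v))
    return total


values = [2048.0, 1.0, 1.0]

# Half precision: the spacing between representable values at 2048 is 2,
# so adding 1 to 2048 is rounded away, while adding 2 is not.
half_forward = rounded_sum(values, to_half)
half_backward = rounded_sum(list(reversed(values)), to_half)

# Single precision represents all intermediate values exactly here.
single_forward = rounded_sum(values, to_single)
single_backward = rounded_sum(list(reversed(values)), to_single)

print(half_forward, half_backward)      # 2048.0 2050.0
print(single_forward, single_backward)  # 2050.0 2050.0
```

With the same three numbers, the half-precision sum depends on the order of addition while the single-precision sum does not, which is consistent with float32 being the safer default for repeatable evaluation.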