
Fix reproducibility issues, save metrics to disk and cleanup scripts #67

Merged
merged 22 commits into main from sumanthrh/fix-repro-issues on Feb 13, 2025

Conversation

@SumanthRH (Collaborator) commented Feb 6, 2025

What does this PR do?

This PR does a few things:

  • Fixes reproducibility issues in skythought. The core issue is performing inference in half precision; more details to follow. We now use float32 by default.
  • Adds support for saving metrics to disk (along with token usage statistics).
  • Adds support for computing pass@k when args.n > 1. We automatically compute pass@k for each power of 2 up to args.n (see the sketch after this list).
  • Cleans up minor bugs after merging #63 ([evals] Add support for scaling evals and inference with ray) and #60 (Refactor model-specific configs and move data curation scripts).
  • Removes reading stdout output in eval.py. This is extremely costly for large datasets or for args.n > 1, because we iterate over all the logs in a for loop. By default, child-process logs get streamed to the stdout of the parent process, so if we want to save all logs to disk, we can just use tee. Saving metrics explicitly also eliminates the need for this.
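
As a reference for the pass@k numbers reported in this PR, here is a minimal sketch of a HumanEval-style unbiased pass@k estimator. This is not the exact code added in this PR, and the per-problem correct counts below are made-up values for illustration only.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased HumanEval-style estimator: 1 - C(n - c, k) / C(n, k),
    # where n = samples per problem, c = correct samples, k = samples drawn.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Made-up per-problem correct counts out of n = 16 samples each.
n = 16
correct_counts = [6, 0, 16, 3]
for k in [1, 2, 4, 8, 16]:  # powers of 2 up to n
    score = float(np.mean([pass_at_k(n, c, k) for c in correct_counts]))
    print(f"k={k}: {100 * score:.3f}")

Averaging the per-problem estimates and scaling by 100 gives percentages in the same format as the "pass_at_k" block in the metrics file shown below.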

New Metrics File: Example

{
    "completion_tokens": 4265446,
    "prompt_tokens": 176656,
    "avg_completion_tokens": 8886.346,
    "avg_prompt_tokens": 368.033,
    "pass_at_k": {
        "temp=0.7": {
            "k=16": 66.667,
            "k=8": 60.569,
            "k=4": 52.879,
            "k=2": 44.028,
            "k=1": 35.417
        }
    },
    "accuracy": {
        "temp=0.7": 0.3542
    }
}
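
For illustration, here is a minimal sketch of writing such a metrics file to disk. The output path and the values in the dictionary are hypothetical; they are not necessarily the exact path or field set that eval.py uses.

import json
from pathlib import Path

# Hypothetical aggregated values; in practice these are computed from the model responses.
metrics = {
    "completion_tokens": 4265446,
    "prompt_tokens": 176656,
    "avg_completion_tokens": 8886.346,
    "avg_prompt_tokens": 368.033,
    "pass_at_k": {"temp=0.7": {"k=16": 66.667, "k=1": 35.417}},
    "accuracy": {"temp=0.7": 0.3542},
}

out_dir = Path("results/aime")  # hypothetical results directory
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "metrics.json", "w") as f:
    json.dump(metrics, f, indent=4)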

TODO:

  • Re-compute AIME and GPQA Diamond scores
  • Improve pass@k calculation

Should fix: #66, #48

@SumanthRH SumanthRH marked this pull request as ready for review February 7, 2025 00:15
Now, try to solve the following question through the above guidelines:"
Now, try to solve the following question through the above guidelines."
@SumanthRH (Collaborator, Author) commented:

to remove

Collaborator commented:

assuming this was just testing, but just confirming we are going to do just "guidelines:"?

@SumanthRH (Collaborator, Author) commented:

Yes, we will retain the original prompt.

@SumanthRH (Collaborator, Author) commented:

Ideally, all prompt changes should be tested on a validation set and then used as-is during evaluation. I was experimenting with this but realized we should just use the original prompt.

@lynnliu030 lynnliu030 self-requested a review February 7, 2025 00:17
skythought/skythought_evals/README.md (outdated, resolved)

skythought/skythought_evals/util/response.py (resolved)
@SumanthRH SumanthRH force-pushed the sumanthrh/fix-repro-issues branch from 87ba4ab to b2befb8 Compare February 7, 2025 01:06
@SumanthRH (Collaborator, Author) commented:

I got the following results on AIME and GPQA Diamond at temperature 0:

  • AIME: 36.67 (11/30)
  • GPQA-Diamond: 53.03 (105/198)

I'm gonna evaluate at t=0.7, n=8 now to see if I can match our original results.

@SumanthRH (Collaborator, Author) commented:

For t=0.7, n=8, here are the results I got:

AIME: pass@1 is 36.25. Other metrics:

"pass_at_k": {
        "temp=0.7": {
            "k=8": 60.0,
            "k=4": 52.571,
            "k=2": 44.762,
            "k=1": 36.25
        }
    },
    "accuracy": {
        "temp=0.7": 0.3625
    }

GPQA Diamond: pass@1 is 54.92. Other metrics:

"pass_at_k": {
        "temp=0.7": {
            "k=8": 82.828,
            "k=4": 74.993,
            "k=2": 66.216,
            "k=1": 54.924
        }
    },
    "accuracy": {
        "temp=0.7": 0.5492
    }

Note that pass@1 computed with HumanEval's formula is expected to match accuracy.
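
For context (this derivation is not from the PR itself): with the HumanEval estimator, the per-problem pass@1 reduces to the fraction of correct samples, so averaging over problems gives the same number as accuracy over all n samples:

\mathrm{pass@}k = \mathbb{E}_{\text{problems}}\!\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
\quad\Rightarrow\quad
\mathrm{pass@}1 = \mathbb{E}_{\text{problems}}\!\left[ 1 - \frac{n-c}{n} \right] = \mathbb{E}_{\text{problems}}\!\left[ \frac{c}{n} \right]

This matches the GPQA Diamond numbers above: pass@1 = 54.924 and accuracy = 0.5492.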

@lynnliu030 (Member) left a comment:

Overall LGTM! Just left some small comments and questions.

skythought/skythought_evals/README.md (resolved)
skythought/skythought_evals/eval.py (resolved)
@SumanthRH SumanthRH merged commit 18cd85e into main Feb 13, 2025
4 checks passed
@SumanthRH SumanthRH deleted the sumanthrh/fix-repro-issues branch February 13, 2025 01:21
@SumanthRH SumanthRH mentioned this pull request Feb 13, 2025

Successfully merging this pull request may close the following issue: "can't reproduce the accuracy data"