
Fix reproducibility issues, save metrics to disk and cleanup scripts #67

Merged
merged 22 commits into main from sumanthrh/fix-repro-issues on Feb 13, 2025

Conversation

@SumanthRH (Collaborator) commented Feb 6, 2025

What does this PR do?

This PR does a few things:

  • Fixes reproducibility issues in skythought. The core issue is performing inference in half precision; more details to follow. We now use float32 by default.
  • Adds support for saving metrics to disk (along with token usage statistics).
  • Adds support for computing pass@k when args.n > 1. We automatically compute pass@k for each power of 2 up to args.n (see the sketch after this list).
  • Cleans up minor bugs after merging #63 ([evals] Add support for scaling evals and inference with ray) and #60 (Refactor model-specific configs and move data curation scripts).
  • Removes reading stdout output in eval.py. This is extremely costly for large datasets or for args.n > 1, because we iterate over all the logs in a for loop. By default, child-process logs get streamed to the stdout of the parent process, so if we want to save all logs to disk, we can just use tee. Saving metrics explicitly also eliminates the need for this.
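
As a reference for the pass@k numbers reported in this PR, here is a minimal sketch of a HumanEval-style unbiased pass@k estimator. This is not the exact code added in this PR, and the per-problem correct counts below are made-up values for illustration only.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased HumanEval-style estimator: 1 - C(n - c, k) / C(n, k),
    # where n = samples per problem, c = correct samples, k = samples drawn.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Made-up per-problem correct counts out of n = 16 samples each.
n = 16
correct_counts = [6, 0, 16, 3]
for k in [1, 2, 4, 8, 16]:  # powers of 2 up to n
    score = float(np.mean([pass_at_k(n, c, k) for c in correct_counts]))
    print(f"k={k}: {100 * score:.3f}")

Averaging the per-problem estimates and scaling by 100 gives percentages in the same format as the "pass_at_k" block in the metrics file shown below.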

New Metrics File: Example

{
    "completion_tokens": 4265446,
    "prompt_tokens": 176656,
    "avg_completion_tokens": 8886.346,
    "avg_prompt_tokens": 368.033,
    "pass_at_k": {
        "temp=0.7": {
            "k=16": 66.667,
            "k=8": 60.569,
            "k=4": 52.879,
            "k=2": 44.028,
            "k=1": 35.417
        }
    },
    "accuracy": {
        "temp=0.7": 0.3542
    }
}
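
For illustration, here is a minimal sketch of writing such a metrics file to disk. The output path and the values in the dictionary are hypothetical; they are not necessarily the exact path or field set that eval.py uses.

import json
from pathlib import Path

# Hypothetical aggregated values; in practice these are computed from the model responses.
metrics = {
    "completion_tokens": 4265446,
    "prompt_tokens": 176656,
    "avg_completion_tokens": 8886.346,
    "avg_prompt_tokens": 368.033,
    "pass_at_k": {"temp=0.7": {"k=16": 66.667, "k=1": 35.417}},
    "accuracy": {"temp=0.7": 0.3542},
}

out_dir = Path("results/aime")  # hypothetical results directory
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "metrics.json", "w") as f:
    json.dump(metrics, f, indent=4)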

TODO:

  • Re-compute AIME and GPQA Diamond scores
  • Improve pass@k calculation

Should fix: #66, #48

@SumanthRH SumanthRH marked this pull request as ready for review February 7, 2025 00:15
Now, try to solve the following question through the above guidelines:"
Now, try to solve the following question through the above guidelines."
@SumanthRH (Collaborator, Author) commented:

to remove

Collaborator commented:

assuming this was just testing, but just confirming we are going to do just "guidelines:"?

@SumanthRH (Collaborator, Author) commented:

Yes, we will retain the original prompt.

@SumanthRH (Collaborator, Author) commented:

Ideally, all prompt changes should be tested on a validation set and then used as-is during evaluation. I was experimenting with this but realized we should just use the original prompt.

@lynnliu030 lynnliu030 self-requested a review February 7, 2025 00:17
skythought/skythought_evals/README.md (outdated, resolved)

skythought/skythought_evals/util/response.py (resolved)
@SumanthRH SumanthRH force-pushed the sumanthrh/fix-repro-issues branch from 87ba4ab to b2befb8 Compare February 7, 2025 01:06
@SumanthRH (Collaborator, Author) commented:

I got the following results on AIME and GPQA Diamond at temperature 0:

  • AIME: 36.67 (11/30)
  • GPQA-Diamond: 53.03 (105/198)

I'm gonna evaluate at t=0.7, n=8 now to see if I can match our original results.

@SumanthRH (Collaborator, Author) commented:

For t=0.7, n=8, here are the results I got:

AIME: pass@1 is 36.25. Other metrics:

"pass_at_k": {
        "temp=0.7": {
            "k=8": 60.0,
            "k=4": 52.571,
            "k=2": 44.762,
            "k=1": 36.25
        }
    },
    "accuracy": {
        "temp=0.7": 0.3625
    }

GPQA Diamond: pass@1 is 54.92. Other metrics:

"pass_at_k": {
        "temp=0.7": {
            "k=8": 82.828,
            "k=4": 74.993,
            "k=2": 66.216,
            "k=1": 54.924
        }
    },
    "accuracy": {
        "temp=0.7": 0.5492
    }

Note that pass@1 computed with HumanEval's formula is expected to match accuracy.
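
For context (this derivation is not from the PR itself): with the HumanEval estimator, the per-problem pass@1 reduces to the fraction of correct samples, so averaging over problems gives the same number as accuracy over all n samples:

\mathrm{pass@}k = \mathbb{E}_{\text{problems}}\!\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
\quad\Rightarrow\quad
\mathrm{pass@}1 = \mathbb{E}_{\text{problems}}\!\left[ 1 - \frac{n-c}{n} \right] = \mathbb{E}_{\text{problems}}\!\left[ \frac{c}{n} \right]

This matches the GPQA Diamond numbers above: pass@1 = 54.924 and accuracy = 0.5492.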

@lynnliu030 (Member) left a comment:

Overall LGTM! Just left some small comments and questions.

skythought/skythought_evals/README.md (resolved)
skythought/skythought_evals/eval.py (resolved)
@SumanthRH SumanthRH merged commit 18cd85e into main Feb 13, 2025
4 checks passed
@SumanthRH SumanthRH deleted the sumanthrh/fix-repro-issues branch February 13, 2025 01:21
@SumanthRH SumanthRH mentioned this pull request Feb 13, 2025

Successfully merging this pull request may close the following issue: "can't reproduce the accuracy data"