Fine grained metrics #308
Conversation
I tested this locally with some ppl suite v3 data that has subdomains, just on gpt-tiny. Everything works okay, but the perplexities are hard to assess because gpt-tiny just gets bad ppl on everything. I could compare this to some past numbers I've gotten on trained OLMo checkpoints, but it's unclear to me how I'm supposed to use tango-in-beaker until the commit has been merged into main (since tango-in-beaker retrieves code from GitHub).
I don't know how to satisfy mypy on this one:
Yes, ideally this new row writer would have the same return type, but because we're now writing out named tables, as opposed to just one table by itself, it has to have a different return type. I tried to just tell mypy to ignore it, but that doesn't seem to work.
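For context, a minimal sketch of the incompatible override mypy complains about (class names here are illustrative, not the actual ones in the PR):

```python
from typing import Any, Dict, List

class WriteOutputsAsRows:
    def run(self) -> List[List[Any]]:
        # Original step: returns a single table as a list of rows.
        return []

class WriteOutputsAsRowsMultipleMetrics(WriteOutputsAsRows):
    def run(self) -> Dict[str, List[List[Any]]]:  # mypy: incompatible with supertype
        # New step: returns named tables keyed by metric_type, so the
        # return type cannot match the base class's run().
        return {}
```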
`_write_to_gsheet` does not really need to be a method of the step class -- let's break it out into its own function, and use it in both classes. This way, the new step class does not need to be a subclass of the old one, and we also make mypy happy.
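A sketch of the suggested refactor, with illustrative (not actual) class and argument names:

```python
from typing import Any, Dict, List

def write_to_gsheet(gsheet_title: str, rows: List[List[Any]], sheet_title: str) -> None:
    """Append ``rows`` to the sheet named ``sheet_title`` in ``gsheet_title``.

    A module-level helper rather than a method, so both step classes can
    share it without one subclassing the other.
    """
    ...  # gsheets client code goes here

class WriteOutputsAsRows:
    def run(self, rows: List[List[Any]], gsheet_title: str) -> List[List[Any]]:
        write_to_gsheet(gsheet_title, rows, sheet_title="Sheet1")
        return rows

class WriteOutputsAsRowsMultipleMetrics:
    def run(
        self, tables: Dict[str, List[List[Any]]], gsheet_title: str
    ) -> Dict[str, List[List[Any]]]:
        # One sheet per metric_type; no inheritance needed, and each run()
        # keeps its own return type, so mypy is satisfied.
        for metric_type, rows in tables.items():
            write_to_gsheet(gsheet_title, rows, sheet_title=metric_type)
        return tables
```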
Great idea! I've changed it as you proposed.
@epwalsh Something strange seems to have been added in the last few commits on main. When I merged in main and went from commit […], I started getting errors. Thanks for anything you can do to help resolve this! Here is the tail of the log with errors:
Double-checked and this branch works again with #333.
This PR creates a new `create_fine_grained_pipeline` which will:

- add a `process-outputs` step on the result of `outputs` steps that computes additional metrics based on per-instance information
- add a `write-outputs-as-rows-multiple-metrics` step to write out separate sheets for each `metric_type` in the metrics dict of the output step, rather than just using the primary metric

Right now I've just included a per-subdomain perplexity as a minimal example of the kinds of metrics that could go into `process-outputs` (a sketch follows below). Additional metrics will eventually include things like metrics computed over only non-contaminated documents.

One thing that would be good to add, as there are more of these post-processing metrics, would be a way to specify which of these metrics to include from the evaluation configuration file.
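Rough sketch of what a per-subdomain perplexity computation in `process-outputs` might look like (field names such as `subdomain`, `log_likelihood`, and `num_tokens` are illustrative assumptions, not the PR's actual schema):

```python
import math
from collections import defaultdict
from typing import Any, Dict, List

def per_subdomain_perplexity(instances: List[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate per-instance log-likelihoods into one perplexity per subdomain.

    Perplexity over a group is exp(-(sum of log-likelihoods) / (total tokens)).
    """
    total_ll: Dict[str, float] = defaultdict(float)
    total_tokens: Dict[str, int] = defaultdict(int)
    for inst in instances:
        total_ll[inst["subdomain"]] += inst["log_likelihood"]
        total_tokens[inst["subdomain"]] += inst["num_tokens"]
    return {sub: math.exp(-total_ll[sub] / total_tokens[sub]) for sub in total_ll}

# The returned dict could then be one metric_type entry in the metrics dict,
# e.g. {"ppl_per_subdomain": per_subdomain_perplexity(instances)}, which the
# write-outputs-as-rows-multiple-metrics step turns into its own sheet.
```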