Check test cases with measurements #2161
Haven't looked at the code yet, but can you clarify how we avoid trying to benchmark e.g. 1000s of artifacts when a new benchmark or new scenario is added? In other words, IIUC the backfill being referenced shouldn't always apply, right? (At least not on the official collector?)
It's a bit complicated 😅 Details are in https://hackmd.io/wq30YNEIQMSFLWWcWDSI9A. The TLDR is:
Note that this PR does not actually allow backfill with the old system; it should work the same as before. It just allows backfill in the future with the new system, while being compatible with both systems.
.into_iter()
.map(|test_case| (*scenario, test_case))
})
.filter(|(_, test_case)| !already_computed.contains(test_case))
Shouldn't this condition be more complex? For example, if we add a new incr-patched scenario, we can't run the build for that without going through incr-clean (at least) - and probably the preceding patches, too.
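The dependency described in the comment above can be sketched as follows. This is an illustrative stand-in, not the actual rustc-perf types or logic: the point is that an incremental scenario cannot be backfilled in isolation, because it needs the incremental build (and possibly earlier patches) to have run first.

```rust
// Illustrative sketch only: `Scenario` here is a local stand-in for the
// real rustc-perf type, and `prerequisites` is a hypothetical helper.
// The idea: backfilling IncrPatched requires first running IncrFull
// (and likely the preceding patches), since incremental caches build
// on each other.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Scenario {
    Full,
    IncrFull,
    IncrUnchanged,
    IncrPatched,
}

/// Scenarios that must run before `scenario` can produce valid results.
fn prerequisites(scenario: Scenario) -> Vec<Scenario> {
    match scenario {
        Scenario::Full | Scenario::IncrFull => vec![],
        // Both incremental variants need the initial incremental build.
        Scenario::IncrUnchanged | Scenario::IncrPatched => vec![Scenario::IncrFull],
    }
}

fn main() {
    // Backfilling IncrPatched alone is not enough; IncrFull must run too.
    assert_eq!(prerequisites(Scenario::IncrPatched), vec![Scenario::IncrFull]);
    assert!(prerequisites(Scenario::Full).is_empty());
    println!("ok");
}
```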
backend: &CodegenBackend,
target: &Target,
) -> Vec<CompileTestCase> {
self.patches
Are we expecting this to get more complicated later? It seems a bit wasteful (I guess fine in practice...) to produce a relatively large vec and then dedup most of it away.
if remaining_scenarios.is_empty() {
    continue;
}
remaining_scenarios.sort();
nit: if we're going to sort, maybe just use that for the dedup rather than intermediating through the HashSet?
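The nit can be illustrated with a minimal sketch (a plain `Vec<u32>` stands in for the real scenario type): `Vec::dedup` removes consecutive duplicates, so after a sort it removes all duplicates, with no intermediate `HashSet` needed.

```rust
fn main() {
    // Duplicates as they might accumulate while collecting remaining scenarios.
    let mut remaining_scenarios = vec![3u32, 1, 3, 2, 1];

    // Since we sort anyway, dedup directly on the sorted Vec instead of
    // routing through a HashSet first. dedup only removes *consecutive*
    // duplicates, which after sorting means all duplicates.
    remaining_scenarios.sort();
    remaining_scenarios.dedup();

    assert_eq!(remaining_scenarios, vec![1, 2, 3]);
    println!("{:?}", remaining_scenarios); // prints [1, 2, 3]
}
```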
With the new design, it will be possible to backfill results into the DB. For example, if you ask on a PR to see results for the cranelift backend (which is not benchmarked by default), the collector will go back and backfill cranelift backend data for the parent master commit.
To support that, we need to expand the notion of a benchmark being "done". Right now, we record an (artifact, benchmark_name) tuple (called a step) in the DB when a benchmark begins, and if we ever encounter the same tuple again, we don't benchmark it again. That's not ideal: if an error happened and no data was generated, you won't be able to retry the collection without removing everything for the given artifact from the DB. And mainly, you cannot backfill more results (e.g. by running only Debug first, and then backfilling Opt, which is also useful for local experiments).

This PR expands the concept of a benchmark being done by actually checking which compile-time test cases are present in the DB. We cheat a bit for better performance: if there is at least one recorded statistic in the DB for a given test case, we consider it to be done (so we essentially ignore missing iterations, but that should be a niche edge case).
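A minimal sketch of that "done" check, with the DB replaced by an in-memory map and simplified key types (the real code queries the database): a test case counts as done as soon as any statistic exists for it, regardless of how many iterations were recorded.

```rust
use std::collections::HashMap;

// Simplified stand-in for a compile-time test case key; the real
// CompileTestCase also carries profile, scenario, backend, and target.
type TestCase = (&'static str, &'static str); // (benchmark, scenario)

/// A test case is considered "done" if at least one statistic was
/// recorded for it, even if some iterations are missing.
fn is_done(stats: &HashMap<TestCase, Vec<f64>>, test_case: &TestCase) -> bool {
    stats.get(test_case).map_or(false, |s| !s.is_empty())
}

fn main() {
    let mut stats: HashMap<TestCase, Vec<f64>> = HashMap::new();
    stats.insert(("syn", "Full"), vec![1.23]); // one iteration recorded
    stats.insert(("serde", "IncrFull"), vec![]); // step started, no data

    assert!(is_done(&stats, &("syn", "Full")));
    // No statistics recorded => not done, so it can be retried or backfilled.
    assert!(!is_done(&stats, &("serde", "IncrFull")));
    assert!(!is_done(&stats, &("regex", "Full")));
    println!("ok");
}
```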
Even though this logic is mostly useful for the new scheme, which is not implemented yet, I decided to also implement it for the current benchmarking logic, because it's useful for local experiments.
Best reviewed commit by commit.