Add a script to find regression test threshold for harnesses #4724
+117
−4
Description of changes:
This change adds a find_threshold script that finds an appropriate threshold for test harnesses, taking non-determinism between runs into account. The script runs the test 100 times, records the instruction count for each run, and computes the range of the results as a percentage. It then outputs a recommended threshold, so that when the threshold is exceeded, the result can be attributed with confidence to a performance regression. This keeps our tests from being flaky while still detecting regressions accurately.
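The threshold calculation described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `recommended_threshold` is hypothetical, and it assumes the range is expressed relative to the smallest observed count, which the description does not specify.

```rust
/// Hypothetical helper (not taken from the PR): returns the spread of the
/// observed instruction counts as a percentage of the smallest count.
/// Assumption: the range is normalized by the minimum; the real script may
/// normalize differently (for example, by the mean).
fn recommended_threshold(counts: &[u64]) -> Option<f64> {
    // `min()` and `max()` return `None` for an empty slice, so `?` exits early.
    let min = *counts.iter().min()?;
    let max = *counts.iter().max()?;
    if min == 0 {
        // A percentage relative to zero is undefined.
        return None;
    }
    // Range of the results, scaled to a percentage.
    Some((max - min) as f64 / min as f64 * 100.0)
}
```

For example, with counts of 100, 110, and 105, the range is 10 and the minimum is 100, so this sketch returns a threshold of 10 percent.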
This change adds a contribution section to the README, which describes how to run the script when contributing new test harnesses to the regression crate.
This change also includes setting the thresholds according to the output of this script for the existing tests. The results are included here:
Call-outs:
The script relies on the find_instruction_count function implemented in the regression tests. This is because the test artifacts are scraped upon completion to store the instruction count results. The regression test is run 100 times, and on each iteration the instruction counts are stored in a .csv file in tests/regression.
Artifact paths have the form target/regression_artifacts/#commit_id/#test_name.annotated, which varies across tests and commits.
The diff_percentage variable in DiffProfile::assert_performance is corrected by multiplying the previous value by 100. The previous value did not accurately reflect the change as a percentage: it divided the difference by the total count, which yields a fraction (for example, 0.05) rather than the percent value (5) that should be compared against the threshold and printed in case of a regression.
Testing:
This change has not been formally tested. I ran the script to find the threshold values for the existing tests and have included those changes in this PR.
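The diff_percentage correction mentioned in the call-outs can be illustrated with a short sketch. The functions below are hypothetical stand-ins, since the actual signature and fields of DiffProfile::assert_performance are not shown here; the parameter names `diff`, `total`, and `max_diff_percent` are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the value computed inside
/// DiffProfile::assert_performance (the real code is not reproduced here).
fn diff_percentage(diff: i64, total: u64) -> f64 {
    // Before the fix: diff / total, a fraction such as 0.05.
    // After the fix: the fraction is multiplied by 100, giving 5.0 percent.
    (diff as f64 / total as f64) * 100.0
}

/// Hypothetical check: flags a regression when the percentage change
/// exceeds the configured threshold (also expressed in percent).
fn exceeds_threshold(diff: i64, total: u64, max_diff_percent: f64) -> bool {
    diff_percentage(diff, total) > max_diff_percent
}
```

With a difference of 50 instructions out of 1000, the unscaled fraction is 0.05, which would never exceed a 1 percent threshold, while the scaled value of 5 percent does. The sketch does not guard against a total of zero.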
Is this a refactor change? If so, how have you proved that the intended behavior hasn't changed?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.