benchstats is a Python 3.10+ package that lets you compare two sets of benchmark results* using proper statistical tests and produce a readable report on that. This removes guesswork and WAGs from the interpretation of benchmarking results and gives a solid** answer to the question "is algorithm implementation A faster than algorithm implementation B?".
The code that reads input data, performs statistical testing, and then visualizes results is coupled only by shared data types, so each part can easily be used separately if needed (for example, to build a performance regression testing pipeline).
* - while currently benchstats has a built-in parser only for Google Benchmark-produced .json files, it's very simple to make and use a parser for any other data source as long as you know how to interpret it. See the "Custom data sources" section and the --files_parser CLI argument below.
** - the solidity of the answer still depends heavily on how well the underlying hardware and OS are quiesced, and on how inherently non-deterministic the algorithms under benchmark are. Achieving these benchmarking basics isn't covered here; however, benchstats does have a mode that helps to test whether the benchmarking machine is stable enough, see the --expect_same CLI argument below.
- Really, why bother? Can't you just measure and compare the run time of code variant A to the run time of code variant B?
Well, you can, but to see that this is meaningless, just run your benchmark several times. The numbers will never be the same, they'll fluctuate. Which particular number should you use for the comparison then?
- Ok, then why not take an average over a certain number of benchmark executions and compare these averages?
If you repeat that experiment several times, you'll see it suffers from the very same problem as above - the number you get still fluctuates, though usually it fluctuates less. You'll obtain two such numbers for the benchmark alternatives, and one will be bigger than the other. How would you know that the difference is large enough to be attributed to real differences in the code rather than to random non-determinism?
- Ah, but that's simple! In addition to the average value, also compute a standard deviation, and then if the averages are within 1-2-whatever standard deviations of each other, declare that the results aren't different. Otherwise one is significantly less or greater than the other. If you want to be extra sure, go with a "6 sigma golden standard" just like they do in physics! Google Benchmark even computes the standard deviation on its own when you run a benchmark with repetitions! Easy!
... and you'd be even more wrong than if you just compared two random measurements directly. Standard deviation is a measure of variance that is applicable (read: that makes sense to apply) only to random variables drawn from a Normal/Gaussian distribution. Measurements you get when benchmarking are anything but normally distributed, and a value computed as their "standard deviation" just doesn't describe what it's expected to describe - the real variability of the distribution of measurements.
None of the above is a good idea. But what is? Let's step back and think about what we want to learn from comparing benchmark results. We want to learn whether code variant A is faster, i.e. has a smaller run time, than code variant B. But we can't just measure their run times and compare them directly, because each measurement is essentially a random variable. Even if the algorithms are perfectly deterministic, the hardware inherently isn't, so instead of measuring a run time precisely, we always sample a random value from some unknown distribution. Hence, the only correct way to go is to collect many such samples for both A and B, and then compare these sets of samples against each other. If it's more likely to find a smaller value in set A than in set B, then A is faster than B.
But how to do that? It's a well-known problem in mathematics, and several solutions exist, such as the Mann-Whitney U test and the newer, more generic Brunner-Munzel test. benchstats is essentially a convenience wrapper around these tests with fully customizable inputs and results visualization/exporting, letting you get the answers effortlessly.
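Just to illustrate the kind of test involved (this is not benchstats code, and the timing numbers are made up), here is how such a comparison looks when applied directly to two sets of per-repetition timings with SciPy:

import numpy as np
from scipy.stats import brunnermunzel

# per-repetition timings (e.g. in microseconds) of two code variants; made-up numbers
timings_a = np.array([516.5, 502.1, 510.3, 508.8, 512.4, 505.0, 515.7, 509.6, 507.2, 511.9])
timings_b = np.array([502.8, 499.4, 504.1, 501.7, 503.9, 498.6, 505.3, 500.2, 503.0, 501.1])

# two-sided Brunner-Munzel test: a small p-value means it's unlikely that
# the two sets of timings come from the "same" distribution
stat, pvalue = brunnermunzel(timings_a, timings_b)
print(f"p-value = {pvalue:.3g}")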
pip install benchstats
For command-line use, see the "Examples" and "Command Line Interface Reference" sections below. For an API overview, see the corresponding section below.
(I'll use a Google Benchmark-derived binary as the data source for the examples, but remember that any data source can easily be plugged in and used instead. For conciseness, I'll shorten "Google Benchmark" to simply "GBench".)
Note that in all cases, you should use at least 10 repetitions of a benchmark (--benchmark_repetitions=10). Even 2 repetitions would work, but the more data you have, the more reliable the results are, and at least 10 is a reasonable minimum, especially for the Brunner-Munzel test, which is the default. The --benchmark_enable_random_interleaving flag is highly recommended too.
Let's consider two use cases to illustrate how to use benchstats.
For example, you already have a benchmark in the main code branch and want to ensure that a feature branch, which modifies some underlying code, doesn't contain unexpected performance regressions. How to do that?
- Run the benchmark from the main branch with the --benchmark_out=./my_main.json argument to save results to the ./my_main.json file:
benchmark_binary --benchmark_repetitions=10 \
--benchmark_enable_random_interleaving \
--benchmark_out=./my_main.json
- Check out the feature code branch and similarly run the same benchmark, saving results to the ./my_feature.json file.
- Let benchstats analyze the results:
benchstats ./my_main.json ./my_feature.json real_time
The first two positional arguments of the benchstats module are just file paths, and the third argument is a JSON field name that contains a metric value for comparison (real_time is the default field/metric, so it doesn't even have to be specified, btw). GBench typically saves two metrics: real_time (shown as just Time in reports), which is the average time of one iteration, and cpu_time (shown as CPU in reports), which is the average time the CPU spent executing the user-mode code. GBench also supports custom timers, and you can use them here as well. Typically, custom timers have the same field name as the name that was given to the counter (in case of doubt, just peek into the .json file).
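If you aren't sure which metric field names your results file actually contains, a quick way to check (just an illustrative snippet, not part of benchstats) is to list the keys of one benchmark entry:

import json

# print the field names of the first benchmark entry in a GBench results file
with open("./my_main.json") as f:
    data = json.load(f)
print(sorted(data["benchmarks"][0].keys()))  # real_time, cpu_time, custom counters, ...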
If the benchmark_binary had only a single benchmark, the result of the command could look something like this:
benchstats presents results as a table: the first column corresponds to a benchmark name, and all other columns show the results of applying a statistical test to the corresponding metric (yes, you can supply many metrics at once and all of them will be analyzed; try appending cpu_time to the command line above).
Here's what a data cell of the real_time column means: there were 10 samples of alternative A (from the first positional argument, my_main.json) and 10 samples of alternative B (from the second positional argument, my_feature.json), hence the (10 vs 10) text at the end. The mean value (hence (means) in the column title) for alternative A over its 10 samples was 516.5 microseconds, while the mean for B was 502.8 microseconds. The test was using significance level 0.001 (hence the string alpha=0.00100 in the table title), and at that level the test declared that the two sets aren't different from each other (hence the ~ symbol between the mean values). The actual p-value for tests that didn't find a difference is not shown by default; only blank space is allocated to report it (so no text shift happens when some benchmarks have differences and others don't), but you can pass the --always_show_pvalues flag to show it.
To show how a benchstats report could look when a significant difference is found, I'll just rerun the same analysis with a less restrictive significance level (and will use the opportunity to show the effects of some of the above-mentioned flags):
benchstats ./my_main.json ./my_feature.json real_time cpu_time \
--always_show_pvalues --alpha=0.03
The main change is that now the test reports that, according to the real_time metric, the benchmark that generated ./my_main.json ran significantly slower than the benchmark that made ./my_feature.json. Because of that, the script exited with exit code 1, letting one detect that a human intervention might be needed. But what are these "main metrics" that the warning string refers to? Main metrics are metrics in which any detected difference makes the script exit with error code 1. By default, the metric mentioned first in the command-line list of metrics is the main metric, but this can be changed with the --main_metrics argument, which accepts a list of indices into the metrics list. Let's make cpu_time the main metric and see what changes:
benchstats ./my_main.json ./my_feature.json real_time cpu_time \
--always_show_pvalues --alpha=0.03 --main_metrics 1
Now cpu_time is the main metric, and the difference in it isn't significant. real_time is now a secondary metric, which is only reported but doesn't influence the script's exit code.
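Since the exit code carries the verdict for the main metrics, wiring benchstats into an automated regression check is straightforward. A minimal sketch (file names are taken from the example above):

import subprocess
import sys

# exit code 0: no significant difference in main metrics; non-zero: difference (or an error)
proc = subprocess.run(["benchstats", "./my_main.json", "./my_feature.json", "real_time"])
if proc.returncode != 0:
    sys.exit("Possible performance regression detected - please investigate.")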
For example, we are optimizing a function and want to see how a new implementation compares to the old one. We've created an individual benchmark for each implementation and, for convenience, have placed them in the same binary. That implies the benchmark functions must have different names. Now, if we run the benchmark binary, obtain the ./my.json results file, and then follow the approach above blindly, we won't get what we need: benchstats compares a benchmark from source 1 to the benchmark with the same name in source 2, so we'll end up comparing old vs old in one row and new vs new in the other, just like this:
benchstats ./my.json ./my.json real_time
To solve that, benchstats has two features based on the standard Python regular expression engine:

- filtering benchmarks by name when reading a source file, using a regexp in the --filter1 or --filter2 flags correspondingly;
- a benchmark name rewrite feature, based on regexp search and replace. The --from flag (also --from1 for source 1 only and --from2 for source 2 only) sets a regular expression pattern to find in each benchmark name, and an optional --to flag (also --to1 and --to2) sets the replacement string to substitute for the pattern found.
This allows us to make a plan:
- Source 1 will represent the new implementation. For that we'll read ./my.json and take only benchmark names that don't have the Old substring, using the regexp ^(?!.*?Old).*$
- Source 2 will represent the old implementation. We'll read ./my.json and take only benchmarks with names that contain the Old substring. The regexp is the trivial literal string Old.
- For source 2 we'll rewrite benchmark names to remove the Old prefix.
benchstats ./my.json ./my.json real_time \
--filter1='^(?!.*?Old).*$' --filter2=Old \
--from2='(?:Old)(.*?)' --to2=\\1
Note the single ' quotes around the --filter1 and --from2 regexps. They contain special characters that bash will try to interpret, raising an error, if we don't escape them. Single quotes prevent bash expansion (double quotes " don't!) and pass the strings as-is to the program. Also note that the replacement pattern for --to2 has its backslash escaped for that very reason too: \\1 instead of \1. Single quotes are an option here too.
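If you want to sanity-check such regexps before feeding them to benchstats, you can probe them with Python's re module directly. The benchmark names below are made up, and the snippet only illustrates the patterns themselves, not benchstats' internal matching code:

import re

names = ["BM_Foo", "BM_OldFoo"]  # hypothetical benchmark names

# --filter1: keep only names WITHOUT the "Old" substring (the new implementation)
print([n for n in names if re.search(r"^(?!.*?Old).*$", n)])   # ['BM_Foo']

# --filter2 plus --from2/--to2: keep names WITH "Old", then strip the "Old" part
old_names = [n for n in names if re.search("Old", n)]
print([re.sub(r"(?:Old)(.*?)", r"\1", n) for n in old_names])  # ['BM_Foo']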
Now we get a proper result:
While benchstats today supports only Google Benchmark-produced .json files out of the box, it's intended to be able to run statistical tests on any data source. If you know how to parse your custom file format, connect to a remote database, generate data on the fly, or obtain data any other way, you can easily let benchstats run statistical tests on it by writing a custom parser.
Any Python class (a parser) named X that:

- inherits from the benchstats.common.ParserBase class
- can be instantiated with the following 4 parameters:
  - source_id - a string identifier describing a data source within the parser's controlled namespace. It is obtained from the first or the second positional argument of the benchstats module CLI.
    - In the examples above, for the built-in parser_GbenchJson this is a source file name.
  - filter - a string identifier of a filter, used to restrict the set of benchmarks to load. Obtained as the value of the --filter1 or --filter2 CLI flags.
    - For the built-in parser_GbenchJson this is a string with the regular expression to use for filtering benchmark names.
  - metrics - a non-empty list or tuple of string identifiers corresponding to the metrics to read for benchmarks. Obtained from the third and subsequent CLI positional arguments.
  - debug_log - an instance of the benchstats.common.LoggingConsole class that provides a console logging interface to report errors, if any (built-in parsers should also support None and bool values and instantiate a LoggingConsole on their own, but support for this isn't needed for a custom parser)
- has a def getStats(self) method returning a dictionary that maps strings describing benchmark names to dictionaries linking metric names to np.ndarray objects of metric values
- is saved to a file X.py

can be used as a custom data source for the benchstats CLI. Just pass the path to X.py as the value of the --files_parser argument (or --file1_parser or --file2_parser if you want to use different parsers for source 1 and source 2).
Imagine you have a one-column CSV with data on which you want to run a statistical significance test. Save the following to ./myCSV.py:
import numpy as np
from benchstats.common import ParserBase


class myCSV(ParserBase):
    def __init__(self, fpath, filter, metrics, debug_log=None) -> None:
        # load the whole one-column CSV as a 1D array of float64 samples
        self.stats = np.loadtxt(fpath, dtype=np.float64)

    def getStats(self) -> dict[str, dict[str, np.ndarray]]:
        # a single hardcoded benchmark name with a single metric
        return {"bm": {"real_time": self.stats}}
There's always only one benchmark name, bm, hardcoded, and only one metric, real_time, that is read from the CSV, but you get the idea. It's that simple.
Now just run benchstats, passing the path to the parser in the --files_parser flag:
benchstats ./src1.csv ./src2.csv --files_parser ./myCSV.py
If you make a parser that might be useful to other people, please consider adding it to the project's built-in parser set by opening a suggestion thread at https://github.com/Arech/benchstats/issues or by making a PR to the repo.
You can always get the actual CLI help output by passing the --help flag to a benchstats invocation. Note that the --help output takes precedence over the description below.
usage: __main__.py [-h] [--show_debug | --no-show_debug]
[--files_parser <files parser class or path>]
[--file1_parser <file1 parser class or path>]
[--filter1 <reg expr>]
[--file2_parser <file2 parser class or path>]
[--filter2 <reg expr>]
[--from <reg expr>] [--to <replacement string>]
[--from1 <reg expr>] [--to1 <replacement string>]
[--from2 <reg expr>] [--to2 <replacement string>]
[--main_metrics <metric idx> [<metric idx> ...]]
[--method <method id>]
[--alpha <positive float less than 0.5>]
[--bonferroni]
[--always_show_pvalues]
[--sample_sizes | --no-sample_sizes]
[--sample_stats <stat ids> [<stat ids> ...]]
[--metric_precision <int>=3>]
[--expect_same]
[--colors | --no-colors]
[--multiline | --no-multiline]
[--export_to <path/to/export_file>]
[--export_fmt <format id>]
[--export_light]
<path/to/file1> <path/to/file2> [metrics ...]
-h, --help
    show this help message and exit
--show_debug, --no-show_debug
    Shows some additional debugging info. Default: True
Arguments describing how the source data sets are obtained and optionally transformed.
<path/to/file1>
    Path to the first data file with benchmark results. See also the --file1_parser and --filter1 arguments.
<path/to/file2>
    Path to the second data file with benchmark results. See also the --file2_parser and --filter2 arguments.
--files_parser <files parser class or path>
    Sets the files parser class identifier if a built-in parser is used (options are: GbenchJson), or sets a path to a .py file defining a custom parser inherited from the benchstats.common.ParserBase class. The parser class name must be the same as the file name.
--file1_parser <file1 parser class or path>
    Same as --files_parser, but only applies to file1.
--filter1 <reg expr>
    If specified, sets a Python regular expression to select benchmarks by name from <file1>.
--file2_parser <file2 parser class or path>
    Same as --files_parser, but only applies to file2.
--filter2 <reg expr>
    Same as --filter1, but for <file2>.
--from <reg expr>
    Arguments --from and --to work as a pair and are used to describe benchmark name transformations. --from sets a regular expression (Python re module flavor) defining a pattern to replace with the --to replacement string. Typical use-cases include removing unnecessary parts of benchmark names (for example, Google Benchmark adds some suffixes that might not convey useful information) or "glueing" results of different benchmarks so they become comparable by the tool (for example, you have two benchmarks, the first is 'old_foo' with the old algorithm implementation and the other is 'foo' with a new competing algorithm implementation; to compare their performance against each other, you need to remove the 'old_' prefix from the first benchmark with --from old_ and apply --filter1 and --filter2 to restrict loading to only the corresponding data for the old or new variation). Remember to escape characters that have special meaning for the shell.
--to <replacement string>
    See the help for --from.
--from1 <reg expr>
    Like --from, but for <file1> only.
--to1 <replacement string>
    Like --to, but for <file1> only.
--from2 <reg expr>
    Like --from, but for <file2> only.
--to2 <replacement string>
    Like --to, but for <file2> only.
metrics
    List of metric identifiers to use in tests for each benchmark. Default: real_time. For deterministic algorithms it's highly recommended to measure and use the minimum latency per repetition.
--main_metrics <metric idx> [<metric idx> ...]
    Indexes into the metrics list specifying the main metrics. These are displayed differently, and detected differences cause the script to exit(1). Default: 0.
--method <method id>
    Selects the method of statistical testing. Possible values are: brunnermunzel for the Brunner-Munzel test and mannwhitneyu for the Mann-Whitney U test. Default is brunnermunzel.
--alpha <positive float less than 0.5>
    Sets the statistical significance level (the desired mistake probability level). Default: 0.001. Note! This is the actual probability of a mistake only when all preconditions of the chosen statistical test are met. The most important preconditions, like independence of individual measurements or constancy of the underlying distribution parameters, are almost never met in real-life benchmarking, even on properly quiesced hardware. Real mistake rates are higher.
--bonferroni
    If set, applies a Bonferroni multiple comparisons correction.
Controls how results are rendered
--always_show_pvalues
    If set, always show p-values. By default, p-values are shown only for significant differences. Note that when two sets compare as stochastically same (~), or more precisely, not stochastically less and not stochastically greater (see Stochastic ordering), the p-value shown is the minimum of the two p-values for the less and greater comparisons.
--sample_sizes, --no-sample_sizes
    Controls whether to show the sizes of the data sets used in a test. Default: True
--sample_stats <stat ids> [<stat ids> ...]
    Sets which additional statistics about the compared sample sets to show. Could be any sequence of floats (must be in the range [0,100], designating a percentile value to calculate), or the aliases extremums, median, iqr (interquartile range), std (standard deviation). Shortenings are accepted. Use of std, though, isn't recommended, as standard deviation doesn't really mean anything for most distributions beyond Normal.
--metric_precision <int>=3>
    Total number of digits in a reported metric value. Minimum is 3. Default is 4.
--expect_same
    If set, assumes that the distributions are the same (i.e. the H0 hypothesis is true) and shows some additional statistics useful for ensuring that the benchmark code is stable enough, or the machine is quiesced enough. One good example is when two different sources <file1> and <file2> were made with exactly the same binary running on exactly the same machine in the same state - on a machine with a proper setup, tests shouldn't find a difference.
--colors, --no-colors
    Controls whether the output should be colored. Default: True
--multiline, --no-multiline
    Controls whether benchmark results may span several lines. Default: False
Controls how to export results.
--export_to <path/to/export_file>
    Path to the file to store comparison results to.
--export_fmt <format id>
    Format of the export file. Options are: txt, svg, html. If not set, it is inferred from the --export_to file extension.
--export_light
    If set, uses a light theme instead of the dark one.
The code is pretty trivial, so I'll give only a high-level overview: there are 3 major components interacting in benchstats when it runs in CLI mode:
- one or two source data parsers, depending on the value of --files_parser, --file1_parser or --file2_parser.
  - Built-in parsers are stored in files starting with the parser_ prefix. What goes after the prefix is the parser identifier for the CLI --file?_parser flags.
  - All parsers obey the requirements listed above in the "Custom Data Sources" section.
  - Essentially, a parser could be just a single function returning a dict[str, dict[str, np.ndarray]] mapping from benchmark names to metric names to metric values. It's a class only for the simplicity of writing actual code and validating user-supplied parsers.
- a comparison method compareStats() from benchstats.compare. It:
  - takes two dict[str, dict[str, np.ndarray]] dictionaries with source data plus some control arguments, runs statistical tests on all metric values of the dictionaries that share the same benchmark name (key of the outer dictionary) and metric name (key of the inner dictionary), and then
  - produces a benchstats.compare.CompareStatsResult namedtuple object with comparison results for the benchmarks+metrics common to both source data sets.
- a rendering method renderComparisonResults() from benchstats.render, which:
  - takes the benchstats.compare.CompareStatsResult namedtuple produced by compareStats(), plus various control parameters, and renders it according to the given rules and styles.
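For completeness, here is a rough sketch of calling the API directly instead of going through the CLI. The positional data arguments match the description above, but treat everything else as an assumption: the real compareStats() and renderComparisonResults() may require additional control arguments, so check the package sources for the actual signatures:

import numpy as np
from benchstats.compare import compareStats
from benchstats.render import renderComparisonResults

# made-up data in the dict[str, dict[str, np.ndarray]] shape that parsers return
stats_a = {"bm": {"real_time": np.array([1.02, 0.99, 1.01, 1.00, 0.98])}}
stats_b = {"bm": {"real_time": np.array([0.91, 0.93, 0.90, 0.92, 0.94])}}

results = compareStats(stats_a, stats_b)  # control arguments omitted in this sketch
renderComparisonResults(results)          # rendering options omitted in this sketch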
The project uses the semantic versioning scheme.
Depending on feedback, the project or its individual components might get breaking changes, so prefer version pinning to prevent unexpected breakage. See CHANGELOG.md for details.
There are several things to unpack here.
First, just a reminder: the --alpha CLI argument lets you set any value you want.
Second, p-value=0.05 means 1 mistake per 20 tests on average. That isn't a lot even for a single test, but in the process of optimizing code you might want to run a benchmark dozens of times. 20 runs isn't much, yet with the typical alpha/p-value=0.05 there's already about a 64% chance (1 - 0.95^20) of getting at least one false positive (i.e. the test erroneously detects a change when there's no real difference). Also, it's common practice to have dozens of benchmarks in a single binary. That means dozens of tests will happen in parallel, and just due to bad luck one of them will almost certainly produce a false positive. The --bonferroni flag can help in this case, though it won't help against many re-runs. So just because of these practical considerations it makes sense to use smaller p-values.
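A quick back-of-the-envelope check of these numbers:

# probability of at least one false positive across n independent tests at level alpha
alpha, n_tests = 0.05, 20
print(1 - (1 - alpha) ** n_tests)            # ~0.64

# Bonferroni correction runs each individual test at alpha / n_tests instead
print(1 - (1 - alpha / n_tests) ** n_tests)  # ~0.049, back near the intended alpha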
But there is also a much bigger, fundamental reason to use a smaller p-value/alpha. The problem is that the p-value is really the probability of a mistake if and only if all preconditions of the statistical test are fulfilled. One such requirement, in the case of the Brunner-Munzel and Mann-Whitney U tests, is that the individual measurements are independent of each other. In benchmarking this requirement is (almost?) always broken, because most of the time we aren't using fully and properly quiesced machines. Instead, there's almost always a ton of sh*t happening in the background, interrupting a benchmark here and there, corrupting CPU caches, and so on, in a totally unpredictable but systematic manner. This brings subtle hidden dependencies into the measurements benchmarks take, and that breaks the important precondition of the statistical test. The result is that if you run the same benchmark two times, it's probable (unless you've invested a notable conscious effort in quiescing the machine) that the test will detect differences. This is especially acute on Windows, btw, though Linux can be prone to it too.
So using 0.001 is a good and safe enough compromise to accommodate the fact that the real false positive rate is much higher than the p-value used for the statistical test.
And by the way, don't worry much that lowering the p-value makes the test less sensitive. Because it uses all data samples, it's already sensitive enough to detect differences in latencies that are much smaller than the clock granularity, especially when the number of benchmark repetitions is high.