benchstats is a Python 3.10+ package that lets you compare two sets of benchmark results* using proper statistical tests and produce a readable report on that. This removes guesswork and WAGs from the interpretation of benchmarking results and gives a solid** answer to the question "is algorithm implementation A faster than algorithm implementation B?".
The code that reads input data, performs statistical testing, and then visualizes results is coupled only by shared data types, so each part can easily be used separately if needed (for example, to build a performance regression testing pipeline).
* - while currently benchstats has a built-in parser only for Google Benchmark-produced .json files, it's very simple to make and use a parser for any other data source as long as you know how to interpret it. See the "Custom data sources" section and the --files_parser CLI argument below.
** - the solidity of the answer still depends heavily on how well the underlying hardware and OS are quiesced, and on how inherently non-deterministic the algorithms under benchmark are. Achieving these benchmarking basics isn't covered here; however, benchstats does have a mode that helps to test whether the benchmarking machine is stable enough, see the --expect_same CLI argument below.
- Really, why bother? Can't you just measure and compare the run time of code variant A to the run time of code variant B?
Well, you can, but to see that this is meaningless, just run your benchmark several times. The numbers will never be the same, they'll fluctuate. Which particular number should you use for the comparison then?
- Ok, then why not take an average over a certain number of benchmark executions and compare these averages?
If you repeat that experiment several times, you'll see it suffers from the very same problem as above - the number you get still fluctuates, though usually it fluctuates less. You'll obtain two such numbers for the benchmark alternatives, and one will be bigger than the other. How would you know that the difference is large enough to be attributed to real differences in the code rather than to random non-determinism?
- Ah, but that's simple! In addition to the average value, also compute a standard deviation, and then if the averages are within 1-2-whatever standard deviations of each other, declare that the results aren't different. Otherwise one is significantly less or greater than the other. If you want to be extra sure, go with a "6 sigma golden standard" just like they do in physics! Google Benchmark even computes the standard deviation on its own when you run a benchmark with repetitions! Easy!
... and you'd be even more wrong than if you just compared two random measurements directly. Standard deviation is a measure of variance that is applicable (read: that makes sense to apply) only to random variables drawn from a Normal/Gaussian distribution. Measurements you get when benchmarking are anything but normally distributed, and a value computed as their "standard deviation" just doesn't describe what it's expected to describe - the real variability of the distribution of measurements.
None of the above is a good idea. But what is? Let's step back and think about what we want to learn from comparing benchmark results. We want to learn whether code variant A is faster, i.e. has a smaller run time, than code variant B. But we can't just measure their run times and compare them directly, because each measurement is essentially a random variable. Even if the algorithms are perfectly deterministic, the hardware inherently isn't, so instead of measuring a run time precisely, we always sample a random value from some unknown distribution. Hence, the only correct way to go is to collect many such samples for both A and B, and then compare these sets of samples against each other. If it's more likely to find a smaller value in set A than in set B, then A is faster than B.
But how to do that? It's a well-known problem in mathematics, and several solutions exist, such as the Mann-Whitney U test and the newer, more generic Brunner-Munzel test. benchstats is essentially a convenience wrapper around these tests with fully customizable inputs and results visualization/exporting, letting you get the answers effortlessly.
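Just to illustrate the kind of test involved (this is not benchstats code, and the timing numbers are made up), here is how such a comparison looks when applied directly to two sets of per-repetition timings with SciPy:

import numpy as np
from scipy.stats import brunnermunzel

# per-repetition timings (e.g. in microseconds) of two code variants; made-up numbers
timings_a = np.array([516.5, 502.1, 510.3, 508.8, 512.4, 505.0, 515.7, 509.6, 507.2, 511.9])
timings_b = np.array([502.8, 499.4, 504.1, 501.7, 503.9, 498.6, 505.3, 500.2, 503.0, 501.1])

# two-sided Brunner-Munzel test: a small p-value means it's unlikely that
# the two sets of timings come from the "same" distribution
stat, pvalue = brunnermunzel(timings_a, timings_b)
print(f"p-value = {pvalue:.3g}")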
pip install benchstats
For command-line use, see the "Examples" and "Command Line Interface Reference" sections below. For an API overview, see the corresponding section below.
(I'll use a Google Benchmark-derived binary as the data source for the examples, but remember that any data source can easily be plugged in and used instead. For conciseness, I'll shorten "Google Benchmark" to simply "GBench".)
Note that in all cases, you should use at least 10 repetitions of a benchmark (--benchmark_repetitions=10). Even 2 repetitions would work, but the more data you have, the more reliable the results are, and at least 10 is a reasonable minimum, especially for the Brunner-Munzel test, which is the default. The --benchmark_enable_random_interleaving flag is highly recommended too.
Let's consider two use cases to illustrate how to use benchstats.
For example, you already have a benchmark in the main code branch and want to ensure that a feature branch, which modifies some underlying code, doesn't contain unexpected performance regressions. How to do that?
- Run the benchmark from the main branch with the --benchmark_out=./my_main.json argument to save results to the ./my_main.json file:
benchmark_binary --benchmark_repetitions=10 \
--benchmark_enable_random_interleaving \
--benchmark_out=./my_main.json
- Check out the feature code branch and similarly run the same benchmark, saving results to the ./my_feature.json file.
- Let benchstats analyze the results:
benchstats ./my_main.json ./my_feature.json real_time
The first two positional arguments of the benchstats module are just file paths, and the third argument is a JSON field name that contains a metric value for comparison (real_time is the default field/metric, so it doesn't even have to be specified, btw). GBench typically saves two metrics: real_time (shown as just Time in reports), which is the average time of one iteration, and cpu_time (shown as CPU in reports), which is the average time the CPU spent executing the user-mode code. GBench also supports custom timers, and you can use them here as well. Typically, custom timers have the same field name as the name that was given to the counter (in case of doubt, just peek into the .json file).
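If you aren't sure which metric field names your results file actually contains, a quick way to check (just an illustrative snippet, not part of benchstats) is to list the keys of one benchmark entry:

import json

# print the field names of the first benchmark entry in a GBench results file
with open("./my_main.json") as f:
    data = json.load(f)
print(sorted(data["benchmarks"][0].keys()))  # real_time, cpu_time, custom counters, ...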
If the benchmark_binary had only a single benchmark, the result of the command could look something like this:
benchstats presents results as a table: the first column corresponds to a benchmark name, and all other columns show the results of applying a statistical test to the corresponding metric (yes, you can supply many metrics at once and all of them will be analyzed; try appending cpu_time to the command line above).
Here's what a data cell of the real_time column means: there were 10 samples of alternative A (from the first positional argument, my_main.json) and 10 samples of alternative B (from the second positional argument, my_feature.json), hence the (10 vs 10) text at the end. The mean value (hence (means) in the column title) for alternative A over its 10 samples was 516.5 microseconds, while the mean for B was 502.8 microseconds. The test was using significance level 0.001 (hence the string alpha=0.00100 in the table title), and at that level the test declared that the two sets aren't different from each other (hence the ~ symbol between the mean values). The actual p-value for tests that didn't find a difference is not shown by default; only blank space is allocated to report it (so no text shift happens when some benchmarks have differences and others don't), but you can pass the --always_show_pvalues flag to show it.
To show how a benchstats report could look when a significant difference is found, I'll just rerun the same analysis with a less restrictive significance level (and will use the opportunity to show the effects of some of the above-mentioned flags):
benchstats ./my_main.json ./my_feature.json real_time cpu_time \
--always_show_pvalues --alpha=0.03
The main change is that now the test reports that, according to the real_time metric, the benchmark that generated ./my_main.json ran significantly slower than the benchmark that made ./my_feature.json. Because of that, the script exited with exit code 1, letting one detect that a human intervention might be needed. But what are these "main metrics" that the warning string refers to? Main metrics are metrics in which any detected difference makes the script exit with error code 1. By default, the metric mentioned first in the command-line list of metrics is the main metric, but this can be changed with the --main_metrics argument, which accepts a list of indices into the metrics list. Let's make cpu_time the main metric and see what changes:
benchstats ./my_main.json ./my_feature.json real_time cpu_time \
--always_show_pvalues --alpha=0.03 --main_metrics 1
Now cpu_time is the main metric, and the difference in it isn't significant. real_time is now a secondary metric, which is only reported but doesn't influence the script's exit code.
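Since the exit code carries the verdict for the main metrics, wiring benchstats into an automated regression check is straightforward. A minimal sketch (file names are taken from the example above):

import subprocess
import sys

# exit code 0: no significant difference in main metrics; non-zero: difference (or an error)
proc = subprocess.run(["benchstats", "./my_main.json", "./my_feature.json", "real_time"])
if proc.returncode != 0:
    sys.exit("Possible performance regression detected - please investigate.")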
For example, we are optimizing a function and want to see how a new implementation compares to the old one. We've created an individual benchmark for each implementation and, for convenience, have placed them in the same binary. That implies the benchmark functions must have different names. Now, if we run the benchmark binary, obtain the ./my.json results file, and then follow the approach above blindly, we won't get what we need: benchstats compares a benchmark from source 1 to the benchmark with the same name in source 2, so we'll end up comparing old vs old in one row and new vs new in the other, just like this:
benchstats ./my.json ./my.json real_time
To solve that, benchstats has two features based on the standard Python regular expression engine:

- filtering benchmarks by name when reading a source file, using a regexp in the --filter1 or --filter2 flags correspondingly;
- a benchmark name rewrite feature, based on regexp search and replace. The --from flag (also --from1 for source 1 only and --from2 for source 2 only) sets a regular expression pattern to find in each benchmark name, and an optional --to flag (also --to1 and --to2) sets the replacement string to substitute for the pattern found.
This allows us to make a plan:
- Source 1 will represent the new implementation. For that we'll read ./my.json and take only benchmark names that don't have the Old substring, using the regexp ^(?!.*?Old).*$
- Source 2 will represent the old implementation. We'll read ./my.json and take only benchmarks with names that contain the Old substring. The regexp is the trivial literal string Old.
- For source 2 we'll rewrite benchmark names to remove the Old prefix.
benchstats ./my.json ./my.json real_time \
--filter1='^(?!.*?Old).*$' --filter2=Old \
--from2='(?:Old)(.*?)' --to2=\\1
Note the single ' quotes around the --filter1 and --from2 regexps. They contain special characters that bash will try to interpret, raising an error, if we don't escape them. Single quotes prevent bash expansion (double quotes " don't!) and pass the strings as-is to the program. Also note that the replacement pattern for --to2 has its backslash escaped for that very reason too: \\1 instead of \1. Single quotes are an option here too.
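If you want to sanity-check such regexps before feeding them to benchstats, you can probe them with Python's re module directly. The benchmark names below are made up, and the snippet only illustrates the patterns themselves, not benchstats' internal matching code:

import re

names = ["BM_Foo", "BM_OldFoo"]  # hypothetical benchmark names

# --filter1: keep only names WITHOUT the "Old" substring (the new implementation)
print([n for n in names if re.search(r"^(?!.*?Old).*$", n)])   # ['BM_Foo']

# --filter2 plus --from2/--to2: keep names WITH "Old", then strip the "Old" part
old_names = [n for n in names if re.search("Old", n)]
print([re.sub(r"(?:Old)(.*?)", r"\1", n) for n in old_names])  # ['BM_Foo']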
Now we get a proper result:
While benchstats today supports only Google Benchmark-produced .json files out of the box, it's intended to be able to run statistical tests on any data source. If you know how to parse your custom file format, connect to a remote database, generate data on the fly, or obtain data any other way, you can easily let benchstats run statistical tests on it by writing a custom parser.
Any Python class (a parser) named X that:

- inherits from the benchstats.common.ParserBase class
- can be instantiated with the following 4 parameters:
  - source_id - a string identifier describing a data source within the parser's controlled namespace. It is obtained from the first or the second positional argument of the benchstats module CLI.
    - In the examples above, for the built-in parser_GbenchJson this is a source file name.
  - filter - a string identifier of a filter, used to restrict the set of benchmarks to load. Obtained as the value of the --filter1 or --filter2 CLI flags.
    - For the built-in parser_GbenchJson this is a string with the regular expression to use for filtering benchmark names.
  - metrics - a non-empty list or tuple of string identifiers corresponding to the metrics to read for benchmarks. Obtained from the third and subsequent CLI positional arguments.
  - debug_log - an instance of the benchstats.common.LoggingConsole class that provides a console logging interface to report errors, if any (built-in parsers should also support None and bool values and instantiate a LoggingConsole on their own, but support for this isn't needed for a custom parser)
- has a def getStats(self) method returning a dictionary that maps strings describing benchmark names to dictionaries linking metric names to np.ndarray objects of metric values
- is saved to a file X.py

can be used as a custom data source for the benchstats CLI. Just pass the path to X.py as the value of the --files_parser argument (or --file1_parser or --file2_parser if you want to use different parsers for source 1 and source 2).
Imagine you have a one-column CSV with data on which you want to run a statistical significance test. Save the following to ./myCSV.py:
import numpy as np
from benchstats.common import ParserBase


class myCSV(ParserBase):
    def __init__(self, fpath, filter, metrics, debug_log=None) -> None:
        # load the whole one-column CSV as a 1D array of float64 samples
        self.stats = np.loadtxt(fpath, dtype=np.float64)

    def getStats(self) -> dict[str, dict[str, np.ndarray]]:
        # a single hardcoded benchmark name with a single metric
        return {"bm": {"real_time": self.stats}}
There's always only one benchmark name, bm, hardcoded, and only one metric, real_time, that is read from the CSV, but you get the idea. It's that simple.
Now just run benchstats, passing the path to the parser in the --files_parser flag:
benchstats ./src1.csv ./src2.csv --files_parser ./myCSV.py
If you make a parser that might be useful to other people, please consider adding it to the project's built-in parser set by opening a suggestion thread at https://github.com/Arech/benchstats/issues or by making a PR to the repo.
You can always get the actual CLI help output by passing the --help flag to a benchstats invocation. Note that the --help output takes precedence over the description below.
usage: __main__.py [-h] [--show_debug | --no-show_debug]
[--files_parser <files parser class or path>]
[--file1_parser <file1 parser class or path>]
[--filter1 <reg expr>]
[--file2_parser <file2 parser class or path>]
[--filter2 <reg expr>]
[--from <reg expr>] [--to <replacement string>]
[--from1 <reg expr>] [--to1 <replacement string>]
[--from2 <reg expr>] [--to2 <replacement string>]
[--main_metrics <metric idx> [<metric idx> ...]]
[--method <method id>]
[--alpha <positive float less than 0.5>]
[--bonferroni]
[--always_show_pvalues]
[--sample_sizes | --no-sample_sizes]
[--sample_stats <stat ids> [<stat ids> ...]]
[--metric_precision <int>=3>]
[--expect_same]
[--colors | --no-colors]
[--multiline | --no-multiline]
[--export_to <path/to/export_file>]
[--export_fmt <format id>]
[--export_light]
<path/to/file1> <path/to/file2> [metrics ...]
-h, --help
    show this help message and exit
--show_debug, --no-show_debug
    Shows some additional debugging info. Default: True
Arguments describing how the source data sets are obtained and optionally transformed.
<path/to/file1>
    Path to the first data file with benchmark results. See also the --file1_parser and --filter1 arguments.
<path/to/file2>
    Path to the second data file with benchmark results. See also the --file2_parser and --filter2 arguments.
--files_parser <files parser class or path>
    Sets the files parser class identifier if a built-in parser is used (options are: GbenchJson), or sets a path to a .py file defining a custom parser inherited from the benchstats.common.ParserBase class. The parser class name must be the same as the file name.
--file1_parser <file1 parser class or path>
    Same as --files_parser, but only applies to file1.
--filter1 <reg expr>
    If specified, sets a Python regular expression to select benchmarks by name from <file1>.
--file2_parser <file2 parser class or path>
    Same as --files_parser, but only applies to file2.
--filter2 <reg expr>
    Same as --filter1, but for <file2>.
--from <reg expr>
    Arguments --from and --to work as a pair and are used to describe benchmark name transformations. --from sets a regular expression (Python re module flavor) defining a pattern to replace with the --to replacement string. Typical use-cases include removing unnecessary parts of benchmark names (for example, Google Benchmark adds some suffixes that might not convey useful information) or "glueing" results of different benchmarks so they become comparable by the tool (for example, you have two benchmarks, the first is 'old_foo' with the old algorithm implementation and the other is 'foo' with a new competing algorithm implementation; to compare their performance against each other, you need to remove the 'old_' prefix from the first benchmark with --from old_ and apply --filter1 and --filter2 to restrict loading to only the corresponding data for the old or new variation). Remember to escape characters that have special meaning for the shell.
--to <replacement string>
    See the help for --from.
--from1 <reg expr>
    Like --from, but for <file1> only.
--to1 <replacement string>
    Like --to, but for <file1> only.
--from2 <reg expr>
    Like --from, but for <file2> only.
--to2 <replacement string>
    Like --to, but for <file2> only.
metrics
    List of metric identifiers to use in tests for each benchmark. Default: real_time. For deterministic algorithms it's highly recommended to measure and use the minimum latency per repetition.
--main_metrics <metric idx> [<metric idx> ...]
    Indexes into the metrics list specifying the main metrics. These are displayed differently, and detected differences cause the script to exit(1). Default: 0.
--method <method id>
    Selects the method of statistical testing. Possible values are: brunnermunzel for the Brunner-Munzel test and mannwhitneyu for the Mann-Whitney U test. Default is brunnermunzel.
--alpha <positive float less than 0.5>
    Sets the statistical significance level (the desired mistake probability level). Default: 0.001. Note! This is the actual probability of a mistake only when all preconditions of the chosen statistical test are met. The most important preconditions, like independence of individual measurements or constancy of the underlying distribution parameters, are almost never met in real-life benchmarking, even on properly quiesced hardware. Real mistake rates are higher.
--bonferroni
    If set, applies a Bonferroni multiple comparisons correction.
Controls how results are rendered
--always_show_pvalues
    If set, always show p-values. By default, p-values are shown only for significant differences. Note that when two sets compare as stochastically same (~), or more precisely, not stochastically less and not stochastically greater (see Stochastic ordering), the p-value shown is the minimum of the two p-values for the less and greater comparisons.
--sample_sizes, --no-sample_sizes
    Controls whether to show the sizes of the data sets used in a test. Default: True
--sample_stats <stat ids> [<stat ids> ...]
    Sets which additional statistics about the compared sample sets to show. Could be any sequence of floats (must be in the range [0,100], designating a percentile value to calculate), or the aliases extremums, median, iqr (interquartile range), std (standard deviation). Shortenings are accepted. Use of std, though, isn't recommended, as standard deviation doesn't really mean anything for most distributions beyond Normal.
--metric_precision <int>=3>
    Total number of digits in a reported metric value. Minimum is 3. Default is 4.
--expect_same
    If set, assumes that the distributions are the same (i.e. the H0 hypothesis is true) and shows some additional statistics useful for ensuring that the benchmark code is stable enough, or the machine is quiesced enough. One good example is when two different sources <file1> and <file2> were made with exactly the same binary running on exactly the same machine in the same state - on a machine with a proper setup, tests shouldn't find a difference.
--colors, --no-colors
    Controls whether the output should be colored. Default: True
--multiline, --no-multiline
    Controls whether benchmark results may span several lines. Default: False
Controls how to export results.
--export_to <path/to/export_file>
    Path to the file to store comparison results to.
--export_fmt <format id>
    Format of the export file. Options are: txt, svg, html. If not set, it is inferred from the --export_to file extension.
--export_light
    If set, uses a light theme instead of the dark one.
The code is pretty trivial, so I'll give only a high-level overview: there are 3 major components interacting in benchstats when it runs in CLI mode:
- one or two source data parsers, depending on the value of --files_parser, --file1_parser or --file2_parser.
  - Built-in parsers are stored in files starting with the parser_ prefix. What goes after the prefix is the parser identifier for the CLI --file?_parser flags.
  - All parsers obey the requirements listed above in the "Custom Data Sources" section.
  - Essentially, a parser could be just a single function returning a dict[str, dict[str, np.ndarray]] mapping from benchmark names to metric names to metric values. It's a class only for the simplicity of writing actual code and validating user-supplied parsers.
- a comparison method compareStats() from benchstats.compare. It:
  - takes two dict[str, dict[str, np.ndarray]] dictionaries with source data plus some control arguments, runs statistical tests on all metric values of the dictionaries that share the same benchmark name (key of the outer dictionary) and metric name (key of the inner dictionary), and then
  - produces a benchstats.compare.CompareStatsResult namedtuple object with comparison results for the benchmarks+metrics common to both source data sets.
- a rendering method renderComparisonResults() from benchstats.render, which:
  - takes the benchstats.compare.CompareStatsResult namedtuple produced by compareStats(), plus various control parameters, and renders it according to the given rules and styles.
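For completeness, here is a rough sketch of calling the API directly instead of going through the CLI. The positional data arguments match the description above, but treat everything else as an assumption: the real compareStats() and renderComparisonResults() may require additional control arguments, so check the package sources for the actual signatures:

import numpy as np
from benchstats.compare import compareStats
from benchstats.render import renderComparisonResults

# made-up data in the dict[str, dict[str, np.ndarray]] shape that parsers return
stats_a = {"bm": {"real_time": np.array([1.02, 0.99, 1.01, 1.00, 0.98])}}
stats_b = {"bm": {"real_time": np.array([0.91, 0.93, 0.90, 0.92, 0.94])}}

results = compareStats(stats_a, stats_b)  # control arguments omitted in this sketch
renderComparisonResults(results)          # rendering options omitted in this sketch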
The project uses the semantic versioning scheme.
Depending on feedback, the project or its individual components might get breaking changes, so prefer version pinning to prevent unexpected breakage. See CHANGELOG.md for details.
There are several things to unpack here.
First, just a reminder: the --alpha CLI argument lets you set any value you want.
Second, p-value=0.05 means 1 mistake per 20 tests on average. That isn't a lot even for a single test, but in the process of optimizing code you might want to run a benchmark dozens of times. 20 runs isn't much, yet with the typical alpha/p-value=0.05 there's already about a 64% chance (1 - 0.95^20) of getting at least one false positive (i.e. the test erroneously detects a change when there's no real difference). Also, it's common practice to have dozens of benchmarks in a single binary. That means dozens of tests will happen in parallel, and just due to bad luck one of them will almost certainly produce a false positive. The --bonferroni flag can help in this case, though it won't help against many re-runs. So just because of these practical considerations it makes sense to use smaller p-values.
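A quick back-of-the-envelope check of these numbers:

# probability of at least one false positive across n independent tests at level alpha
alpha, n_tests = 0.05, 20
print(1 - (1 - alpha) ** n_tests)            # ~0.64

# Bonferroni correction runs each individual test at alpha / n_tests instead
print(1 - (1 - alpha / n_tests) ** n_tests)  # ~0.049, back near the intended alpha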
But there is also a much bigger, fundamental reason to use a smaller p-value/alpha. The problem is that the p-value is really the probability of a mistake if and only if all preconditions of the statistical test are fulfilled. One such requirement, in the case of the Brunner-Munzel and Mann-Whitney U tests, is that the individual measurements are independent of each other. In benchmarking this requirement is (almost?) always broken, because most of the time we aren't using fully and properly quiesced machines. Instead, there's almost always a ton of sh*t happening in the background, interrupting a benchmark here and there, corrupting CPU caches, and so on, in a totally unpredictable but systematic manner. This brings subtle hidden dependencies into the measurements benchmarks take, and that breaks the important precondition of the statistical test. The result is that if you run the same benchmark two times, it's probable (unless you've invested a notable conscious effort in quiescing the machine) that the test will detect differences. This is especially acute on Windows, btw, though Linux can be prone to it too.
So using 0.001 is a good and safe enough compromise to accommodate the fact that the real false positive rate is much higher than the p-value used for the statistical test.
And by the way, don't worry much that lowering the p-value makes the test less sensitive. Because it uses all data samples, it's already sensitive enough to detect differences in latencies that are much smaller than the clock granularity, especially when the number of benchmark repetitions is high.