Taming variance of UnixBench results when comparing systems #23

Open
meteorfox opened this issue Jun 6, 2015 · 4 comments

@meteorfox

UnixBench has been shown to be very sensitive to compiler versions [1], is built with outdated compiler optimization flags [2], and includes hacks to avoid dead-code elimination by the compiler [3]. It seems one of the original intentions of UnixBench was to measure compiler performance [4] as part of an overall system performance score, reduced to a single metric [5].
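As a hypothetical illustration of the dead-code-elimination point (this is not UnixBench code): an optimizing compiler may delete a timed loop entirely if its result is never used, so benchmark loops generally need some kind of "sink" for their results, for example a volatile store:

```c
/* Minimal sketch, not UnixBench code: keeping benchmark work alive
 * under optimization.  Without a sink, the compiler may remove the
 * loop as dead code and the timed region measures nothing. */
#include <stdio.h>

volatile double sink;   /* compiler must assume this store is observable */

int main(void)
{
    double acc = 0.0;

    /* Without the volatile store below, an optimizer is free to
     * eliminate this entire loop, since acc would be a dead value. */
    for (long i = 1; i <= 10 * 1000 * 1000; i++)
        acc += 1.0 / (double)i;

    sink = acc;                     /* defeats dead-code elimination */
    printf("done (acc = %f)\n", acc);
    return 0;
}
```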

Today, UnixBench is still being used to compare performance between systems, but most people skip the warnings and caveats in the README.md and USAGE files shipped with the benchmark, where the authors themselves warn about the pitfalls of interpreting results across different systems [6]. Even more worrying is how this benchmark is promoted [7] as 'the' single metric to look at when comparing systems, even when those systems use different OSes, compilers, virtualization technologies, and even different architectures.

A lot of these problems seem to stem from the variability introduced by the compiler and by different versions of the libraries linked by the benchmark.

Since UnixBench today is mostly used to compare performance across different infrastructure providers, I propose, as a way to reduce the variability introduced by the factors mentioned above, distributing the benchmarks included in UnixBench as statically linked, binary-reproducible builds for each major architecture, which can be verified by means of a hash.

The benefits: because the binaries would be distributed pre-compiled, compiler effects would be minimized; because they would be statically linked, different versions of dynamically linked libraries could not introduce variability; and because the builds would be binary-reproducible, we could cross-verify, by hashing the binaries, that results being compared came from identical copies of the benchmark.
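As a rough sketch of that verification step (purely illustrative; in practice running `sha256sum` on the distributed binary does the same thing), a small C program using OpenSSL could print the digest of a benchmark binary for comparison against a published reference hash:

```c
/* Hypothetical sketch of the "verify the benchmark binary by hash"
 * step proposed above -- not part of UnixBench.  Prints the SHA-256
 * of a file.  Build with: cc verify.c -lcrypto */
#include <stdio.h>
#include <openssl/evp.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <benchmark-binary>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);

    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        EVP_DigestUpdate(ctx, buf, n);      /* hash the file contents */
    fclose(f);

    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int mdlen;
    EVP_DigestFinal_ex(ctx, md, &mdlen);
    EVP_MD_CTX_free(ctx);

    for (unsigned int i = 0; i < mdlen; i++)
        printf("%02x", md[i]);              /* digest to compare against the reference */
    printf("  %s\n", argv[1]);
    return 0;
}
```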

What are your thoughts? Do you think this is a bad idea?

Thanks,
Carlos

[1] Compilers Love Messing With Benchmarks
[2] Issue #17
[3] Issue #10
[4] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/USAGE#L351-L358
[5] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/USAGE#L174
[6] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/USAGE#L348-L349
[7] http://serverbear.com/benchmarks/vps

@centminmod

Yeah, definitely there's a huge difference, i.e. between Xen and KVM systems on the same hardware.

@meteorfox
Author

@centminmod About the Xen vs. KVM differences: about six months ago I did some comparisons and dug in a little with linux-perf and perf_events, and it seems, at least from what I saw, that Xen gives a lower UnixBench score because of higher overhead in the fork() and exec() syscalls [1][2] compared to KVM. At least that's what I experienced.

[1] http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1815
[2] http://events.linuxfoundation.org/sites/events/files/slides/pvh-linux-collab-summit_0.pdf
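For anyone wanting to check this kind of difference outside UnixBench, here is a minimal, hypothetical C sketch (not a UnixBench test) that times fork()+exec()+wait() in isolation; running it on otherwise identical Xen and KVM guests should expose the process-creation overhead independently of the rest of the score:

```c
/* Minimal sketch: measure fork()+exec()+wait() cost in isolation.
 * The iteration count and /bin/true target are arbitrary choices. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const int iterations = 1000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: exec a trivial program that exits immediately */
            execl("/bin/true", "true", (char *)NULL);
            _exit(127);               /* only reached if execl failed */
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);    /* parent: wait for the child */
        } else {
            perror("fork");
            exit(1);
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d fork+exec+wait cycles in %.3f s (%.1f us each)\n",
           iterations, secs, secs / iterations * 1e6);
    return 0;
}
```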

@centminmod

@meteorfox thanks for that info. I also found that different kernel versions affected the scores/subtests: https://community.centminmod.com/posts/6190/. On a dedicated, non-virtualized CentOS server, switching between the 2.6.32.* and 3.x kernels resulted in dramatically lower scores on the exact same hardware.

@gstrauss
Collaborator

gstrauss commented Sep 18, 2016

@meteorfox: you have raised lots of valid concerns with benchmarking in general, and there are even more concerns and caveats that have not been mentioned (but let's not try to do that here).

Some of the concerns you raised above have hopefully been addressed with Makefile updates earlier this year, which use more consistent compiler optimizations native to the platform on which UnixBench is being built. (Of course, distros would have to rebuild UnixBench optimized for the lowest common denominator of supported hardware, which definitely affects micro-benchmark results.)

Now, please forgive me expanding the issues you raised into a more general conversation.

Summarizing things into a single number is fraught with dangers of losing relevant contextual information. This applies to any micro-benchmark. If a benchmark (micro or not) does not at least somewhat accurately reflect or model your typical workload (or worst case scenarios if you need that), then it is not very useful to you, regardless of how the benchmark performs on various systems.

Separately, benchmarks are affected by hardware, virtualization, compiler version and optimizations, and more. Additionally, benchmarking runs are susceptible to caching effects and other concurrent usage on the same machine.

So what is UnixBench nowadays? UnixBench is probably becoming less relevant, since it includes no web server benchmarks and no benchmarks of interpreted languages (Perl, Python, Node.js, etc.). Also, some of UnixBench's micro-benchmarks no longer measure what they were originally purported to measure. For example, #8 notes that the sqrt() test is now actually testing fork()/exec() performance rather than sqrt(); #19 notes that the fscopy test is now often mostly testing CPU caches, not disk access; and #13 notes that the 3dinfo program is not available on most modern systems and has not been for some time.

Is UnixBench still relevant for comparing its numbers against systems that ran UnixBench over a decade ago? (Rhetorical question: was comparing these benchmark numbers against old results ever relevant?) Should #8, #13, and #19 be updated with more relevant benchmarking parameters/programs for test runs on modern systems? Or should UnixBench be left as-is, with the documentation noting that these tests no longer produce results that are useful for comparison with other modern systems?

So, again, what is UnixBench nowadays? (This is an open question -- I don't have an answer)
