Taming variance of UnixBench results when comparing systems #23

Open
meteorfox opened this issue Jun 6, 2015 · 4 comments

@meteorfox

UnixBench has been shown to be very sensitive to compiler versions [1], is built with outdated compiler optimization flags [2], and includes hacks to avoid dead-code elimination by the compiler [3]. It seems one of the original intentions of UnixBench was to measure compiler performance [4] as part of an overall system performance score, reduced to a single metric [5].
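As a hypothetical illustration of the dead-code-elimination point (this is not UnixBench code): an optimizing compiler may delete a timed loop entirely if its result is never used, so benchmark loops generally need some kind of "sink" for their results, for example a volatile store:

```c
/* Minimal sketch, not UnixBench code: keeping benchmark work alive
 * under optimization.  Without a sink, the compiler may remove the
 * loop as dead code and the timed region measures nothing. */
#include <stdio.h>

volatile double sink;   /* compiler must assume this store is observable */

int main(void)
{
    double acc = 0.0;

    /* Without the volatile store below, an optimizer is free to
     * eliminate this entire loop, since acc would be a dead value. */
    for (long i = 1; i <= 10 * 1000 * 1000; i++)
        acc += 1.0 / (double)i;

    sink = acc;                     /* defeats dead-code elimination */
    printf("done (acc = %f)\n", acc);
    return 0;
}
```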

Today, UnixBench is still being used to compare performance between systems, but most people skip the warnings and caveats in the README.md and USAGE files shipped with the benchmark, where the authors themselves warn about the pitfalls of interpreting results across different systems [6]. Even more worrying is how this benchmark is promoted [7] as 'the' single metric to look at when comparing systems, even when those systems use different OSes, compilers, virtualization technologies, and even different architectures.

A lot of these problems seem to stem from the variability introduced by the compiler and by different versions of the libraries linked by the benchmark.

Since UnixBench today is mostly used to compare performance across different infrastructure providers, I propose, as a way to reduce the variability introduced by the factors mentioned above, distributing the benchmarks included in UnixBench as statically linked, binary-reproducible builds for each major architecture, which can be verified by means of a hash.

The benefits: because the binaries would be distributed pre-compiled, compiler effects would be minimized; because they would be statically linked, different versions of dynamically linked libraries could not introduce variability; and because the builds would be binary-reproducible, we could cross-verify, by hashing the binaries, that results being compared came from identical copies of the benchmark.
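As a rough sketch of that verification step (purely illustrative; in practice running `sha256sum` on the distributed binary does the same thing), a small C program using OpenSSL could print the digest of a benchmark binary for comparison against a published reference hash:

```c
/* Hypothetical sketch of the "verify the benchmark binary by hash"
 * step proposed above -- not part of UnixBench.  Prints the SHA-256
 * of a file.  Build with: cc verify.c -lcrypto */
#include <stdio.h>
#include <openssl/evp.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <benchmark-binary>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);

    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        EVP_DigestUpdate(ctx, buf, n);      /* hash the file contents */
    fclose(f);

    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int mdlen;
    EVP_DigestFinal_ex(ctx, md, &mdlen);
    EVP_MD_CTX_free(ctx);

    for (unsigned int i = 0; i < mdlen; i++)
        printf("%02x", md[i]);              /* digest to compare against the reference */
    printf("  %s\n", argv[1]);
    return 0;
}
```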

What are your thoughts? Do you think this is a bad idea?

Thanks,
Carlos

[1] Compilers Love Messing With Benchmarks
[2] Issue #17
[3] Issue #10
[4] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/USAGE#L351-L358
[5] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/USAGE#L174
[6] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/USAGE#L348-L349
[7] http://serverbear.com/benchmarks/vps

@centminmod

Yeah, definitely there's a huge difference, i.e. between Xen and KVM systems on the same hardware.

@meteorfox
Author

@centminmod About the Xen vs. KVM differences: about six months ago I did some comparisons and dug in a little with linux-perf and perf_events, and it seems, at least from what I saw, that Xen gives a lower UnixBench score because of higher overhead in the fork() and exec() syscalls [1][2] compared to KVM. At least that's what I experienced.

[1] http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1815
[2] http://events.linuxfoundation.org/sites/events/files/slides/pvh-linux-collab-summit_0.pdf
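For anyone wanting to check this kind of difference outside UnixBench, here is a minimal, hypothetical C sketch (not a UnixBench test) that times fork()+exec()+wait() in isolation; running it on otherwise identical Xen and KVM guests should expose the process-creation overhead independently of the rest of the score:

```c
/* Minimal sketch: measure fork()+exec()+wait() cost in isolation.
 * The iteration count and /bin/true target are arbitrary choices. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const int iterations = 1000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: exec a trivial program that exits immediately */
            execl("/bin/true", "true", (char *)NULL);
            _exit(127);               /* only reached if execl failed */
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);    /* parent: wait for the child */
        } else {
            perror("fork");
            exit(1);
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d fork+exec+wait cycles in %.3f s (%.1f us each)\n",
           iterations, secs, secs / iterations * 1e6);
    return 0;
}
```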

@centminmod

@meteorfox thanks for that info. I also found that different kernel versions affected the scores/subtests: https://community.centminmod.com/posts/6190/. On a dedicated, non-virtualized CentOS server, switching between the 2.6.32.* and 3.x kernels resulted in dramatically lower scores on the exact same hardware.

@gstrauss
Collaborator

gstrauss commented Sep 18, 2016

@meteorfox: you have raised lots of valid concerns with benchmarking in general, and there are even more concerns and caveats that have not been mentioned (but let's not try to do that here).

Some of the concerns you raised above have hopefully been addressed with Makefile updates earlier this year, which use more consistent compiler optimizations native to the platform on which UnixBench is being built. (Of course, distros would have to rebuild UnixBench optimized for the lowest common denominator of supported hardware, which definitely affects micro-benchmark results.)

Now, please forgive me expanding the issues you raised into a more general conversation.

Summarizing things into a single number is fraught with dangers of losing relevant contextual information. This applies to any micro-benchmark. If a benchmark (micro or not) does not at least somewhat accurately reflect or model your typical workload (or worst case scenarios if you need that), then it is not very useful to you, regardless of how the benchmark performs on various systems.

Separately, benchmarks are affected by hardware, virtualization, compiler version and optimizations, and more. Additionally, benchmarking runs are susceptible to caching effects and other concurrent usage on the same machine.

So what is UnixBench nowadays? UnixBench is probably becoming less relevant, since it includes no web server benchmarks and no benchmarks of interpreted languages (Perl, Python, Node.js, etc.). Also, some of UnixBench's micro-benchmarks no longer measure what they were originally purported to measure. For example, #8 notes that the sqrt() test is now actually testing fork()/exec() performance rather than sqrt(); #19 notes that the fscopy test is now often mostly testing CPU caches, not disk access; and #13 notes that the 3dinfo program is not available on most modern systems and has not been for some time.

Is UnixBench still relevant for comparing its numbers against systems that ran UnixBench over a decade ago? (Rhetorical question: was comparing these benchmark numbers against old results ever relevant?) Should #8, #13, and #19 be updated with more relevant benchmarking parameters/programs for test runs on modern systems? Or should UnixBench be left as-is, with the documentation noting that these tests no longer produce results that are useful for comparison with other modern systems?

So, again, what is UnixBench nowadays? (This is an open question -- I don't have an answer)
