
Add JMH microbenchmarks #35

Merged
merged 4 commits into from
Aug 26, 2019

Conversation


@szpak szpak commented Jun 12, 2019

General description

As discussed with @bsideup at GeeCON, I benchmarked the impact of the BlockHound integration to check how (un)safe it could be to enable it in production. I started with the following cases:

  • non-blocking Mono call (simple block() on the single() scheduler)
  • non-blocking Flux call (100K elements, simple blockLast() on the single() scheduler)
  • blocking Mono call (simple block() on the single() scheduler)
  • blocking Flux call (100K elements, simple blockLast() on the single() scheduler)

As a baseline for each case I used a variant without the integration enabled (installed). As I haven't found a sensible reason to test a blocking Flux which throws an exception and aborts execution, I applied a custom blockingMethodCallback() to simulate a warning printed on every blocking method detection (e.g. during execution in production).

In my tests, I took the worst-case scenario, where there is no custom logic in the Flux element processing. In many cases the relative BlockHound impact will be lower.

All benchmarks are performed in a single-threaded environment. This could potentially be extended to process multiple streams in multiple threads, but then more external factors could make the results less reliable.

A Reactor-related question

Speaking of reliability and reproducibility, I've occasionally been getting some variation in results - even for Reactor itself (without BlockHound). For example, for the naive baseline benchmark:

    @Benchmark
    public Integer baselineNonBlockingFluxSubscribedOnSingle() {
        return numbers.hide().subscribeOn(single()).blockLast(Duration.ofSeconds(1));
    }

with numbers being a non-blocking Flux (numbers = Flux.range(0, 100_000);), I receive results with quite a large standard deviation (~20% of the average). This is all with BlockHound disabled. I chose single() to play with just one thread.

Result "reactor.blockhound.BlockHoundNonBlockingBenchmark.baselineNonBlockingFluxSubscribedOnSingle":
  1140.786 ±(99.9%) 233.514 us/op [Average]
  (min, avg, max) = (981.872, 1140.786, 1448.813), stdev = 218.429
  CI (99.9%): [907.272, 1374.299] (assumes normal distribution)

Benchmark                                                                 Mode  Cnt     Score     Error  Units
BlockHoundNonBlockingBenchmark.baselineNonBlockingFluxSubscribedOnSingle  avgt   15  1140.786 ± 233.514  us/op

It flattens nicely with multiple JVM forks, but still, having:

# Fork: 1 of 3
Iteration   1: 1927.348 us/op
Iteration   2: 1906.449 us/op
Iteration   3: 1922.533 us/op
...

# Fork: 2 of 3
Iteration   1: 1310.879 us/op
Iteration   2: 1313.213 us/op
Iteration   3: 1329.674 us/op

for the same operation puzzles me a little. It could be an effect of my environment, but maybe you have an idea what in Reactor itself could make the forks so distinct? Maybe there is something I could tune in the scheduling (or elsewhere) to make the results more similar?

Summary

To conclude, I've got some raw results, but before posting them here and drawing conclusions it would be good to test more cases where you suspect the BlockHound impact could be high (and also to repeat my tests with a higher number of iterations/forks). @bsideup what would you like to test?

Btw, the tests can be executed with gw jmh or (with some minimal configuration tuning) with the JMH plugin for IDEA (less accurate, but more convenient when working on test cases).

bsideup commented Jun 22, 2019

Hi @szpak,

Thanks a lot for finding time to work on it! 👍

Since BlockHound works with any JVM code, not just Reactor, I believe Reactor is not necessary here. We should measure the overhead of calling a blocking method in the following scenarios:

  1. blocking-friendly thread
  2. non-blocking thread, no "markers" in the stacktrace (aka allowBlockingCallsInside)
  3. non-blocking thread, but allowBlockingCallsInside in the stack
  4. non-blocking thread, but disallowBlockingCallsInside in the stack

Also, make sure to override the callback (it throws an exception by default; we don't need that in the benchmarks).
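The thread-based decision behind these scenarios can be illustrated with a toy, stdlib-only sketch. This is not BlockHound's actual mechanism (BlockHound instruments bytecode via an agent rather than requiring manual checks), and all names here are hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class BlockingCheckDemo {
    static final AtomicInteger detections = new AtomicInteger();

    // Hypothetical marker type standing in for a "non-blocking" thread.
    static class NonBlockingThread extends Thread {
        NonBlockingThread(Runnable r) { super(r); }
    }

    // Toy stand-in for the check injected before blocking methods:
    // a cheap no-op on blocking-friendly threads, a callback otherwise.
    static void checkBlocking() {
        if (Thread.currentThread() instanceof NonBlockingThread) {
            detections.incrementAndGet(); // overridden callback: count, don't throw
        }
    }

    static void blockingCall() throws InterruptedException {
        checkBlocking();
        Thread.sleep(0); // a "no-op" blocking method
    }

    static int runDemo() throws Exception {
        detections.set(0);
        blockingCall(); // current (blocking-friendly) thread: no detection
        Thread t = new NonBlockingThread(() -> {
            try {
                blockingCall(); // non-blocking thread: detection fires
            } catch (InterruptedException ignored) {
            }
        });
        t.start();
        t.join();
        return detections.get(); // 1
    }

    public static void main(String[] args) throws Exception {
        System.out.println("detections=" + runDemo());
    }
}
```

The benchmarks then measure how expensive that check (and the callback path) is relative to the bare blocking call.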

@bsideup bsideup added this to the 1.0.0.M5 milestone Jun 22, 2019
@bsideup bsideup added area/java-agent area/native-agent type/chore A task not related to code (build, formatting, process, ...) labels Jun 22, 2019
@bsideup bsideup self-requested a review June 22, 2019 19:29
szpak commented Jun 22, 2019

Thanks @bsideup for your comments. I will follow up on them after Devoxx Poland (which starts next week).

No Reactor code. Dummy blocking calls. And more.
szpak commented Jul 6, 2019

I've reworked the benchmarks to cover the mentioned cases. I removed all references to Reactor and changed the (potentially) blocking yield() to a custom non-blocking method marked as blocking (due to #39 I have to use an existing method from the JDK).

Because of #38 it is problematic to have only non-blocking calls with BlockHound installed from a thread that should be used for non-blocking operations. Without ThreadPoolExecutor, reusing one thread across all measured calls is problematic.
For the same reason the baseline for non-blocking calls is not fully reliable, as there is a hidden blocking call. Nevertheless, it shouldn't have a real impact.

I ran the benchmarks with 5 iterations (+ 3 warmup iterations) in 3 forks. The full results are there. The summary is here:

Benchmark                                                          Mode  Cnt  Score   Error  Units
BHBlockingAllowedThreadBenchmark.baselineBlockingCall              avgt   15  5.124 ± 0.039  us/op
BHBlockingAllowedThreadBenchmark.measureBlockingCall               avgt   15  5.222 ± 0.078  us/op
BHBlockingAllowedThreadBenchmark.baselineNonBlockingCall           avgt   15  5.070 ± 0.045  us/op
BHBlockingAllowedThreadBenchmark.measureNonBlockingCall            avgt   15  5.142 ± 0.076  us/op
BHBlockingNotAllowedThreadBenchmark.baselineBlockingCall           avgt   15  5.102 ± 0.067  us/op
BHBlockingNotAllowedThreadBenchmark.measureAllowedBlockingCall     avgt   15  5.853 ± 0.184  us/op
BHBlockingNotAllowedThreadBenchmark.measureDisallowedBlockingCall  avgt   15  6.334 ± 0.340  us/op

(I rearranged the order in the output and shortened BlockHound to BH for readability.)

As expected, the impact on "blocking-friendly" threads is negligible (first 4 lines). The difference between baselineBlockingCall and baselineNonBlockingCall in BHBlockingAllowedThreadBenchmark itself is caused by a slightly simpler call.
In "non-blocking" threads the impact is a little bigger - more call-related operations are performed. The disallowed blocking call has a bigger impact, which can be caused by the extra handlers involved. The check on non-blocking calls, not performed due to #38, would probably have a slightly smaller impact than "allowed blocking calls", as a short path should be used.
I could simulate that case by commenting out Unsafe.park(), but it would be problematic to reproduce in the future - maybe you know a better way (without writing my own thread executor :) ).

We could also run some tests on a more complex call to check how that 10-20% overhead (in the negative case) behaves as a function of the number of accompanying non-blocking calls.

Nevertheless, please write first how you see those use cases and the numbers received.

Btw, when merging, I propose to merge without squashing to keep a reference to the first version with the Reactor code, just in case we would like to come back to that idea in the future.

@bsideup bsideup removed this from the 1.0.0.M5 milestone Jul 29, 2019
bsideup commented Aug 2, 2019

Hi @szpak,

Thanks a lot for your work! I pushed a change to your branch that simplifies the benchmarks and focuses on the actual overhead - I hope you don't mind :) I would appreciate your review in case I accidentally broke something.

I changed the "blocking method" to Thread.sleep(0). sleep(1) is a subject to the precision and accuracy of system timers and schedulers, hence zero. It is the easiest "no-op" blocking method in the JVM that does call the native routine and will not be inlined by JIT.

Here are the results I got:

Benchmark                             Mode  Cnt  Score   Error  Units
baselineNonBlockingCall               avgt   15  0,066 ± 0,001  us/op
measureNonBlockingCall                avgt   15  0,066 ± 0,001  us/op

baselineBlockingCallInBlockingThread  avgt   15  0,489 ± 0,006  us/op
measureBlockingCallInBlockingThread   avgt   15  0,807 ± 0,011  us/op

baselineAllowedBlockingCall           avgt   15  0,495 ± 0,015  us/op
measureAllowedBlockingCall            avgt   15  2,836 ± 0,016  us/op

While the percentages may look scary, I would like to focus on the differences instead :)

We have 3 scenarios here:

1. non-blocking call

when the executed method is not marked as blocking.

As we can see from the numbers, there is zero overhead from BlockHound on non-blocking methods. Expected, but good to verify.

2. blocking call in a blocking-friendly thread

a regular blocking call inside a thread that is not marked as non-blocking.

This one is not a problem, yet it has to be intercepted, hence the overhead. The benchmark reports an overhead of ~0.3us per blocking method invocation. Given that the call is blocking, it will most probably block for a duration measured in milliseconds, which means the total overhead is low.

3. blocking call inside a non-blocking thread but marked as "allowed"

The heaviest one. A blocking call is detected but whitelisted (e.g. some logging). The difference is big (~5.8 times), but again, the overhead is ~2.3us per blocking call.
The idea of whitelisting is to be able to ignore some blocking calls that are "accepted", but it also means that such blocking calls should happen rarely and not affect the overall performance. 2us for such calls is IMO an acceptable overhead.


The bottom line is:

  1. BlockHound does add overhead (mostly due to the fact that it is a native agent and comes with the cost of JNI), but only to blocking methods.
  2. Such methods are usually measured in milliseconds (IO, sleeping, awaiting, etc.), and BlockHound only adds a few microseconds (1 millisecond = 1000 microseconds) to them.
  3. Adding BlockHound should not dramatically reduce performance, unless performance is measured at the microsecond level. Since the primary goal of BlockHound is to detect blocking calls during development & testing, this is more than acceptable. The observed overhead is even acceptable for staging environments.
  4. Production usage is possible, but should be considered carefully for performance-critical services. Note that this is not a regular use case of BlockHound.
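The "few microseconds on a millisecond-scale call" argument above can be put in numbers, taking the ~2.3us measured overhead against a hypothetical 1 ms blocking call:

```java
public class OverheadMath {
    // Relative overhead of adding `overheadUs` microseconds to a call that
    // otherwise takes `callMs` milliseconds (1 ms = 1000 us).
    static double relativeOverheadPercent(double overheadUs, double callMs) {
        return overheadUs / (callMs * 1000.0) * 100.0;
    }

    public static void main(String[] args) {
        // ~2.3 us of BlockHound overhead on a 1 ms blocking call
        System.out.printf("%.2f%%%n", relativeOverheadPercent(2.3, 1.0)); // prints 0.23%
    }
}
```

So even the heaviest (whitelisted) path adds well under 1% to a 1 ms blocking call; for typical IO measured in tens of milliseconds the relative cost shrinks further.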

P.S. after running the benchmarks with async-profiler, I identified a few possible optimizations in the native agent we may consider (e.g. fail fast if the native method is never marked as "allowed in some cases"). This will require some API changes and should be considered separately.

@bsideup bsideup added the for/user-attention This issue needs user attention (feedback, rework, etc...) label Aug 2, 2019
    }

    @Benchmark
    public void measureNonBlockingCall(BlockHoundInstalledState state) throws Exception {
szpak (Contributor Author) commented on the diff:
According to the JMH documentation, benchmark method calls that do not return a value are easier for the JVM to optimize away (via inlining and dead-code elimination). Therefore, my tests had a return value or used a Blackhole to prevent that. However, I haven't analyzed the calls using threads or sleep(0)/incrementAndGet() with a profiler, so maybe you were able to ensure that it doesn't happen. A similar situation with optimizations occurs when running in a loop.

In general, it's nice that you were able to simplify the benchmarks themselves.
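The pattern under discussion, sketched outside JMH with a plain volatile field playing the role of JMH's Blackhole (all names here are illustrative, not from the PR's code):

```java
public class DeadCodeGuard {
    // A volatile sink: writing the result to it makes the value observable,
    // so the JIT cannot treat the computation as dead code.
    static volatile int sink;

    static int compute(int x) { return x * x + 1; }

    // Risky in a benchmark: result discarded, the JIT may eliminate compute().
    static void measuredWithoutResult() { compute(21); }

    // Safe: the result escapes via the sink (in JMH, return it from the
    // @Benchmark method or pass it to a Blackhole instead).
    static void measuredWithSink() { sink = compute(21); }

    public static void main(String[] args) {
        measuredWithSink();
        System.out.println(sink); // 442
    }
}
```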

bsideup (Contributor) replied:
The benchmark starts a Thread now, hence the inlining cannot affect the results, I guess.

@bsideup bsideup merged commit fcc28ad into reactor:master Aug 26, 2019
bsideup commented Aug 26, 2019

@szpak I merged the PR 🎉 Thanks a lot for the initial implementation and the feedback 👍

While evaluating the benchmarks, I already discovered a few easy ways to reduce the overhead and will provide follow-up PRs with the optimizations 💯
