Fix memoryleak in FluxOnAssembly #3941

travishaagen · 2024-11-22T23:23:29Z

This PR contains one possible fix for a memory leak discovered in FluxOnAssembly.

My colleague, @foo4u discovered a memory leak in one of our applications, and I believe that it is caused by some code in FluxOnAssembly. The nodesPerId HashMap expects the key to be derived from calling hashCode() on an object, but inside FluxOnAssembly.OnAssemblyException#add(Publisher<?>, Publisher<?>, String, String) the debugger is showing that an AssemblyOp implementation is the object type. None of the AssemblyOp implementations implement hashCode(), so we're getting the integer returned by Object.hashCode() instead.

What I'm observing in a Kotlin Spring Webflux project is that every time we make a HTTP request to a Controller, this code path is being hit three times, which results in three objects being added to the HashMap.

FluxOnAssembly hasn't been changed in three years, and to be honest, the nodesPerId and how it is used in this class is odd. It might only be used to increment a counter, but it also is protecting a synchronized block when the root object is accessed. root has a HashSet that also contains the same objects as nodesPerId, and they are traversed in the getMessage() function.

All the AssemblyOp implementations override toString() with something that looks unique, and was probably what the original author intended to base the nodesPerId key on.

Here is an example of three keys, from this PR, attributed to a single HTTP request:

checkpoint("Handler com.xyz.MyController#myFunc(String, String, String, String, String, ServerWebExchange, Continuation) [DispatcherHandler]")

checkpoint("com.xyz.ShutdownWebFilter [DefaultWebFilterChain]")

checkpoint("HTTP POST "/endpoint" [ExceptionHandlingWebHandler]")

Willing to discuss!

Also! There are some other uses of System.identityHashCode in the codebase, but I don't have insights into them at the moment.

travishaagen · 2024-11-23T17:17:20Z

The leak only happens when exceptions are thrown. The reasons why one of our applications are so heavily affected are that:

It gets a lot of traffic
It uses exceptions for flow control in some places

The exceptions are static with pre-allocated stack traces as described in this blog (creator of Netty). This technique is used by Netty as well, and is fully supported by the JVM without any hacks.

travishaagen · 2024-11-23T20:23:34Z

reactor-core/src/main/java/reactor/core/publisher/FluxOnAssembly.java

+						}
+					});
+				}
+


The above change was added, because the following "parallel" test cases were failing, but only with debug mode enabled. Debugging led me to this fix.

travishaagen · 2024-11-23T20:30:16Z

This also fixes an issue with the "counting mechanism", that the "parallel" unit tests were exercising.

For example, FluxOnAssemblyTest.checkpointEmptyForParallel shows in the old.txt stacktrace,

Error has been observed at the following site(s):
	*______________________Flux.map ⇢ at reactor.core.publisher.FluxOnAssemblyTest.lambda$checkpointEmptyForParallel$23(FluxOnAssemblyTest.java:535)
	*__ParallelFlux.transformGroups ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:535)
	|_ ParallelFlux.transformGroups ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:535)
	|_                 checkpoint() ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:536)
	|_      ParallelFlux.sequential ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:537)

While in the new.txt stacktrace, you see observed 2 times instead of a repeated entry.

Error has been observed at the following site(s):
	*______________________Flux.map ⇢ at reactor.core.publisher.FluxOnAssemblyTest.lambda$checkpointEmptyForParallel$23(FluxOnAssemblyTest.java:535)
	*__ParallelFlux.transformGroups ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:535) (observed 2 times)
	|_                 checkpoint() ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:536)
	|_      ParallelFlux.sequential ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:537)

chemicL · 2024-11-26T10:32:48Z

Hey, thanks for the description of your issue. I did look into the implementation to map out the things you raise and I don't see an indication why the mentioned implementation details would lead to a leak in any way. I am not saying there are no leaks, just the guess that the nodesPerId,root, and System.identityHashCode combination you describe does not seem to be the problem in my view.

The intention of using the System.identityHashCode is to identify objects regardless of their actual type by their reference. That makes sense for operators which can be attached multiple times, yet their distinct position in the chain can be pointed to by being a separate object. In your example, three checkpoint() operators are defined in the chain, however each one of them has a different identity.

Unless you are able to provide a reproducer for the leak caused by the above I don't think we will be able to apply the proposed change, as it goes against the original design of this functionality as described above.

However, I think the source of the problem is described in your comment about reusing static instances of exceptions and is triggered by:

		final Throwable fail(Throwable t) {
			boolean lightCheckpoint = snapshotStack.isLight();

			OnAssemblyException onAssemblyException = null;
			for (Throwable e : t.getSuppressed()) {
			// ...
			}

			// ...
			t = Exceptions.addSuppressed(t, onAssemblyException);
			/...
		}

This causes your static exception to accumulate all the checkpoint data.

This is a similar issue to #3371.

I do not have any solution in mind at this time rather than the suggestion to avoid the shared pre-instantiated and static exceptions or avoid using the checkpoint operator in case you're not willing to give that up.
A middle ground is also possible. Norman's article doesn't mention something that can be found in the comments - you can disable stack trace creation for an exception but don't have to share it. That can be achieved using a constructor available since JDK 1.7 or via overriding the fillInStackTrace() method to immediately return this.

Regarding the ParallelFlux* tests, if you are able to enhance the suite to parameterize tests to run with either enabled or disabled debug mode, feel free to submit a separate PR for that change.

travispeloton · 2024-11-26T15:03:51Z

@chemicL , thank you for your response. I found out that I can use AspectJ on the FluxOnAssembly.OnAssemblySubscriber.fail(t) method, and if my static exception type is encountered, it simply returns t, and my program is behaving as expected. In that case OnAssemblyException is never created.

What would you say to a System Property that would enable this behavior via configuration? Completely opt-out. Besides losing some debugging information, what other downside is there?

OnAssemblyException does some potentially expensive things every time an exception is handled:

Creates a HashMap and HashSet (inside ObservedAtInformationNode) under all circumstances
Uses synchronized against HashMap multiple times (in my case 3 to 4) in a single operation
As I have explained above, they can grow uncontrollably when exceptions are static

Some possible options on uncontrolled growth,

WeakHashMap and similar solution for the Set
Limit the number of items that can be added to the Map and Set. Under normal circumstances these should not grow too high should they? This limit could be configurable via a System Property too.

travispeloton · 2024-11-26T16:32:56Z

Setting enableSuppression to false will make it so that my static exception does not trigger the memory leak condition.

RuntimeException(String message, Throwable cause,
                               boolean enableSuppression,
                               boolean writableStackTrace)

chemicL · 2024-11-27T15:12:20Z

Unless this becomes more of a burden, I'd prefer not to add a global property to disable the checkpoint and debug agent functionality. As you noted, disabling suppression on your static exception types addresses the concern. We also investigated reactor-netty's static exceptions and they are also disabling suppression. In the Spring Framework codebase there are no static exceptions in use for these reasons and stack capturing is disabled in some of them on the hot path reducing the impact significantly.

travispeloton · 2024-11-27T16:57:41Z

@chemicL what about limiting the number of objects that can be added to the Map/Set?

Is there any real use case where you would expect many items in them? 25? 50? 100?

The problem with the current code is that its extremely difficult for someone impacted by this to figure out what is wrong. My project has been impacted for 3 years and we finally figured it out, only because someone took Java Flight Recorder snapshots in production over the course of a week and luckily stumbled upon the right clues.

chemicL · 2024-11-28T10:16:42Z

It's really difficult to say what number is adequate as a default limitation. There are users who report expectations of thousands of operators in a reactive chain - should we limit their ability to investigate their chains because we decide on some number? My experience is that such arbitrary limitations are also difficult to uncover without knowing what you're looking for. Perhaps a better approach is to explicitly mention static exceptions in our reference documentations? WDYT?

travispeloton · 2024-11-28T13:47:08Z

Yep, a mention about setting enableSuppression to false for static exceptions would be good to document.

In case of enabled debugging or checkpoints, usage of static exceptions in user code can cause memory leaks. A warning note has been added to the debugging section. Related to #3941

chemicL · 2024-12-02T09:09:11Z

@travispeloton I opened a PR for the documentation note.

In case of enabled debugging or checkpoints, usage of static exceptions in user code can cause memory leaks. A warning note has been added to the debugging section. Related to #3941

chemicL · 2024-12-02T12:41:32Z

Please also take note of reactor/reactor-netty#3529

Fix memoryleak in FluxOnAssembly

b23fd37

travishaagen requested a review from a team as a code owner November 22, 2024 23:23

fix parallel test tests with Hooks.onOperatorDebug() enabled

6df6336

travishaagen commented Nov 23, 2024

View reviewed changes

travishaagen added 2 commits November 23, 2024 12:45

Add contains("(observed 2 times)") checks

786629d

do less work inside synchronized block

4307eea

chemicL added the status/invalid We don't feel this issue is valid, or the root cause was found outside of Reactor label Nov 26, 2024

chemicL closed this Nov 27, 2024

chemicL mentioned this pull request Dec 2, 2024

[docs] Add warning about static exceptions #3944

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix memoryleak in FluxOnAssembly #3941

Fix memoryleak in FluxOnAssembly #3941

travishaagen commented Nov 22, 2024 •

edited

Loading

travishaagen commented Nov 23, 2024

travishaagen Nov 23, 2024

travishaagen commented Nov 23, 2024 •

edited

Loading

chemicL commented Nov 26, 2024 •

edited

Loading

travispeloton commented Nov 26, 2024 •

edited

Loading

travispeloton commented Nov 26, 2024

chemicL commented Nov 27, 2024

travispeloton commented Nov 27, 2024

chemicL commented Nov 28, 2024

travispeloton commented Nov 28, 2024

chemicL commented Dec 2, 2024

chemicL commented Dec 2, 2024

Fix memoryleak in FluxOnAssembly #3941

Fix memoryleak in FluxOnAssembly #3941

Conversation

travishaagen commented Nov 22, 2024 • edited Loading

travishaagen commented Nov 23, 2024

travishaagen Nov 23, 2024

Choose a reason for hiding this comment

travishaagen commented Nov 23, 2024 • edited Loading

chemicL commented Nov 26, 2024 • edited Loading

travispeloton commented Nov 26, 2024 • edited Loading

travispeloton commented Nov 26, 2024

chemicL commented Nov 27, 2024

travispeloton commented Nov 27, 2024

chemicL commented Nov 28, 2024

travispeloton commented Nov 28, 2024

chemicL commented Dec 2, 2024

chemicL commented Dec 2, 2024

travishaagen commented Nov 22, 2024 •

edited

Loading

travishaagen commented Nov 23, 2024 •

edited

Loading

chemicL commented Nov 26, 2024 •

edited

Loading

travispeloton commented Nov 26, 2024 •

edited

Loading