Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memoryleak in FluxOnAssembly #3941

Closed

Conversation

travishaagen
Copy link

@travishaagen travishaagen commented Nov 22, 2024

This PR contains one possible fix for a memory leak discovered in FluxOnAssembly.

My colleague, @foo4u discovered a memory leak in one of our applications, and I believe that it is caused by some code in FluxOnAssembly. The nodesPerId HashMap expects the key to be derived from calling hashCode() on an object, but inside FluxOnAssembly.OnAssemblyException#add(Publisher<?>, Publisher<?>, String, String) the debugger is showing that an AssemblyOp implementation is the object type. None of the AssemblyOp implementations implement hashCode(), so we're getting the integer returned by Object.hashCode() instead.

What I'm observing in a Kotlin Spring Webflux project is that every time we make a HTTP request to a Controller, this code path is being hit three times, which results in three objects being added to the HashMap.

FluxOnAssembly hasn't been changed in three years, and to be honest, the nodesPerId and how it is used in this class is odd. It might only be used to increment a counter, but it also is protecting a synchronized block when the root object is accessed. root has a HashSet that also contains the same objects as nodesPerId, and they are traversed in the getMessage() function.

All the AssemblyOp implementations override toString() with something that looks unique, and was probably what the original author intended to base the nodesPerId key on.

Here is an example of three keys, from this PR, attributed to a single HTTP request:

checkpoint("Handler com.xyz.MyController#myFunc(String, String, String, String, String, ServerWebExchange, Continuation) [DispatcherHandler]")

checkpoint("com.xyz.ShutdownWebFilter [DefaultWebFilterChain]")

checkpoint("HTTP POST "/endpoint" [ExceptionHandlingWebHandler]")

Willing to discuss!

Also! There are some other uses of System.identityHashCode in the codebase, but I don't have insights into them at the moment.

@travishaagen travishaagen requested a review from a team as a code owner November 22, 2024 23:23
@travishaagen
Copy link
Author

The leak only happens when exceptions are thrown. The reasons why one of our applications are so heavily affected are that:

  1. It gets a lot of traffic
  2. It uses exceptions for flow control in some places

The exceptions are static with pre-allocated stack traces as described in this blog (creator of Netty). This technique is used by Netty as well, and is fully supported by the JVM without any hacks.

}
});
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above change was added, because the following "parallel" test cases were failing, but only with debug mode enabled. Debugging led me to this fix.

image

@travishaagen
Copy link
Author

travishaagen commented Nov 23, 2024

This also fixes an issue with the "counting mechanism", that the "parallel" unit tests were exercising.

For example, FluxOnAssemblyTest.checkpointEmptyForParallel shows in the old.txt stacktrace,

Error has been observed at the following site(s):
	*______________________Flux.map ⇢ at reactor.core.publisher.FluxOnAssemblyTest.lambda$checkpointEmptyForParallel$23(FluxOnAssemblyTest.java:535)
	*__ParallelFlux.transformGroups ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:535)
	|_ ParallelFlux.transformGroups ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:535)
	|_                 checkpoint() ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:536)
	|_      ParallelFlux.sequential ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:537)

While in the new.txt stacktrace, you see observed 2 times instead of a repeated entry.

Error has been observed at the following site(s):
	*______________________Flux.map ⇢ at reactor.core.publisher.FluxOnAssemblyTest.lambda$checkpointEmptyForParallel$23(FluxOnAssemblyTest.java:535)
	*__ParallelFlux.transformGroups ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:535) (observed 2 times)
	|_                 checkpoint() ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:536)
	|_      ParallelFlux.sequential ⇢ at reactor.core.publisher.FluxOnAssemblyTest.checkpointEmptyForParallel(FluxOnAssemblyTest.java:537)

@chemicL
Copy link
Member

chemicL commented Nov 26, 2024

Hey, thanks for the description of your issue. I did look into the implementation to map out the things you raise and I don't see an indication why the mentioned implementation details would lead to a leak in any way. I am not saying there are no leaks, just the guess that the nodesPerId,root, and System.identityHashCode combination you describe does not seem to be the problem in my view.

The intention of using the System.identityHashCode is to identify objects regardless of their actual type by their reference. That makes sense for operators which can be attached multiple times, yet their distinct position in the chain can be pointed to by being a separate object. In your example, three checkpoint() operators are defined in the chain, however each one of them has a different identity.

Unless you are able to provide a reproducer for the leak caused by the above I don't think we will be able to apply the proposed change, as it goes against the original design of this functionality as described above.

However, I think the source of the problem is described in your comment about reusing static instances of exceptions and is triggered by:

		final Throwable fail(Throwable t) {
			boolean lightCheckpoint = snapshotStack.isLight();

			OnAssemblyException onAssemblyException = null;
			for (Throwable e : t.getSuppressed()) {
			// ...
			}

			// ...
			t = Exceptions.addSuppressed(t, onAssemblyException);
			/...
		}

This causes your static exception to accumulate all the checkpoint data.

This is a similar issue to #3371.

I do not have any solution in mind at this time rather than the suggestion to avoid the shared pre-instantiated and static exceptions or avoid using the checkpoint operator in case you're not willing to give that up.
A middle ground is also possible. Norman's article doesn't mention something that can be found in the comments - you can disable stack trace creation for an exception but don't have to share it. That can be achieved using a constructor available since JDK 1.7 or via overriding the fillInStackTrace() method to immediately return this.

Regarding the ParallelFlux* tests, if you are able to enhance the suite to parameterize tests to run with either enabled or disabled debug mode, feel free to submit a separate PR for that change.

@chemicL chemicL added the status/invalid We don't feel this issue is valid, or the root cause was found outside of Reactor label Nov 26, 2024
@travispeloton
Copy link

travispeloton commented Nov 26, 2024

@chemicL , thank you for your response. I found out that I can use AspectJ on the FluxOnAssembly.OnAssemblySubscriber.fail(t) method, and if my static exception type is encountered, it simply returns t, and my program is behaving as expected. In that case OnAssemblyException is never created.

What would you say to a System Property that would enable this behavior via configuration? Completely opt-out. Besides losing some debugging information, what other downside is there?

OnAssemblyException does some potentially expensive things every time an exception is handled:

  1. Creates a HashMap and HashSet (inside ObservedAtInformationNode) under all circumstances
  2. Uses synchronized against HashMap multiple times (in my case 3 to 4) in a single operation
  3. As I have explained above, they can grow uncontrollably when exceptions are static

Some possible options on uncontrolled growth,

  • WeakHashMap and similar solution for the Set
  • Limit the number of items that can be added to the Map and Set. Under normal circumstances these should not grow too high should they? This limit could be configurable via a System Property too.

@travispeloton
Copy link

Setting enableSuppression to false will make it so that my static exception does not trigger the memory leak condition.

RuntimeException(String message, Throwable cause,
                               boolean enableSuppression,
                               boolean writableStackTrace)

@chemicL
Copy link
Member

chemicL commented Nov 27, 2024

Unless this becomes more of a burden, I'd prefer not to add a global property to disable the checkpoint and debug agent functionality. As you noted, disabling suppression on your static exception types addresses the concern. We also investigated reactor-netty's static exceptions and they are also disabling suppression. In the Spring Framework codebase there are no static exceptions in use for these reasons and stack capturing is disabled in some of them on the hot path reducing the impact significantly.

@chemicL chemicL closed this Nov 27, 2024
@travispeloton
Copy link

@chemicL what about limiting the number of objects that can be added to the Map/Set?

Is there any real use case where you would expect many items in them? 25? 50? 100?

The problem with the current code is that its extremely difficult for someone impacted by this to figure out what is wrong. My project has been impacted for 3 years and we finally figured it out, only because someone took Java Flight Recorder snapshots in production over the course of a week and luckily stumbled upon the right clues.

@chemicL
Copy link
Member

chemicL commented Nov 28, 2024

It's really difficult to say what number is adequate as a default limitation. There are users who report expectations of thousands of operators in a reactive chain - should we limit their ability to investigate their chains because we decide on some number? My experience is that such arbitrary limitations are also difficult to uncover without knowing what you're looking for. Perhaps a better approach is to explicitly mention static exceptions in our reference documentations? WDYT?

@travispeloton
Copy link

Yep, a mention about setting enableSuppression to false for static exceptions would be good to document.

chemicL added a commit that referenced this pull request Dec 2, 2024
In case of enabled debugging or checkpoints, usage of static exceptions
in user code can cause memory leaks. A warning note has been added to
the debugging section.

Related to #3941
@chemicL
Copy link
Member

chemicL commented Dec 2, 2024

@travispeloton I opened a PR for the documentation note.

chemicL added a commit that referenced this pull request Dec 2, 2024
In case of enabled debugging or checkpoints, usage of static exceptions
in user code can cause memory leaks. A warning note has been added to
the debugging section.

Related to #3941
@chemicL
Copy link
Member

chemicL commented Dec 2, 2024

Please also take note of reactor/reactor-netty#3529

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/invalid We don't feel this issue is valid, or the root cause was found outside of Reactor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants