Startup RSS regression introduced by #35246 #35406

Closed

johnaohara opened this issue Aug 17, 2023 · 37 comments · Fixed by #35643
Assignees
Labels
area/config kind/bug Something isn't working
Milestone

Comments

@johnaohara
Member

johnaohara commented Aug 17, 2023

Describe the bug

PR #35246 introduced a startup RSS regression of ~14-17% (depending on application and environment) in both JVM and Native modes.

It is possible to see the effects in a simple quickstart (getting-started or config-quickstart)

config-quickstart Startup RSS

| Commit | Run Mode | RSS (MB) | Increase |
| --- | --- | --- | --- |
| 6f55d65 | JVM | 117.256 | |
| 3875d03 | JVM | 133.572 | +13.9% |
| 6f55d65 | Native | 76.744 | |
| 3875d03 | Native | 89.892 | +17.1% |

Expected behavior

No response

Actual behavior

No response

How to Reproduce?

Steps to reproduce:

  1. Build Quarkus commit 3875d03
  2. Build config-quickstart
  3. Start config-quickstart and measure RSS as described in guide: https://quarkus.io/guides/performance-measure#how-do-we-measure-memory-usage
  4. Build Quarkus commit 6f55d65
  5. Rebuild config-quickstart
  6. Restart config-quickstart and measure RSS as described in guide: https://quarkus.io/guides/performance-measure#how-do-we-measure-memory-usage

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

Tests were performed in Docker with 16 cores; however, the increase is also measurable with far fewer cores, down to 2 vCPUs.

@johnaohara johnaohara added the kind/bug Something isn't working label Aug 17, 2023
@gsmet
Member

gsmet commented Aug 17, 2023

@johnaohara were you able to get memory dumps of before/after?

@gsmet gsmet added this to the 3.4 - main milestone Aug 17, 2023
@johnaohara
Member Author

@gsmet not yet, I have only reported the change that was detected by Horreum. I will get some more detail now.

@gsmet
Member

gsmet commented Aug 17, 2023

Given we are moving most of the config to this new infrastructure, it's indeed important if it introduces a regression.

I'm surprised our previous work on this didn't show it, though, as we have already moved quite a lot of config.

@johnaohara
Member Author

I have compared this commit (3875d03) with the previous commit (6f55d65)

Heap dumps show the live set size is approximately the same after startup.

In JVM mode without Xmx set

There are ~13.4MB more unreachable objects for 3875d03 compared to 6f55d65 (the previous commit):

3875d03 - Unreachable Objects: 40.10MB ; No. GC’s: 2
6f55d65 - Unreachable Objects: 20.69MB ; No. GC’s: 1

If you limit the heap size (e.g. -Xmx64m), the RSS is comparable after startup.

Looking at JFR data, total allocation during startup;

3875d03 - 73.887MB
6f55d65 - 56.96MB

In Native mode with -Xmx64m

Number of GC’s changes from 3 -> 6

More objects are allocated during startup, so limited heap slows application startup: for config-quickstart (-Xmx64m) the startup times reported are;

3875d03 - 0.023s
6f55d65 - 0.061s

Heap recovered during startup:

3875d03 - 47.104MB; No. GC’s: 6
6f55d65 - 4.088MB; No. GC’s: 3

Quarkus is now allocating more during startup. If you limit the heap size, the RSS will likely stay fairly constant; however, startup times will be affected as there is more GC activity. Conversely, if -Xmx is not set, startup times remain consistent, but there is an increase in RSS.

I have attached some allocation profiles with async-profiler in JVM mode. I can look at them in more detail next week

alloc_profile.3875d03.zip
alloc_profile.6f55d65.zip

@geoand
Contributor

geoand commented Aug 21, 2023

I really really think that we should be looking into what @gsmet and I have proposed in the past - that @ConfigMapping class generation happens at extension build time, not application build time.
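
For context, a config mapping is just an annotated interface like the (made-up) example below; SmallRye Config generates an implementation class for each such interface, and the proposal above is about where that generation happens:

```java
import io.smallrye.config.ConfigMapping;
import io.smallrye.config.WithDefault;

// Hypothetical mapping used only for illustration: SmallRye Config generates an
// implementation class for interfaces like this one, and the idea above is to do
// that generation at extension build time instead of application build time.
@ConfigMapping(prefix = "greeting")
public interface GreetingConfig {

    // bound from greeting.message
    String message();

    // bound from greeting.suffix, with a default value
    @WithDefault("!")
    String suffix();
}
```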

@gsmet
Member

gsmet commented Aug 21, 2023

From what I can see, Config.readConfig() is looking pretty bad and represents 13% of the allocations.

@gsmet
Member

gsmet commented Aug 21, 2023

But it's not the only culprit AFAICS: the clinit of generated.Config doesn't look good either.

Also, the Netty/Vert.x startup allocation profile looks a bit different; I'm not sure we don't have a regression there too.

@gsmet
Member

gsmet commented Aug 21, 2023

@radcortez I think we will really need you to have a look at the config issues.

@gsmet
Member

gsmet commented Aug 21, 2023

@johnaohara btw, we are still very interested in your insights and what we can do to fix these issues (and if there are others). I just did a very quick analysis.

@radcortez
Member

I'm out until next week. Please revert the commit if this is blocking the release until I can look into it. Thanks... and sorry.

@gsmet
Member

gsmet commented Aug 22, 2023

@radcortez yeah, no worries. I assigned two issues to you so that you can find them easily when you're back. Have fun!

@geoand
Contributor

geoand commented Aug 22, 2023

3.3 is not affected, so we have time to fix things

@franz1981
Contributor

franz1981 commented Aug 25, 2023

Hi all, I've already spoken about it with @johnaohara on gchat, but let me put some hints here to help troubleshoot this...
As John noted, there are both more allocations and more live objects; but object liveness can depend on how good the GC is at deciding to tenure (or clean up, if unreachable) objects, and the overall additional RSS can be a mixture of more temporary allocations and/or more tenured (real!) ones, because the final live set is bigger...

TLAB profiling is not a good idea with so few samples and risks being very inaccurate. When we allocate so little, we are more interested in the full spectrum of data, unbiased if possible, even at the cost of not distinguishing between live/temporary allocations and tenured ones (maybe still temporary but with an extended lifetime, due to the heap capacity).

What to do then?

For startup (more accurate) measurements I suggest 2 experiments.

First, using EpsilonGC:

  • enable EpsilonGC via -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:-UseTLAB -Xmx<big enough value> -Xms<same as Xmx> -XX:+AlwaysPreTouch: this won't let the GC get rid of any garbage, but it requires setting the heap capacity big enough to cover ALL the heap allocations at startup and, by disabling TLABs, it guarantees that every single allocation (with its size) is recorded (see https://krzysztofslusarski.github.io/2022/12/12/async-manual.html#alloc for more info about the TLAB mechanism). Leaving TLABs enabled could, due to the small number of samples, bias the allocation events within a TLAB toward the last unlucky allocation which causes a new TLAB to be created, pointing to the wrong allocation source and type
  • attach the profiler at startup by adding to the startup JVM args -agentpath:<path to the ap *.so>/libasyncProfiler.so=start,event=alloc,file=alloc.jfr and convert it via java -cp <ap path to the converter>/converter.jar jfr2flame ~/alloc.jfr --alloc --total alloc.html: the --total option is KEY because it forces the flamegraphs to report the "bytes" allocated and not the number of events, helping to spot single big allocations. If knowing the line of code in the source files is important, you can add --lines to the converter arguments and the flamegraph will report the lines of code where the allocation happened

The second experiment requires a bit more involvement and is meant to detect which of the allocations are alive after startup, contributing to the overall footprint.
In order to do it:

  • disable TLABs and use G1GC via -XX:+UseG1GC -XX:-UseTLAB: with TLAB allocations disabled, G1 will capture ALL the allocations, not just a few biased samples, improving accuracy (and the amount of data, obviously)
  • attach the profiler at startup by adding to the startup JVM args -agentpath:<path to the ap *.so>/libasyncProfiler.so=start,event=alloc,live,file=alloc_live.jfr: the live option is added; it will become evident why later
  • when the application is fully started, run jcmd <pid of quarkus> GC.run
  • stop the application: this will cause the produced JFR file to contain all the still-alive allocations; convert it via java -cp <ap path to the converter>/converter.jar jfr2flame ~/alloc_live.jfr --alloc --total alloc_live.html, or add --lines to see the source lines for those still-alive allocations

The second experiment is very similar to collecting a heap dump, but it brings the stack traces along with the allocations, helping to spot where they happened. Hope this helps (I'll be on PTO from today EOD, and I'm still clearing up my backlog).

VERY IMPORTANT NOTE

An easy way to find, for the allocated types, what has changed (and by how much) is to let the converter produce reverse flamegraphs by adding --reverse to the converter parameters.

@franz1981
Contributor

One note related to the live option: https://github.com/async-profiler/async-profiler/blob/dcc3ffd083a64d5a1848e79c1bded141295b6e0a/src/objectSampler.cpp#L45 shows that async-profiler retains by default 1024 different weak refs to detect leaks, meaning that if the still-alive allocations coming from startup exceed that capacity, they won't be reported. I am currently not aware of any other mechanism that can collect such leaks with the stack trace (apart from the JFR oldObject event).
Hence, you can use JFR oldObject events, which can be configured with a higher capacity.
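
For anyone who wants to go the JFR route, a minimal sketch using the standard jdk.jfr API (sketch only; the "infinity" cutoff value and the -XX:FlightRecorderOptions old-object-queue-size knob for raising the sampler capacity are from memory, so double-check them):

```java
import java.nio.file.Path;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Sketch only: record jdk.OldObjectSample events with stack traces so that
// long-lived startup allocations can be inspected after the fact.
public class OldObjectSampling {

    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            recording.enable("jdk.OldObjectSample")
                     .with("cutoff", "infinity") // keep samples regardless of age (from memory)
                     .withStackTrace();
            recording.start();

            // ... run the startup code to analyze here ...

            recording.stop();
            Path out = Path.of("old-objects.jfr");
            recording.dump(out);

            // Print the sampled old objects together with their allocation stack traces.
            for (RecordedEvent event : RecordingFile.readAllEvents(out)) {
                if ("jdk.OldObjectSample".equals(event.getEventType().getName())) {
                    System.out.println(event);
                }
            }
        }
    }
}
```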

@gsmet
Member

gsmet commented Aug 25, 2023

@geoand 3.3 might be affected as we started to convert several key areas to config mapping. It might be related (or not, to be verified) to Matt Raible's report that 3.3 is significantly slower.

@gsmet
Member

gsmet commented Aug 25, 2023

@franz1981 given it's a major regression, I think it would be very helpful if you could help to pinpoint the problem we have.

@johnaohara
Member Author

Using the methodology @franz1981 described in method 1 above:

java -agentpath:/path/to/libasyncProfiler.so=start,event=alloc,file=alloc.jfr -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:-UseTLAB -Xmx512m -Xms512m -XX:+AlwaysPreTouch -jar quarkus-app/quarkus-run.jar

and

java -cp /path/to/converter.jar jfr2flame ./alloc.jfr --alloc --total alloc.html

there is a large increase in allocations coming from io.smallrye.config.ConfigMappings.mapConfiguration()

6f55d65

Screenshot from 2023-08-25 08-40-01

3875d03

Screenshot from 2023-08-25 08-43-52

@geoand
Contributor

geoand commented Aug 25, 2023

@gsmet right.

So according to the profile data, the ConfigMapping is the problem.

This is serious and we need to address it ASAP.

One way, as @gsmet and I have mentioned in the past, would be to move the class generation to extension build time.

@radcortez
Member

At a quick glance, it seems that some of the allocation issues are related to NameIterator, which is the class we use to retrieve the segments of each key. We can probably improve some of the mapping code around that.
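
For illustration only (this is not the actual NameIterator code), the kind of index-based segment walking such a class enables, where a segment can be compared without allocating a substring for it:

```java
// Illustrative sketch, not SmallRye Config's NameIterator: walk the
// dot-separated segments of a key using indexes, and compare a segment
// against an expected name without allocating a substring for it.
public final class KeySegments {

    /** Returns true if the segment of {@code key} starting at {@code start} equals {@code expected}. */
    static boolean segmentEquals(String key, int start, String expected) {
        int end = key.indexOf('.', start);
        if (end == -1) {
            end = key.length();
        }
        return end - start == expected.length()
                && key.regionMatches(start, expected, 0, expected.length());
    }

    /** Returns the start index of the next segment, or -1 if there is none. */
    static int nextSegment(String key, int start) {
        int dot = key.indexOf('.', start);
        return dot == -1 ? -1 : dot + 1;
    }

    public static void main(String[] args) {
        String key = "quarkus.http.port";
        int start = 0;
        System.out.println(segmentEquals(key, start, "quarkus")); // true
        start = nextSegment(key, start);
        System.out.println(segmentEquals(key, start, "http"));    // true
    }
}
```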

@radcortez
Member

> One way as @gsmet and I have mentioned in the past would be to move the class generation to extension build time

I don't think it would be any different for this particular case. All the generation already happens at build time; this is the mapping part. There are probably some pieces on the mapping side that could be moved to build time.

@geoand
Contributor

geoand commented Aug 25, 2023

Cool. I'd be glad to help if you need assistance

@radcortez
Member

@johnaohara can you please attach the allocation files? Thanks!

@franz1981
Contributor

@johnaohara if you attach the JFR file, people can use it to extract the lines too.

@gsmet
Member

gsmet commented Aug 25, 2023

@radcortez FYI the async profiler output is available in this comment: #35406 (comment)

@franz1981
Contributor

franz1981 commented Aug 25, 2023

@gsmet it wasn't using the mode I explained later, so it is slightly less accurate, but I haven't verified by how much.

@radcortez
Member

> @radcortez FYI the async profiler output is available in this comment: #35406 (comment)

Thanks. Sorry, I didn't notice they were attached earlier. I was only looking at the latest screenshots.

@radcortez
Member

Drilling down on the graphs, a lot of the allocations are performed by the method at https://github.com/smallrye/smallrye-config/blob/30be0d1def67783b5bdc977d4aad2ce5b82dc186/implementation/src/main/java/io/smallrye/config/ConfigMappingProvider.java#L1053.

This method tries to do some matching between environment variables and the mapped properties. Since we now have more mappings, and if there is an issue with this method, it makes sense that we didn't notice it until now, as we didn't have that many mappings before.

Also, many things run twice because of the static init config and the runtime init config, which are created separately. For some time I have wanted to reuse some of the static config work to feed the runtime config and reduce the allocations. I guess we need to do that now.
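
To make the cost concrete, a rough sketch of what such matching implies (illustrative only, not the SmallRye Config implementation; the environment-variable form follows the MicroProfile Config rule of replacing non-alphanumeric characters with '_' and uppercasing):

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of why the matching is costly: for every mapped property
// we derive its environment-variable form and compare it against every
// environment variable. The work, and the temporary strings/builders,
// grow with (mapped properties x environment variables).
public class EnvMatchingSketch {

    static String toEnvName(String property) {
        StringBuilder sb = new StringBuilder(property.length());
        for (int i = 0; i < property.length(); i++) {
            char c = property.charAt(i);
            sb.append(Character.isLetterOrDigit(c) ? Character.toUpperCase(c) : '_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> mappedProperties = List.of("quarkus.http.port", "quarkus.http.host");
        Map<String, String> env = System.getenv();

        for (String property : mappedProperties) {      // P iterations
            String envName = toEnvName(property);        // one temporary builder + String each
            for (String envVar : env.keySet()) {          // x E iterations
                if (envVar.equals(envName)) {
                    System.out.println(envVar + " overrides " + property);
                }
            }
        }
    }
}
```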

@franz1981
Contributor

Looking at it quickly, I see many low-hanging fruits there regardless... e.g., many of the Strings and builders allocated are not necessary at all...

@geoand
Contributor

geoand commented Aug 25, 2023

Right, this is how things are meant to work :)
We actually put effort into optimizing stuff when it starts causing a problem.

@radcortez
Member

Correct. I'll start the work right away when I get back. Meanwhile, feel free to revert the original commit to avoid affecting the performance further.

We now have a baseline and we can work and improve on it. Thanks!

@franz1981
Contributor

That's a fair point indeed, @geoand, or we would end up writing ugly code for every single line of code!

@johnaohara
Member Author

The zipped flame graphs in #35406 (comment) were allocation samples.

We updated the methodology to capture the allocation sizes and give a more accurate picture of what is causing the extra memory pressure; this methodology produced the screenshots in #35406 (comment)

Attached are the second set of allocation-size profiles:

alloc-sizes-6f55d65.zip

alloc-sizes-3875d03.zip

@geoand
Contributor

geoand commented Aug 25, 2023

> feel free to revert the original commit to avoid affecting the performance further

I think we can live with this in 3.3 (especially as 3.2 is not affected) and have a better 3.4

@franz1981
Contributor

@radcortez let me know if the method @johnaohara used, which I explained in the comment above, works for you...
It should be reliable enough that just looking at the number of samples (which is the total number of allocated bytes) will immediately show any improvement, but beware... it comes with a startup time cost (because it is not meant to measure that), like 14 seconds for the test John made!

@radcortez
Member

Sure. Thanks!

@franz1981
Contributor

@radcortez I've sent something at smallrye/smallrye-config#984
Let me know if that helps; my approach has been a bit brutal, but it should help regardless of other optimizations.

@radcortez
Member

radcortez commented Aug 28, 2023

To add more information:

The issue is caused by special rules to match environment variables against regular properties. The matching is applied between all available environment variables and all properties available in mappings. We hardly noticed this before because only a few extensions had been migrated; with the increase in mappings, the issue became more visible.

@franz1981's PR in smallrye/smallrye-config#984 addresses the allocation issues, which should help when we do need to match, but adding a filter on which environment variables to consider drastically reduces the number of matches we need to perform (and the allocations).
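
A rough sketch of what such a filter looks like (illustrative only, not the actual change in the PR):

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch: pre-filter the environment variables by the known
// mapping prefixes so that the expensive per-property matching only runs
// against a handful of candidates instead of the whole environment.
public class EnvFilterSketch {

    public static void main(String[] args) {
        // Env-variable forms of the mapping prefixes, e.g. "quarkus." -> "QUARKUS_"
        Set<String> prefixes = Set.of("QUARKUS_", "SMALLRYE_CONFIG_");

        Map<String, String> candidates = System.getenv().entrySet().stream()
                .filter(e -> prefixes.stream().anyMatch(p -> e.getKey().startsWith(p)))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

        // Only these candidates would be fed into the property matching.
        System.out.println("Environment variables considered: " + candidates.keySet());
    }
}
```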

Currently, I'm seeing the following numbers in my box:

| Commit | Run Mode | RSS (MB) |
| --- | --- | --- |
| 6f55d65 | JVM | 163.87 |
| 3875d03 | JVM | 195.02 |
| with PR | JVM | 163.94 |
| 6f55d65 | Native | 58.52 |
| 3875d03 | Native | 69.11 |
| with PR | Native | 58.75 |

The PR seems to return the numbers to the previously expected RSS, maybe with a very slight increase, but I'll keep looking.
