Startup RSS regression introduced by #35246 #35406

Closed

johnaohara opened this issue Aug 17, 2023 · 37 comments · Fixed by #35643
Assignees
Labels
area/config kind/bug Something isn't working
Milestone

Comments

@johnaohara
Member

johnaohara commented Aug 17, 2023

Describe the bug

PR #35246 introduced a startup RSS regression of ~14-17% (depending on application and environment) in both JVM and Native modes.

It is possible to see the effects in a simple quickstart (getting-started or config-quickstart)

config-quickstart Startup RSS

| Commit | Run Mode | RSS (MB) | Increase |
| --- | --- | --- | --- |
| 6f55d65 | JVM | 117.256 | |
| 3875d03 | JVM | 133.572 | +13.9% |
| 6f55d65 | Native | 76.744 | |
| 3875d03 | Native | 89.892 | +17.1% |

Expected behavior

No response

Actual behavior

No response

How to Reproduce?

Steps to reproduce:

  1. Build Quarkus commit 3875d03
  2. Build config-quickstart
  3. Start config-quickstart and measure RSS as described in guide: https://quarkus.io/guides/performance-measure#how-do-we-measure-memory-usage
  4. Build Quarkus commit 6f55d65
  5. Rebuild config-quickstart
  6. Restart config-quickstart and measure RSS as described in guide: https://quarkus.io/guides/performance-measure#how-do-we-measure-memory-usage

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

Tests were performed in Docker with 16 cores; however, the increase is also measurable with far fewer cores, down to 2 vCPUs.

@johnaohara johnaohara added the kind/bug Something isn't working label Aug 17, 2023
@gsmet
Member

gsmet commented Aug 17, 2023

@johnaohara were you able to get memory dumps of before/after?

@gsmet gsmet added this to the 3.4 - main milestone Aug 17, 2023
@johnaohara
Member Author

@gsmet not yet, I have only reported the change that was detected by Horreum. I will get some more detail now.

@gsmet
Member

gsmet commented Aug 17, 2023

Given we are moving most of the config to this new infrastructure, it's indeed important if it introduces a regression.

I'm surprised our previous work on this didn't show it, though, as we have already moved quite a lot of config.

@johnaohara
Member Author

I have compared this commit (3875d03) with the previous commit (6f55d65)

Heap dumps show the live set size is approximately the same after startup.

In JVM mode without Xmx set

There are ~13.4MB more unreachable objects for 3875d03 compared to 6f55d65 (the previous commit):

3875d03 - Unreachable Objects: 40.10MB ; No. GC’s: 2
6f55d65 - Unreachable Objects: 20.69MB ; No. GC’s: 1

If you limit the heap size (e.g. -Xmx64m), the RSS is comparable after startup.

Looking at JFR data, total allocation during startup;

3875d03 - 73.887MB
6f55d65 - 56.96MB

In Native mode with -Xmx64m

Number of GC’s changes from 3 -> 6

More objects are allocated during startup, so limited heap slows application startup: for config-quickstart (-Xmx64m) the startup times reported are;

3875d03 - 0.023s
6f55d65 - 0.061s

Heap recovered during startup:

3875d03 - 47.104MB; No. GC’s: 6
6f55d65 - 4.088MB; No. GC’s: 3

Quarkus is now allocating more during startup. If you limit the heap size, the RSS will likely stay fairly constant; however, startup times will be affected as there is more GC activity. Conversely, if -Xmx is not set, startup times remain consistent, but there is an increase in RSS.

I have attached some allocation profiles with async-profiler in JVM mode. I can look at them in more detail next week

alloc_profile.3875d03.zip
alloc_profile.6f55d65.zip

@geoand
Contributor

geoand commented Aug 21, 2023

I really really think that we should be looking into what @gsmet and I have proposed in the past - that @ConfigMapping class generation happens at extension build time, not application build time.
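
For context, a config mapping is just an annotated interface like the (made-up) example below; SmallRye Config generates an implementation class for each such interface, and the proposal above is about where that generation happens:

```java
import io.smallrye.config.ConfigMapping;
import io.smallrye.config.WithDefault;

// Hypothetical mapping used only for illustration: SmallRye Config generates an
// implementation class for interfaces like this one, and the idea above is to do
// that generation at extension build time instead of application build time.
@ConfigMapping(prefix = "greeting")
public interface GreetingConfig {

    // bound from greeting.message
    String message();

    // bound from greeting.suffix, with a default value
    @WithDefault("!")
    String suffix();
}
```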

@gsmet
Member

gsmet commented Aug 21, 2023

From what I can see, Config.readConfig() is looking pretty bad and represents 13% of the allocations.

@gsmet
Member

gsmet commented Aug 21, 2023

But it's not the only culprit AFAICS: the clinit of generated.Config doesn't look good either.

Also, the Netty/Vert.x startup allocation profile looks a bit different; I'm not sure we don't have a regression there too.

@gsmet
Member

gsmet commented Aug 21, 2023

@radcortez I think we will really need you to have a look at the config issues.

@gsmet
Member

gsmet commented Aug 21, 2023

@johnaohara btw, we are still very interested in your insights and what we can do to fix these issues (and if there are others). I just did a very quick analysis.

@radcortez
Member

I'm out until next week. Please revert the commit if this is blocking the release until I can look into it. Thanks... and sorry.

@gsmet
Member

gsmet commented Aug 22, 2023

@radcortez yeah, no worries. I assigned two issues to you so that you can find them easily when you're back. Have fun!

@geoand
Contributor

geoand commented Aug 22, 2023

3.3 is not affected, so we have time to fix things

@franz1981
Contributor

franz1981 commented Aug 25, 2023

Hi all, I've already spoken about it with @johnaohara on gchat, but let me put some hints here to help troubleshoot this...
As John noted, there are both more allocations and more live objects; but object liveness can depend on how good the GC is at deciding to tenure (or clean up, if unreachable) objects, and the overall additional RSS can be a mixture of more temporary allocations and/or more tenured (real!) ones, because the final live set is bigger...

TLAB profiling is not a good idea with so few samples and risks being very inaccurate. When we allocate so little, we are more interested in the full spectrum of data, unbiased if possible, even at the cost of not distinguishing between live/temporary allocations and tenured ones (maybe still temporary but with an extended lifetime, due to the heap capacity).

What to do then?

For startup (more accurate) measurements I suggest 2 experiments.

First, using EpsilonGC:

  • enable EpsilonGC via -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:-UseTLAB -Xmx<big enough value> -Xms<same as Xmx> -XX:+AlwaysPreTouch: this won't let the GC get rid of any garbage, but it requires setting the heap capacity big enough to cover ALL the heap allocations at startup and, by disabling TLABs, it guarantees that every single allocation (with its size) is recorded (see https://krzysztofslusarski.github.io/2022/12/12/async-manual.html#alloc for more info about the TLAB mechanism). Leaving TLABs enabled could, due to the small number of samples, bias the allocation events within a TLAB toward the last unlucky allocation which causes a new TLAB to be created, pointing to the wrong allocation source and type
  • attach the profiler at startup by adding to the startup JVM args -agentpath:<path to the ap *.so>/libasyncProfiler.so=start,event=alloc,file=alloc.jfr and convert it via java -cp <ap path to the converter>/converter.jar jfr2flame ~/alloc.jfr --alloc --total alloc.html: the --total option is KEY because it forces the flamegraphs to report the "bytes" allocated and not the number of events, helping to spot single big allocations. If knowing the line of code in the source files is important, you can add --lines to the converter arguments and the flamegraph will report the lines of code where the allocation happened

The second experiment requires a bit more involvement and is meant to detect which of the allocations are alive after startup, contributing to the overall footprint.
In order to do it:

  • disable TLABs and use G1GC via -XX:+UseG1GC -XX:-UseTLAB: with TLAB allocations disabled, G1 will capture ALL the allocations, not just a few biased samples, improving accuracy (and the amount of data, obviously)
  • attach the profiler at startup by adding to the startup JVM args -agentpath:<path to the ap *.so>/libasyncProfiler.so=start,event=alloc,live,file=alloc_live.jfr: the live option is added; it will become evident why later
  • when the application is fully started, run jcmd <pid of quarkus> GC.run
  • stop the application: this will cause the produced JFR file to contain all the still-alive allocations; convert it via java -cp <ap path to the converter>/converter.jar jfr2flame ~/alloc_live.jfr --alloc --total alloc_live.html, or add --lines to see the source lines for those still-alive allocations

The second experiment is very similar to collecting a heap dump, but it brings the stack traces along with the allocations, helping to spot where they happened. Hope this helps (I'll be on PTO from today EOD, and I'm still clearing up my backlog).

VERY IMPORTANT NOTE

An easy way to find, for the allocated types, what has changed (and by how much) is to let the converter produce reverse flamegraphs by adding --reverse to the converter parameters.

@franz1981
Contributor

One note related to the live option: https://github.com/async-profiler/async-profiler/blob/dcc3ffd083a64d5a1848e79c1bded141295b6e0a/src/objectSampler.cpp#L45 shows that async-profiler retains by default 1024 different weak refs to detect leaks, meaning that if the still-alive allocations coming from startup exceed that capacity, they won't be reported. I am currently not aware of any other mechanism that can collect such leaks with the stack trace (apart from the JFR oldObject event).
Hence, you can use JFR oldObject events, which can be configured with a higher capacity.
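
For anyone who wants to go the JFR route, a minimal sketch using the standard jdk.jfr API (sketch only; the "infinity" cutoff value and the -XX:FlightRecorderOptions old-object-queue-size knob for raising the sampler capacity are from memory, so double-check them):

```java
import java.nio.file.Path;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Sketch only: record jdk.OldObjectSample events with stack traces so that
// long-lived startup allocations can be inspected after the fact.
public class OldObjectSampling {

    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            recording.enable("jdk.OldObjectSample")
                     .with("cutoff", "infinity") // keep samples regardless of age (from memory)
                     .withStackTrace();
            recording.start();

            // ... run the startup code to analyze here ...

            recording.stop();
            Path out = Path.of("old-objects.jfr");
            recording.dump(out);

            // Print the sampled old objects together with their allocation stack traces.
            for (RecordedEvent event : RecordingFile.readAllEvents(out)) {
                if ("jdk.OldObjectSample".equals(event.getEventType().getName())) {
                    System.out.println(event);
                }
            }
        }
    }
}
```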

@gsmet
Member

gsmet commented Aug 25, 2023

@geoand 3.3 might be affected as we started to convert several key areas to config mapping. It might be related (or not, to be verified) to Matt Raible's report that 3.3 is significantly slower.

@gsmet
Member

gsmet commented Aug 25, 2023

@franz1981 given it's a major regression, I think it would be very helpful if you could help to pinpoint the problem we have.

@johnaohara
Member Author

Using the methodology @franz1981 described in method 1 above:

java -agentpath:/path/to/libasyncProfiler.so=start,event=alloc,file=alloc.jfr -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:-UseTLAB -Xmx512m -Xms512m -XX:+AlwaysPreTouch -jar quarkus-app/quarkus-run.jar

and

java -cp /path/to/converter.jar jfr2flame ./alloc.jfr --alloc --total alloc.html

there is a large increase in allocations coming from io.smallrye.config.ConfigMappings.mapConfiguration()

6f55d65

Screenshot from 2023-08-25 08-40-01

3875d03

Screenshot from 2023-08-25 08-43-52

@geoand
Contributor

geoand commented Aug 25, 2023

@gsmet right.

So according to the profile data, the ConfigMapping is the problem.

This is serious and we need to address it ASAP.

One way, as @gsmet and I have mentioned in the past, would be to move the class generation to extension build time.

@radcortez
Member

At a quick glance, it seems that some of the allocation issues are related to NameIterator, which is the class we use to retrieve the segments of each key. We can probably improve some of the mapping code around that.
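
For illustration only (this is not the actual NameIterator code), the kind of index-based segment walking such a class enables, where a segment can be compared without allocating a substring for it:

```java
// Illustrative sketch, not SmallRye Config's NameIterator: walk the
// dot-separated segments of a key using indexes, and compare a segment
// against an expected name without allocating a substring for it.
public final class KeySegments {

    /** Returns true if the segment of {@code key} starting at {@code start} equals {@code expected}. */
    static boolean segmentEquals(String key, int start, String expected) {
        int end = key.indexOf('.', start);
        if (end == -1) {
            end = key.length();
        }
        return end - start == expected.length()
                && key.regionMatches(start, expected, 0, expected.length());
    }

    /** Returns the start index of the next segment, or -1 if there is none. */
    static int nextSegment(String key, int start) {
        int dot = key.indexOf('.', start);
        return dot == -1 ? -1 : dot + 1;
    }

    public static void main(String[] args) {
        String key = "quarkus.http.port";
        int start = 0;
        System.out.println(segmentEquals(key, start, "quarkus")); // true
        start = nextSegment(key, start);
        System.out.println(segmentEquals(key, start, "http"));    // true
    }
}
```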

@radcortez
Member

> One way as @gsmet and I have mentioned in the past would be to move the class generation to extension build time

I don't think it would be any different for this particular case. All the generation already happens at build time; this is the mapping part. There are probably some pieces on the mapping side that could be moved to build time.

@geoand
Contributor

geoand commented Aug 25, 2023

Cool. I'd be glad to help if you need assistance

@radcortez
Member

@johnaohara can you please attach the allocation files? Thanks!

@franz1981
Contributor

@johnaohara if you attach the JFR file, people can use it to extract the lines too.

@gsmet
Member

gsmet commented Aug 25, 2023

@radcortez FYI the async profiler output is available in this comment: #35406 (comment)

@franz1981
Contributor

franz1981 commented Aug 25, 2023

@gsmet it wasn't using the mode I explained later, so it is slightly less accurate, but I haven't verified by how much.

@radcortez
Member

> @radcortez FYI the async profiler output is available in this comment: #35406 (comment)

Thanks. Sorry, I didn't notice they were attached earlier. I was only looking at the latest screenshots.

@radcortez
Member

Drilling down on the graphs, a lot of the allocations are performed by the method at https://github.com/smallrye/smallrye-config/blob/30be0d1def67783b5bdc977d4aad2ce5b82dc186/implementation/src/main/java/io/smallrye/config/ConfigMappingProvider.java#L1053.

This method tries to do some matching between environment variables and the mapped properties. Since we now have more mappings, and if there is an issue with this method, it makes sense that we didn't notice it until now, as we didn't have that many mappings before.

Also, many things run twice because of the static init config and the runtime init config, which are created separately. For some time I have wanted to reuse some of the static config work to feed the runtime config and reduce the allocations. I guess we need to do that now.
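
To make the cost concrete, a rough sketch of what such matching implies (illustrative only, not the SmallRye Config implementation; the environment-variable form follows the MicroProfile Config rule of replacing non-alphanumeric characters with '_' and uppercasing):

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of why the matching is costly: for every mapped property
// we derive its environment-variable form and compare it against every
// environment variable. The work, and the temporary strings/builders,
// grow with (mapped properties x environment variables).
public class EnvMatchingSketch {

    static String toEnvName(String property) {
        StringBuilder sb = new StringBuilder(property.length());
        for (int i = 0; i < property.length(); i++) {
            char c = property.charAt(i);
            sb.append(Character.isLetterOrDigit(c) ? Character.toUpperCase(c) : '_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> mappedProperties = List.of("quarkus.http.port", "quarkus.http.host");
        Map<String, String> env = System.getenv();

        for (String property : mappedProperties) {      // P iterations
            String envName = toEnvName(property);        // one temporary builder + String each
            for (String envVar : env.keySet()) {          // x E iterations
                if (envVar.equals(envName)) {
                    System.out.println(envVar + " overrides " + property);
                }
            }
        }
    }
}
```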

@franz1981
Contributor

Looking at it quickly, I see many low-hanging fruits there regardless... e.g., many of the Strings and builders allocated are not necessary at all...

@geoand
Contributor

geoand commented Aug 25, 2023

Right, this is how things are meant to work :)
We actually put effort into optimizing stuff when it starts causing a problem.

@radcortez
Member

Correct. I'll start the work right away when I get back. Meanwhile, feel free to revert the original commit to avoid affecting the performance further.

We now have a baseline and we can work and improve on it. Thanks!

@franz1981
Contributor

That's a fair point indeed, @geoand, or we would end up writing ugly code for every single line of code!

@johnaohara
Member Author

The zipped flame graphs in #35406 (comment) were allocation samples.

We updated the methodology to capture the allocation sizes and give a more accurate picture of what is causing the extra memory pressure; this methodology produced the screenshots in #35406 (comment)

Attached are the second set of allocation-size profiles:

alloc-sizes-6f55d65.zip

alloc-sizes-3875d03.zip

@geoand
Contributor

geoand commented Aug 25, 2023

> feel free to revert the original commit to avoid affecting the performance further

I think we can live with this in 3.3 (especially as 3.2 is not affected) and have a better 3.4

@franz1981
Contributor

@radcortez let me know if the method @johnaohara used, which I explained in the comment above, works for you...
It should be reliable enough that just looking at the number of samples (which is the total number of allocated bytes) will immediately show any improvement, but beware... it comes with a startup time cost (because it is not meant to measure that), like 14 seconds for the test John made!

@radcortez
Member

Sure. Thanks!

@franz1981
Contributor

@radcortez I've sent something at smallrye/smallrye-config#984
Let me know if that helps; my approach has been a bit brutal, but it should help regardless of other optimizations.

@radcortez
Member

radcortez commented Aug 28, 2023

To add more information:

The issue is caused by special rules to match environment variables against regular properties. The matching is applied between all available environment variables and all properties available in mappings. We hardly noticed this before because only a few extensions had been migrated; with the increase in mappings, the issue became more visible.

@franz1981's PR in smallrye/smallrye-config#984 addresses the allocation issues, which should help when we do need to match, but adding a filter on which environment variables to consider drastically reduces the number of matches we need to perform (and the allocations).
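
A rough sketch of what such a filter looks like (illustrative only, not the actual change in the PR):

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch: pre-filter the environment variables by the known
// mapping prefixes so that the expensive per-property matching only runs
// against a handful of candidates instead of the whole environment.
public class EnvFilterSketch {

    public static void main(String[] args) {
        // Env-variable forms of the mapping prefixes, e.g. "quarkus." -> "QUARKUS_"
        Set<String> prefixes = Set.of("QUARKUS_", "SMALLRYE_CONFIG_");

        Map<String, String> candidates = System.getenv().entrySet().stream()
                .filter(e -> prefixes.stream().anyMatch(p -> e.getKey().startsWith(p)))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

        // Only these candidates would be fed into the property matching.
        System.out.println("Environment variables considered: " + candidates.keySet());
    }
}
```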

Currently, I'm seeing the following numbers in my box:

| Commit | Run Mode | RSS (MB) |
| --- | --- | --- |
| 6f55d65 | JVM | 163.87 |
| 3875d03 | JVM | 195.02 |
| with PR | JVM | 163.94 |
| 6f55d65 | Native | 58.52 |
| 3875d03 | Native | 69.11 |
| with PR | Native | 58.75 |

The PR seems to return the numbers to the previously expected RSS, maybe with a very slight increase, but I'll keep looking.
