Memory leak #185

benjwadams · 2024-08-02T17:06:26Z

Describe the bug
Memory leak under ERDDAP

To Reproduce
Steps to reproduce the behavior:
Uncertain, but this behavior occurs on some production ERDDAP servers with many aggregations.
Expected behavior
Reasonable memory usage that doesn't grow without
Screenshots
If applicable, add screenshots to help explain your problem.

Server

What ERDDAP™ server? If you're the admin, please provide what version of ERDDAP™, Tomcat, and Java you are using.
ERDDAP 2.23, based off Axiom's Docker-ERDDAP image
Desktop (please complete the following information):
OS: Linux

Additional context
Memory grows without bound in certain ERDDAP configurations, eventually exhausting memory.

ChrisJohnNOAA · 2024-08-02T17:38:38Z

One thing to note, Java generally uses all available memory before initiating garbage collection. This behavior can sometimes be interpreted as a leak, though it is just how the JVM behaves. Does memory usage reach a peak and then drop? If your server has no queries for a period does memory usage drop to a low baseline? Does the server crash with an out of memory error?

Can you provide additional information about your setup? For example what datasets (or at least dataset types) are you using? Are there queries or feature usage that seem to trigger your issues? Does it seem to happen if the server is under high load? Do you have a lot of datasets that refresh and are you using the new sharedWatchService?

Can you upgrade the software? ERDDAP's latest version is 2.24. I'd also recommend using the most recent LTS Java version (21), and ensuring the Tomcat version is up to date.

srstsavage · 2024-08-02T18:17:20Z

Currently docker-erddap doesn't have an image for ERDDAP 2.24. I can work on publishing that with Java 21 and updated Tomcat to facilitate testing here. Will ping when that's ready.

rmendels · 2024-08-04T18:53:01Z

@srstsavage In looking up how to set up your Docker it shows:

docker run
-p 8080:8080
--env ERDDAP_MIN_MEMORY=4G --env ERDDAP_MAX_MEMORY=8G
...
axiom/docker-erddap

ERDDAP_MIN_MEMORY, ERDDAP_MAX_MEMORY are not default settings in Tomcat, so it must be something you do internally in the docker image. What do these map do inside the image, and generally what does the setenv.sh look like, or any other scripts controlling the startup of Tomcat and Java (things like which garbage collector is being used, any restrictions on metaspace use etc etc). It may just be a coincidence, but most of the reports of this have been in sites running the Docker image. We have had a related problem but it has to do with swapping and running out of swap space, mainly because we need more memory - the Java memory model introduced in Java 16 changes a lot of how things are to be set. One of the key differences is that a lot of things that use to be in heap are not put there any more, so that the total java memory usage is much larger than max heap size (on our heavy used system heap is set at 10GB, rarely is more than 5GB-7GB used, but total Java memory usage stays around 21GB - so just monitoring heap size is't sufficient. ) Basically you don't want any swapping (a long story).

Also if memory serves in this particular instance -Xms, -Xmx are set to different values. For a server type setting it is much preferred that these be the same values, you can duckduckgo this and see discussions of this.

srstsavage · 2024-08-04T21:53:33Z

Docker image for ERDDAP 2.24 is available at axiom/docker-erddap:2.24-jdk21-openjdk

Github - tag, release

@rmendels At your suggestion I updated the README to favor ERDDAP_MEMORY which sets Xms and Xmx to the same value (#85). There may be some use cases where users would want to set those differently, but I agree that most users will probably want them set to the same value to avoid heap adjustments.

setenv.sh is available in the GitHub repository.

Do we have any concept of how many users are using Docker vs manually setting up a Tomcat instance? I have no clue, but I've only heard of Docker deployments for many years now (other than the CoastWatch mothership of course).

rmendels · 2024-08-04T22:19:57Z

@srstsavage @benjwadams Great with the image! Thanks. This might require some work, but would help us if once you get the particular ERDDAP instance mentioned above, if you could monitor the usage of the following:

heap space use
metaspace use
total java memory use (and the total memory available)
swap space use (and total swap space available)
number of threads running under java (I find the number given by say visualvm is not as good as that given by btop)

A detailed time series isn't as important as likely maximum values of each and some idea of how much they fluctuate. Particularly for total memory use, you need to have the ERDDAP completely loaded and running for a bit to get a feel for the total java memory and number of threads

One more thing that would be useful is to get an idea on what type of requests are being made during peak usage. So for instance is the heaviest use when NDBC accesses the data what are the URLs given. These type of things may help as narrow down if there is leak where it may be, or if there are Java settings that can alleviate the problems.

benjwadams · 2024-08-05T17:26:08Z

cc @pavaniankam92

benjwadams · 2024-10-18T17:14:33Z

We're still working on instrumentation of this issue, but I did put an .hprof dump into the Eclipse Memory Analyzer Tool from an ERDDAP that had been running for four days or so:

Memory usage currently reported by top is well in excess of this:

4062619 rocky 20 0 31.0g 23.8g 0 S 1.0 38.3 9472:41 java

ChrisJohnNOAA · 2024-10-21T16:38:11Z

We're still working on instrumentation of this issue, but I did put an .hprof dump into the Eclipse Memory Analyzer Tool from an ERDDAP that had been running for four days or so:

Memory usage currently reported by top is well in excess of this:

4062619 rocky 20 0 31.0g 23.8g 0 S 1.0 38.3 9472:41 java

Erddap.java holds a number of concurrent hashmaps which have data loaded into them shortly after the server starts up (and will hold it as long as the server is running- some profilers might flag this behavior as a potential leak because it will be very old). There are a number of ConcurrentHashMaps in Erddap.java, knowing which one(s) are large for you and if they are growing (particularly, growing after the initial load datasets) would be extremely helpful since I have not reproduced the problem you are having.

pavaniankam92 · 2024-10-30T15:50:09Z

Configured JVM Metrics of ERDDAP application

Please find the metrices for the past 12 hours, attached screenshots here for your reference
We will monitor the JVM metrices and see what parameters exceeds during the peak load

Let me know if you any questions on the same.

benjwadams · 2024-11-07T18:08:43Z

Having some serious ERDDAP issues today where I/O is getting bogged down and server load average is extremely high.

Got an interesting stack trace in the logs along with log message that explicitly mentions possible memory leak conditions, however:

07-Nov-2024 17:53:09.247 WARNING [Thread-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [erddap] is still processing a request that has yet to finish. This is very likely to create a memory leak. You can control the time allowed for requests to finish by using the unloadDelay attribute of the standard Context implementation. Stack trace of request processing thread:[
 [email protected]/java.io.RandomAccessFile.readBytes(Native Method)
 [email protected]/java.io.RandomAccessFile.read(RandomAccessFile.java:405)
 ucar.unidata.io.RandomAccessFile.read_(RandomAccessFile.java:738)
 ucar.unidata.io.RandomAccessFile.readBuffer(RandomAccessFile.java:503)
 ucar.unidata.io.RandomAccessFile.seek(RandomAccessFile.java:491)
 ucar.nc2.iosp.hdf5.BTree2$LeafNode.<init>(BTree2.java:181)
 ucar.nc2.iosp.hdf5.BTree2$InternalNode.recurse(BTree2.java:168)
 ucar.nc2.iosp.hdf5.BTree2.<init>(BTree2.java:88)
 ucar.nc2.internal.iosp.hdf5.H5objects.readGroupNew(H5objects.java:2112)
 ucar.nc2.internal.iosp.hdf5.H5objects.access$700(H5objects.java:27)
 ucar.nc2.internal.iosp.hdf5.H5objects$H5Group.<init>(H5objects.java:197)
 ucar.nc2.internal.iosp.hdf5.H5objects$H5Group.<init>(H5objects.java:155)
 ucar.nc2.internal.iosp.hdf5.H5objects.readRootSymbolTable(H5objects.java:59)
 ucar.nc2.internal.iosp.hdf5.H5headerNew.readSuperBlock1(H5headerNew.java:357)
 ucar.nc2.internal.iosp.hdf5.H5headerNew.read(H5headerNew.java:253)
 ucar.nc2.internal.iosp.hdf5.H5iospNew.build(H5iospNew.java:122)
 ucar.nc2.NetcdfFiles.build(NetcdfFiles.java:811)
 ucar.nc2.NetcdfFiles.open(NetcdfFiles.java:750)
 ucar.nc2.NetcdfFiles.open(NetcdfFiles.java:276)
 ucar.nc2.NetcdfFiles.open(NetcdfFiles.java:243)
 ucar.nc2.NetcdfFiles.open(NetcdfFiles.java:216)
 gov.noaa.pfel.coastwatch.griddata.NcHelper.openFile(NcHelper.java:849)
 gov.noaa.pfel.coastwatch.pointdata.Table.readNDNc(Table.java:6866)
 gov.noaa.pfel.erddap.dataset.EDDTableFromNcFiles.lowGetSourceDataFromFile(EDDTableFromNcFiles.java:212)
 gov.noaa.pfel.erddap.dataset.EDDTableFromFiles.getSourceDataFromFile(EDDTableFromFiles.java:3271)
 gov.noaa.pfel.erddap.dataset.EDDTableFromFilesCallable.call(EDDTableFromFilesCallable.java:105)
 gov.noaa.pfel.erddap.dataset.EDDTableFromFilesCallable.call(EDDTableFromFilesCallable.java:31)
 gov.noaa.pfel.erddap.util.ThreadedWorkManager.addTask(ThreadedWorkManager.java:35)
 gov.noaa.pfel.erddap.dataset.EDDTableFromFiles.getDataForDapQuery(EDDTableFromFiles.java:3925)
 gov.noaa.pfel.erddap.dataset.EDDTable.getTwawmForDapQuery(EDDTable.java:1744)
 gov.noaa.pfel.erddap.dataset.EDDTable.respondToDapQuery(EDDTable.java:3271)
 gov.noaa.pfel.erddap.Erddap.doDap(Erddap.java:4750)
 gov.noaa.pfel.erddap.Erddap.doGet(Erddap.java:629)
 jakarta.servlet.http.HttpServlet.service(HttpServlet.java:564)
 jakarta.servlet.http.HttpServlet.service(HttpServlet.java:658)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:205)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
 org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
 org.apache.catalina.filters.CorsFilter.handleNonCORS(CorsFilter.java:331)
 org.apache.catalina.filters.CorsFilter.doFilter(CorsFilter.java:158)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
 org.apache.catalina.filters.HttpHeaderSecurityFilter.doFilter(HttpHeaderSecurityFilter.java:129)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:167)
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90)
 org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:482)
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:115)
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93)
 org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:735)
 org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:673)
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:340)
 org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:391)
 org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
 org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:896)
 org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1744)
 org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
 org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
 org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
 org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
 [email protected]/java.lang.Thread.run(Thread.java:840)]

ChrisJohnNOAA · 2024-11-07T18:51:26Z

@benjwadams Based on that stack trace it looks like there may be expensive queries against netcdf based dataset(s). Can you check the ERDDAP logs for what the expensive queries might be? If there's a specific dataset that is causing a lot of expensive reads, I could try to look into what might be slow, if you share the dataset (and some sample files).

benjwadams · 2024-11-07T19:10:02Z

I've seen a couple, but I'm not sure if it's due to load averages being high in the first place.
I think it may be due to a sort taking place on some queries -- how does ERDDAP index data for things such as sorting operations?

benjwadams · 2024-11-07T19:17:09Z

Highly abridged output:

grep '#2062' log.txt{.previous,}
log.txt.previous:{{{{#2062 2024-11-07T18:57:39+00:00 [https] (unknownIPAddress) GET /erddap/tabledap/ce_312-20200705T1232.json?longitude,latitude,qartod_location_test_flag,time&orderBy(%22time%22)
log.txt:#2062 Error: ClientAbortException
log.txt:#2062 FAILURE. TIME=342080ms  (>10s!)
log.txt:*** sendError for request #2062 caught ERROR=ClientAbortException

Pertinent datasets.xml entry excerpt:

                        <!-- defaultDataQuery uses datasetID -->
                        <!--
                        <defaultDataQuery>&amp;trajectory=ce_312-20200705T1232</defaultDataQuery>
                        <defaultGraphQuery>longitude,latitude,time&amp;.draw=markers&amp;.marker=2|5&.color=0xFFFFFF&.colorBar=|||||</defaultGraphQuery>
                        -->
                        <reloadEveryNMinutes>720</reloadEveryNMinutes>
                        <updateEveryNMillis>-1</updateEveryNMillis>
                        <!-- use datasetID as the directory name -->
                        <fileDir>/data/data/priv_erddap/OOI-CE/ce_312-20200705T1232</fileDir>
                        <recursive>false</recursive>
                        <fileNameRegex>.*\.nc</fileNameRegex>
                        <metadataFrom>last</metadataFrom>
                        <sortedColumnSourceName>time</sortedColumnSourceName>
                        <sortFilesBySourceNames>trajectory time</sortFilesBySourceNames>
                        <fileTableInMemory>false</fileTableInMemory>
                        <accessibleViaFiles>true</accessibleViaFiles>

rmendels · 2024-11-07T19:54:16Z

Any chance you are getting hit by a scan. One of the one's we get can give 20-40 requests per second for most of the day. Depending on your settings, that can chew up emory like crazy (outside of the heap), leading to swapping, at which point things do bog down. Because the way Java memory works when we see scans going on, we have heap size at 14GB but we often see 32GB allocated to java when using a command like pidstat and heap space is not totally utilized). The moral seems to be that a site with a large amount of data and heavy usage needs a lot of RAM.

-Roy

On Nov 7, 2024, at 11:17 AM, Benjamin Adams ***@***.***> wrote: Highly abridged output: grep '#2062' log.txt{.previous,} log.txt.previous:{{{{#2062 2024-11-07T18:57:39+00:00 [https] (unknownIPAddress) GET /erddap/tabledap/ce_312-20200705T1232.json?longitude,latitude,qartod_location_test_flag,time&orderBy(%22time%22) log.txt:#2062 Error: ClientAbortException log.txt:#2062 FAILURE. TIME=342080ms (>10s!) log.txt:*** sendError for request #2062 caught ERROR=ClientAbortException Pertinent datasets.xml entry excerpt:   <reloadEveryNMinutes>720</reloadEveryNMinutes> <updateEveryNMillis>-1</updateEveryNMillis>  <fileDir>/data/data/priv_erddap/OOI-CE/ce_312-20200705T1232</fileDir> <recursive>false</recursive> <fileNameRegex>.*\.nc</fileNameRegex> <metadataFrom>last</metadataFrom> <sortedColumnSourceName>time</sortedColumnSourceName> <sortFilesBySourceNames>trajectory time</sortFilesBySourceNames> <fileTableInMemory>false</fileTableInMemory> <accessibleViaFiles>true</accessibleViaFiles> — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

********************** "The contents of this message do not reflect any position of the U.S. Government or NOAA." ********************** Roy Mendelssohn Supervisory Operations Research Analyst NOAA/NMFS Environmental Research Division Southwest Fisheries Science Center ***Note new street address*** 110 McAllister Way Santa Cruz, CA 95060 Phone: (831)-420-3666 Fax: (831) 420-3980 e-mail: ***@***.*** www: https://www.pfeg.noaa.gov/ "Old age and treachery will overcome youth and skill." "From those who have been given much, much will be expected" "the arc of the moral universe is long, but it bends toward justice" -MLK Jr.

benjwadams · 2024-11-07T20:09:11Z

Haven't been seeing extreme requests from external sources.

I think it might be the subset variables. I tried without qartod_location_test_flag and it ran very quickly. Without, it took about 7 1/2 minutes to load.

This might be something left field of the memory problems in this GitHub issue, but it does seem to eat up quite a but of I/O. Perhaps it's worth opening a separate issue for this particular behavior.

Of course, spawning a bunch of ERDDAP threads will chew up memory and CPU too while it's waiting for results.

ChrisJohnNOAA · 2024-11-08T14:01:00Z

@benjwadams Please do make a separate issue about the subset variables. It would be helpful if you could provide a full dataset definition, data files, and example queries on that dataset that demonstrate the issue.

srstsavage mentioned this issue Aug 4, 2024

Update README to favor ERDDAP_MEMORY usage (same Xms/Xms) axiom-data-science/docker-erddap#84

Closed

srstsavage mentioned this issue Aug 26, 2024

Docker deamon for ERDDAP hosted on AWS keeps crashing ioos/erddap-gold-standard#69

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory leak #185

Memory leak #185

benjwadams commented Aug 2, 2024

ChrisJohnNOAA commented Aug 2, 2024

srstsavage commented Aug 2, 2024

rmendels commented Aug 4, 2024

srstsavage commented Aug 4, 2024 •

edited

Loading

rmendels commented Aug 4, 2024

benjwadams commented Aug 5, 2024

benjwadams commented Oct 18, 2024

ChrisJohnNOAA commented Oct 21, 2024

pavaniankam92 commented Oct 30, 2024

benjwadams commented Nov 7, 2024 •

edited

Loading

ChrisJohnNOAA commented Nov 7, 2024

benjwadams commented Nov 7, 2024

benjwadams commented Nov 7, 2024

rmendels commented Nov 7, 2024 via email

benjwadams commented Nov 7, 2024

ChrisJohnNOAA commented Nov 8, 2024

Memory leak #185

Memory leak #185

Comments

benjwadams commented Aug 2, 2024

ChrisJohnNOAA commented Aug 2, 2024

srstsavage commented Aug 2, 2024

rmendels commented Aug 4, 2024

srstsavage commented Aug 4, 2024 • edited Loading

rmendels commented Aug 4, 2024

benjwadams commented Aug 5, 2024

benjwadams commented Oct 18, 2024

ChrisJohnNOAA commented Oct 21, 2024

pavaniankam92 commented Oct 30, 2024

benjwadams commented Nov 7, 2024 • edited Loading

ChrisJohnNOAA commented Nov 7, 2024

benjwadams commented Nov 7, 2024

benjwadams commented Nov 7, 2024

rmendels commented Nov 7, 2024 via email

benjwadams commented Nov 7, 2024

ChrisJohnNOAA commented Nov 8, 2024

srstsavage commented Aug 4, 2024 •

edited

Loading

benjwadams commented Nov 7, 2024 •

edited

Loading