Docker start.py receives spurious SIGTERM signal #3650


Closed
simranjeetc opened this issue Jun 24, 2021 · 38 comments

@simranjeetc

Describe the bug
I have a project, jdk8-b120, which is causing the OpenGrok server to shut down.

I want to disable the sync and reindex for this project. Please find my mirror.yml below:

# Empty config file for opengrok-mirror
ignore_errors: true

projects:
  jdk8.*:
    disabled: true
    incoming_check: false

Despite doing this, I still see a reindex getting triggered for this project (screenshot below). What am I missing here?
Also, how can I find what is causing the shutdown? Could the suggester be the reason?

I have used 1.4 for quite some time and never had any such issues.

Screenshots
(screenshot attached showing the reindex being triggered for jdk8-b120)

@vladak
Member

vladak commented Jun 24, 2021

What do you mean by 'shutdown', exactly?

@simranjeetc
Author

simranjeetc commented Jun 24, 2021

What do you mean by 'shutdown', exactly?

In my docker-compose file (below) you can see that I had to use the restart: always option because the container was coming down. Looking at the logs (attached), I found that Tomcat was starting time and again, as Starting Tomcat appears in the logs every time it does.

Let me know if you need any other info.

Please note that I have also done some manual restarts, so a few of those will be present in the attached logs as well.

opengrok-log.tar.gz

services:
  opengrok-1-7-11:
    container_name: opengrok-1.7.11
    image: opengrok/docker:1.7.11
    restart: always
    volumes:
      - 'opengrok_data1_7_11:/opengrok/data'
      - './src/:/opengrok/src/'  # source code
      - './etc/:/opengrok/etc/'  # folder contains configuration.xml
      - '/etc/localtime:/etc/localtime:ro'
      - './logs/:/usr/local/tomcat/logs/'
    ports:
      - "9090:8080/tcp"
      - "5001:5000/tcp"
    environment:
      SYNC_PERIOD_MINUTES: '30'
      INDEXER_OPT: '-H -P -G -R /opengrok/etc/read_only.xml'
    # Volumes store your data between container upgrades
volumes:
  opengrok_data1_7_11:
networks:
  opengrok-1-7-11:

vladak added the tools (Python tools) and docker labels on Jun 24, 2021
@vladak
Member

vladak commented Jun 24, 2021

It seems that all the occurrences in the log where Tomcat goes down are preceded by:

Received signal 15
Terminating Tomcat <subprocess.Popen object at 0x7f338f047fd0>

which means something sent the SIGTERM signal to the main container process, most probably externally. I don't think this is related to the processed data.

@vladak
Member

vladak commented Jun 24, 2021

For the record, I noticed the same thing when setting up http://demo.opengrok.org (#1740); however, I was not sure whether this was caused by watchtower or Docker itself.

@simranjeetc
Author

It seems that all the occurrences in the log where Tomcat goes down are preceded by:

Received signal 15
Terminating Tomcat <subprocess.Popen object at 0x7f338f047fd0>

which means something sent the SIGTERM signal to the main container process, most probably externally. I don't think this is related to the processed data.

On running the commands below:

docker container logs 2ef 2>&1 | grep -B 100 -i "Starting tomcat" | grep -i "Received signal 15" | uniq -c
Output: `7 Received signal 15`

You can ignore these as they would refer to docker-compose restarts issued by me.

docker container logs 2ef 2>&1 | grep -B 100 -i "Starting tomcat" | grep -i "Exception in thread Sync thread" | uniq -c
Output: `5 Exception in thread Sync thread:`

It is these instances that I am more concerned about. In the attached logs, the count would be 4. It is 5 for me because Tomcat shut down again after I shared those logs.

All of these have a suggester exception preceding them. As can be seen in the logs, it is happening for different projects, not the same one every time.

23-Jun-2021 04:06:01.623 INFO [Thread-20] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8080"]
23-Jun-2021 04:06:01.631 INFO [Thread-20] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]
23-Jun-2021 04:06:01.642 SEVERE [ForkJoinPool-122-worker-3] org.opengrok.suggest.Suggester.lambda$getRebuildRunnable$2 Could not rebuild suggester
net.openhft.chronicle.hash.ChronicleHashClosedException: Access to ChronicleMap{name=defs, file=/opengrok/data/suggester/jdk8-b120/defs_search_count.db, identityHash
Code=817420606} after close()
at net.openhft.chronicle.map.VanillaChronicleMap.q(VanillaChronicleMap.java:431)
at net.openhft.chronicle.map.VanillaChronicleMap.mapContext(VanillaChronicleMap.java:443)
at net.openhft.chronicle.map.VanillaChronicleMap.optimizedGet(VanillaChronicleMap.java:579)
at net.openhft.chronicle.map.VanillaChronicleMap.get(VanillaChronicleMap.java:567)
at java.base/java.util.concurrent.ConcurrentMap.getOrDefault(ConcurrentMap.java:88)
at org.opengrok.suggest.popular.impl.chronicle.ChronicleMapAdapter.get(ChronicleMapAdapter.java:62)
at org.opengrok.suggest.SuggesterProjectData.lambda$getSearchCounts$2(SuggesterProjectData.java:476)
at org.opengrok.suggest.SuggesterProjectData$WFSTInputIterator.weight(SuggesterProjectData.java:603)
at org.apache.lucene.search.suggest.SortedInputIterator.sort(SortedInputIterator.java:184)
at org.apache.lucene.search.suggest.SortedInputIterator.<init>(SortedInputIterator.java:76)
at org.apache.lucene.search.suggest.SortedInputIterator.<init>(SortedInputIterator.java:62)
at org.apache.lucene.search.suggest.fst.WFSTCompletionLookup$WFSTInputIterator.<init>(WFSTCompletionLookup.java:273)
at org.apache.lucene.search.suggest.fst.WFSTCompletionLookup.build(WFSTCompletionLookup.java:115)
at org.opengrok.suggest.SuggesterProjectData.build(SuggesterProjectData.java:266)
at org.opengrok.suggest.SuggesterProjectData.build(SuggesterProjectData.java:253)
at org.opengrok.suggest.SuggesterProjectData.rebuild(SuggesterProjectData.java:238)
at org.opengrok.suggest.Suggester.lambda$getRebuildRunnable$2(Suggester.java:347)
at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1407)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
23-Jun-2021 04:06:01.650 INFO [ForkJoinPool.commonPool-worker-15] org.opengrok.suggest.Suggester.shutdownAndAwaitTermination Suggesters for [jdk8-b120] were successfully rebuilt (took 0 seconds)
23-Jun-2021 04:06:01.749 INFO [Thread-20] org.opengrok.indexer.configuration.WatchDogService.stop Watchdog stopped
23-Jun-2021 04:06:01.767 SEVERE [Thread-20] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [ROOT] created a ThreadLocal with key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@2e0b17a2]) and a value of type [net.openhft.chronicle.hash.impl.ContextHolder] (value [net.openhft.chronicle.hash.impl.ContextHolder@3d4211e1]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
23-Jun-2021 04:06:01.805 INFO [Thread-20] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["http-nio-8080"]
23-Jun-2021 04:06:01.836 INFO [Thread-20] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["http-nio-8080"]
could not get configuration from web application on http://localhost:8080/
Exception in thread Sync thread:

@vladak
Member

vladak commented Jun 24, 2021

I don't see the Exception in thread Sync thread in the attached log file. Could you share the relevant snippet?

@simranjeetc
Author

simranjeetc commented Jun 24, 2021

I used this to narrow down to the snippets around Exception in thread Sync thread:

cat opengrok-v2.log | grep -i -B 1000 "Starting tomcat" | grep -i -A 50 -B 100 "exception in thread sync thread" | less

@vladak
Member

vladak commented Jun 24, 2021

The exception in the Sync thread was fixed in #3651 and the fix is part of 1.7.13; however, this is only a symptom of the problem. The exception while saving the configuration to disk happened because it was not possible to retrieve the configuration from the web app, which in turn failed because Tomcat was going down for some reason.

@simranjeetc
Author

simranjeetc commented Jun 25, 2021

25-Jun-2021 08:51:18.228 INFO [ForkJoinPool.commonPool-worker-11] org.opengrok.suggest.Suggester.rebuild Rebuilding the following suggesters: [verity]
25-Jun-2021 08:51:30.770 INFO [Thread-20] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8080"]
25-Jun-2021 08:51:30.779 INFO [Thread-20] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]
25-Jun-2021 08:51:30.878 INFO [Thread-20] org.opengrok.indexer.configuration.WatchDogService.stop Watchdog stopped
25-Jun-2021 08:51:30.963 SEVERE [ForkJoinPool-274-worker-3] org.opengrok.suggest.Suggester.lambda$getRebuildRunnable$2 Could not rebuild suggester
        net.openhft.chronicle.hash.ChronicleHashClosedException: Access to ChronicleMap{name=refs, file=/opengrok/data/suggester/verity/refs_search_count.db, identityHashCod
e=752444000} after close()
                at net.openhft.chronicle.map.VanillaChronicleMap.q(VanillaChronicleMap.java:431)
                at net.openhft.chronicle.map.VanillaChronicleMap.mapContext(VanillaChronicleMap.java:443)
                at net.openhft.chronicle.map.VanillaChronicleMap.optimizedGet(VanillaChronicleMap.java:579)
                at net.openhft.chronicle.map.VanillaChronicleMap.get(VanillaChronicleMap.java:567)
                at java.base/java.util.concurrent.ConcurrentMap.getOrDefault(ConcurrentMap.java:88)
                at org.opengrok.suggest.popular.impl.chronicle.ChronicleMapAdapter.get(ChronicleMapAdapter.java:62)
                at org.opengrok.suggest.SuggesterProjectData.lambda$getSearchCounts$2(SuggesterProjectData.java:476)
                at org.opengrok.suggest.SuggesterProjectData$WFSTInputIterator.weight(SuggesterProjectData.java:603)
                at org.apache.lucene.search.suggest.SortedInputIterator.sort(SortedInputIterator.java:184)
                at org.apache.lucene.search.suggest.SortedInputIterator.<init>(SortedInputIterator.java:76)
                at org.apache.lucene.search.suggest.SortedInputIterator.<init>(SortedInputIterator.java:62)
                at org.apache.lucene.search.suggest.fst.WFSTCompletionLookup$WFSTInputIterator.<init>(WFSTCompletionLookup.java:273)
                at org.apache.lucene.search.suggest.fst.WFSTCompletionLookup.build(WFSTCompletionLookup.java:115)
                at org.opengrok.suggest.SuggesterProjectData.build(SuggesterProjectData.java:266)
                at org.opengrok.suggest.SuggesterProjectData.build(SuggesterProjectData.java:253)
                at org.opengrok.suggest.SuggesterProjectData.rebuild(SuggesterProjectData.java:238)
                at org.opengrok.suggest.Suggester.lambda$getRebuildRunnable$2(Suggester.java:347)
                at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1407)
                at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
                at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
                at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
                at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
                at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
25-Jun-2021 08:51:30.965 INFO [ForkJoinPool.commonPool-worker-11] org.opengrok.suggest.Suggester.shutdownAndAwaitTermination Suggesters for [verity] were successfully rebuil
t (took 12 seconds)
25-Jun-2021 08:51:30.966 INFO [ForkJoinPool.commonPool-worker-5] org.opengrok.suggest.Suggester.rebuild Rebuilding the following suggesters: [verity]
25-Jun-2021 08:51:30.972 SEVERE [ForkJoinPool-275-worker-3] org.opengrok.suggest.Suggester.lambda$getRebuildRunnable$2 Could not rebuild suggester
	net.openhft.chronicle.hash.ChronicleHashClosedException: Access to ChronicleMap{name=defs, file=/opengrok/data/suggester/verity/defs_search_count.db, identityHashCod
e=2041243366} after close()
                at net.openhft.chronicle.map.VanillaChronicleMap.q(VanillaChronicleMap.java:409)
                at net.openhft.chronicle.map.VanillaChronicleMap.mapContext(VanillaChronicleMap.java:443)
                at net.openhft.chronicle.map.VanillaChronicleMap.optimizedGet(VanillaChronicleMap.java:579)
                at net.openhft.chronicle.map.VanillaChronicleMap.get(VanillaChronicleMap.java:567)
                at java.base/java.util.concurrent.ConcurrentMap.getOrDefault(ConcurrentMap.java:88)
                at org.opengrok.suggest.popular.impl.chronicle.ChronicleMapAdapter.get(ChronicleMapAdapter.java:62)
                at org.opengrok.suggest.SuggesterProjectData.lambda$getSearchCounts$2(SuggesterProjectData.java:476)
                at org.opengrok.suggest.SuggesterProjectData$WFSTInputIterator.weight(SuggesterProjectData.java:603)
                at org.apache.lucene.search.suggest.SortedInputIterator.sort(SortedInputIterator.java:184)
                at org.apache.lucene.search.suggest.SortedInputIterator.<init>(SortedInputIterator.java:76)
                at org.apache.lucene.search.suggest.SortedInputIterator.<init>(SortedInputIterator.java:62)
                at org.apache.lucene.search.suggest.fst.WFSTCompletionLookup$WFSTInputIterator.<init>(WFSTCompletionLookup.java:273)
                at org.apache.lucene.search.suggest.fst.WFSTCompletionLookup.build(WFSTCompletionLookup.java:115)
                at org.opengrok.suggest.SuggesterProjectData.build(SuggesterProjectData.java:266)
                at org.opengrok.suggest.SuggesterProjectData.build(SuggesterProjectData.java:253)
                at org.opengrok.suggest.SuggesterProjectData.rebuild(SuggesterProjectData.java:238)
                at org.opengrok.suggest.Suggester.lambda$getRebuildRunnable$2(Suggester.java:347)
                at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1407)
                at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
                at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
                at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
                at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
                at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
25-Jun-2021 08:51:30.974 INFO [ForkJoinPool.commonPool-worker-5] org.opengrok.suggest.Suggester.shutdownAndAwaitTermination Suggesters for [verity] were successfully rebuilt
 (took 0 seconds)
25-Jun-2021 08:51:30.987 SEVERE [Thread-20] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [ROOT] created a ThreadLocal wit
h key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@1d2e0f1c]) and a value of type [net.openhft.chronicle.hash.impl.ContextHolder] (value [net.openhft.chroni
cle.hash.impl.ContextHolder@5b017732]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable me
mory leak.
25-Jun-2021 08:51:30.988 SEVERE [Thread-20] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [ROOT] created a ThreadLocal wit
h key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@8c5f5b]) and a value of type [net.openhft.chronicle.hash.impl.ContextHolder] (value [net.openhft.chronicl
e.hash.impl.ContextHolder@74d214c1]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memo
ry leak.
25-Jun-2021 08:51:30.989 SEVERE [Thread-20] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [ROOT] created a ThreadLocal wit
h key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@8c5f5b]) and a value of type [net.openhft.chronicle.hash.impl.ContextHolder] (value [net.openhft.chronicl
e.hash.impl.ContextHolder@44abbbe0]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memo
ry leak.
25-Jun-2021 08:51:30.990 SEVERE [Thread-20] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [ROOT] created a ThreadLocal wit
h key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@8c5f5b]) and a value of type [net.openhft.chronicle.hash.impl.ContextHolder] (value [net.openhft.chronicl
e.hash.impl.ContextHolder@33fe07a5]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memo
ry leak.
25-Jun-2021 08:51:30.991 SEVERE [Thread-20] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [ROOT] created a ThreadLocal wit
h key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@8c5f5b]) and a value of type [net.openhft.chronicle.hash.impl.ContextHolder] (value [net.openhft.chronicl
e.hash.impl.ContextHolder@5469501d]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memo
ry leak.
25-Jun-2021 08:51:30.992 SEVERE [Thread-20] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [ROOT] created a ThreadLocal wit
h key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@1d2e0f1c]) and a value of type [net.openhft.chronicle.hash.impl.ContextHolder] (value [net.openhft.chroni
cle.hash.impl.ContextHolder@90a076b]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable mem
ory leak.
25-Jun-2021 08:51:30.992 SEVERE [Thread-20] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [ROOT] created a ThreadLocal wit
h key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@8c5f5b]) and a value of type [net.openhft.chronicle.hash.impl.ContextHolder] (value [net.openhft.chronicl
e.hash.impl.ContextHolder@6fea3b00]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memo
ry leak.
25-Jun-2021 08:51:30.993 SEVERE [Thread-20] org.apache.catalina.loader.WebappClassLoaderBase.checkThreadLocalMapForLeaks The web application [ROOT] created a ThreadLocal wit
h key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@53bdcce3]) and a value of type [net.openhft.chronicle.hash.impl.ContextHolder] (value [net.openhft.chroni
cle.hash.impl.ContextHolder@ad5a8a0]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable mem
ory leak.
25-Jun-2021 08:51:31.023 INFO [Thread-20] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["http-nio-8080"]
25-Jun-2021 08:51:31.032 INFO [Thread-20] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["http-nio-8080"]
could not get configuration from web application on http://localhost:8080/
Exception in thread Sync thread:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/scripts/start.py", line 350, in project_syncer
    save_config(logger, uri, config_path)
  File "/scripts/start.py", line 250, in save_config
    config_file.write(config)
TypeError: write() argument must be str, not None

This is the latest Tomcat shutdown. Let me know if you want me to try something to get you more information.

@vladak
Member

vladak commented Jun 25, 2021

Is there anything relevant before these log entries?

The 2 lines:

25-Jun-2021 08:51:30.770 INFO [Thread-20] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8080"]
25-Jun-2021 08:51:30.779 INFO [Thread-20] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]

basically say that Tomcat is stopping. There is no info on why it is stopping. The first line actually comes from the Tomcat-10/java/org/apache/catalina/core/StandardService.java#stopInternal() method, as it stops the Connectors first: https://github.com/apache/tomcat/blob/197edd2f4bb50bb46ce4c35e6d49765916662ccc/java/org/apache/catalina/core/StandardService.java#L470-L497

The rest of the log is just fallout, I think: the suggester failing because the web app is yanked out from underneath it, and the start.py API calls failing for the same reason.

The next step would be raising the log level of Tomcat itself. Something like this (assuming the container is running):

  • docker container cp <container name>:/usr/local/tomcat/conf /tmp/conf
  • change the docker compose volumes to contain: '/tmp/conf:/usr/local/tomcat/conf'
  • change /tmp/conf/logging.properties to contain .level = ALL and change java.util.logging.ConsoleHandler.level to FINEST
  • start the container with this configuration and wait for Tomcat to stop

This will produce lots of output. Hopefully, the next time Tomcat stops, there will be some relevant log messages.

@simranjeetc
Author

simranjeetc commented Jun 26, 2021

A couple of observations:

  • This time the restart happened after approximately 21 hours, which is quite a bit more than what I have seen before (4-5 hours).
  • There has been less activity on the grok instance in the last 24 hours.
  • This time, instead of doing a restart with docker-compose restart, I did a docker-compose down and then docker-compose up -d.

Also attaching the logging.properties for your reference.

handlers = java.util.logging.ConsoleHandler

.handlers = java.util.logging.ConsoleHandler

############################################################
# Handler specific properties.
# Describes specific configuration info for Handlers.
############################################################

java.util.logging.ConsoleHandler.level = FINEST
java.util.logging.ConsoleHandler.formatter = org.apache.juli.OneLineFormatter

############################################################
# Facility specific properties.
# Provides extra control for each logger.
############################################################

org.apache.catalina.core.ContainerBase.[Catalina].[localhost].level = ALL
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].handlers = java.util.logging.ConsoleHandler

org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].level = ALL
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].handlers =  java.util.logging.ConsoleHandler

org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].level = ALL
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].handlers = java.util.logging.ConsoleHandler

@vladak
Member

vladak commented Jun 28, 2021

The log level was increased; however, there is no information about why the StandardService was stopped:

INFO: Refreshing searcher managers to: http://localhost:8080/
Jun 26, 2021 5:54:18 PM org.opengrok.indexer.util.Statistics logIt
INFO: Indexer finished (took 0:01:52)

Sync done
Saving configuration to /opengrok/etc/configuration.xml
26-Jun-2021 17:54:47.880 INFO [Thread-20] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8080"]
26-Jun-2021 17:54:47.889 INFO [Thread-20] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]
26-Jun-2021 17:54:47.989 FINE [Thread-20] org.apache.catalina.core.StandardContext.filterStop Stopping filters
26-Jun-2021 17:54:47.989 FINE [Thread-20] org.apache.catalina.core.StandardContext.filterStop  Stopping filter 'ExpiresHalfHourFilter'
26-Jun-2021 17:54:47.990 FINE [Thread-20] org.apache.catalina.core.StandardContext.filterStop  Stopping filter 'AuthorizationFilter'
26-Jun-2021 17:54:47.990 FINE [Thread-20] org.apache.catalina.core.StandardContext.filterStop  Stopping filter 'CookieFilter'
26-Jun-2021 17:54:47.990 FINE [Thread-20] org.apache.catalina.core.StandardContext.filterStop  Stopping filter 'Tomcat WebSocket (JSR356) Filter'
26-Jun-2021 17:54:47.990 FINE [Thread-20] org.apache.catalina.core.StandardContext.filterStop  Stopping filter 'ExpiresOneDayFilter'
26-Jun-2021 17:54:47.991 FINE [Thread-20] org.apache.catalina.core.StandardContext.filterStop  Stopping filter 'ExpiresOneYearFilter'
26-Jun-2021 17:54:47.991 FINE [Thread-20] org.apache.catalina.core.StandardContext.filterStop  Stopping filter 'StatisticsFilter'
26-Jun-2021 17:54:48.007 INFO [Thread-20] org.opengrok.indexer.configuration.WatchDogService.stop Watchdog stopped

Maybe you can use the big hammer and add:

.level = ALL

to the logging.properties file.

@simranjeetc
Author

I tried something else too. I noticed that whenever there is a restart, there is a Lucene exception before it:

WARNING: ERROR addFile(): /opengrok/src/jdk8-b120/langtools/test/tools/javac/limits/LongName.java
java.lang.IllegalArgumentException: Document contains at least one immense term in field="defs" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 110, 109, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 48, 49, 50, 51]...', original message: bytes can be at most 32766 in length; got 140000
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:981)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:524)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:488)

This is being caused by the jdk8-b120 project, so I removed that project from the src directory. Since then, Tomcat has been up for the last 24 hours. So either this exception is the cause, or it was causing the shutdown to happen more frequently. There are instances, however, where this exception occurred and the server did not shut down.

I only need this project to be indexed once; after that it can be ignored. I tried doing this using the mirror.yml in #3650 (comment), but it is still getting indexed every time. Can you please tell me what else I need to do to disable future reindexing of this project?

Meanwhile, I will change the .level to ALL and report back with logs.

@vladak
Member

vladak commented Jun 28, 2021

I think the /opengrok/src/jdk8-b120/langtools/test/tools/javac/limits/LongName.java exception is just a red herring. In the last attached log it happens many more times (in fact, I think it happens during each reindex of the project) than Tomcat stops. It would be highly suspicious if Tomcat stopped because one of its applications experienced a problem.

Trying https://github.com/openjdk/jdk/tree/jdk8-b120 with opengrok/docker:master to see if I can reproduce this.

@jwehler

jwehler commented Jun 28, 2021

Jumping into this issue with a 'me too' on the shutdown. I see it on a semi-regular basis over the past 3 releases minimum. Maybe once every two days I'll find that the opengrok docker container isn't running (I need to learn the auto-restart tricks). It always happens near the end of a periodic indexing operation, seemingly around the suggester rebuild time. I just had it happen now while on release 1.7.13. My log levels are not high, so I do not think there is anything new in my information that isn't already listed above.

...
Jun 28, 2021 2:18:20 PM org.opengrok.indexer.index.Indexer refreshSearcherManagers
INFO: Refreshing searcher managers to: http://localhost:8080/
Jun 28, 2021 2:18:20 PM org.opengrok.indexer.util.Statistics logIt
INFO: Indexer finished (took 48.747 seconds)

Sync done
28-Jun-2021 14:21:31.467 INFO [Thread-27] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8080"]
28-Jun-2021 14:21:31.476 INFO [Thread-27] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]
28-Jun-2021 14:21:31.502 INFO [Thread-27] org.apache.catalina.core.StandardWrapper.unload Waiting for [1] instance(s) to be deallocated for Servlet [org.opengrok.web.api.v1.RestApp]
Saving configuration to /opengrok/etc/configuration.xml
Waiting for reindex to be triggered
28-Jun-2021 14:21:31.619 INFO [Thread-27] org.opengrok.indexer.configuration.WatchDogService.stop Watchdog stopped
28-Jun-2021 14:22:37.714 SEVERE [ForkJoinPool-113-worker-19] org.opengrok.suggest.Suggester.lambda$getRebuildRunnable$2 Could not rebuild suggester
        net.openhft.chronicle.hash.ChronicleHashClosedException: Access to ChronicleMap{name=type, file=/opengrok/data/suggester/myproject/type_search_count.db, identityHashCode=1033479179} after close()
                at net.openhft.chronicle.map.VanillaChronicleMap.q(VanillaChronicleMap.java:409)
                at net.openhft.chronicle.map.VanillaChronicleMap.mapContext(VanillaChronicleMap.java:443)
                at net.openhft.chronicle.map.VanillaChronicleMap.optimizedGet(VanillaChronicleMap.java:579)
                at net.openhft.chronicle.map.VanillaChronicleMap.get(VanillaChronicleMap.java:567)
                at java.base/java.util.concurrent.ConcurrentMap.getOrDefault(ConcurrentMap.java:88)
                at org.opengrok.suggest.popular.impl.chronicle.ChronicleMapAdapter.get(ChronicleMapAdapter.java:62)
                at org.opengrok.suggest.SuggesterProjectData.lambda$getSearchCounts$2(SuggesterProjectData.java:476)
                at org.opengrok.suggest.SuggesterProjectData$WFSTInputIterator.weight(SuggesterProjectData.java:603)
                at org.apache.lucene.search.suggest.SortedInputIterator.sort(SortedInputIterator.java:184)
                at org.apache.lucene.search.suggest.SortedInputIterator.<init>(SortedInputIterator.java:76)
                at org.apache.lucene.search.suggest.SortedInputIterator.<init>(SortedInputIterator.java:62)
                at org.apache.lucene.search.suggest.fst.WFSTCompletionLookup$WFSTInputIterator.<init>(WFSTCompletionLookup.java:273)
                at org.apache.lucene.search.suggest.fst.WFSTCompletionLookup.build(WFSTCompletionLookup.java:115)
                at org.opengrok.suggest.SuggesterProjectData.build(SuggesterProjectData.java:266)
                at org.opengrok.suggest.SuggesterProjectData.build(SuggesterProjectData.java:253)
                at org.opengrok.suggest.SuggesterProjectData.rebuild(SuggesterProjectData.java:238)
                at org.opengrok.suggest.Suggester.lambda$getRebuildRunnable$2(Suggester.java:347)
                at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1407)
                at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
                at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
                at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
                at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
                at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
28-Jun-2021 14:22:37.716 INFO [ForkJoinPool.commonPool-worker-23] org.opengrok.suggest.Suggester.shutdownAndAwaitTermination Suggesters for [myproject] were successfully rebuilt (took 1 minute 27 seconds)
...
<repeating pattern for each project, about a dozen, but with a rebuild of 0 seconds until things shutdown>
...
28-Jun-2021 14:22:37.754 INFO [ForkJoinPool.commonPool-worker-1] org.opengrok.suggest.Suggester.shutdownAndAwaitTermination Suggesters for [someotherproject] were successfully rebuilt (took 0 seconds)
28-Jun-2021 14:22:37.784 INFO [Thread-27] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["http-nio-8080"]
28-Jun-2021 14:22:37.837 INFO [Thread-27] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["http-nio-8080"]

(Replaced my project names with generic names in the log output.)

It seems like the SEVERE errors are an artifact of things shutting down out from under it, but I don't see anything in the logs leading up to it that indicates what triggered it.

@vladak
Member

vladak commented Jun 29, 2021

It seems like I reproduced this on my laptop with the latest master Docker image:

opengrok-test | INFO: Indexer finished (took 0:01:22)
opengrok-test | 
opengrok-test | 29-Jun-2021 07:32:36.009 INFO [Thread-14] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8080"]
opengrok-test | Sync done
opengrok-test | 29-Jun-2021 07:32:36.015 INFO [Thread-14] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]
opengrok-test | 29-Jun-2021 07:32:36.044 INFO [Thread-14] org.opengrok.indexer.configuration.WatchDogService.stop Watchdog stopped
opengrok-test | 29-Jun-2021 07:32:36.065 SEVERE [ForkJoinPool-2305-worker-3] org.opengrok.suggest.Suggester.lambda$getRebuildRunnable$2 Could not rebuild suggester

...

opengrok-test | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
opengrok-test | Waiting for reindex to be triggered
opengrok-test exited with code 0

It took more than 24 hours (the indexer period was set to 5 minutes). Next, I will run the container with full Tomcat debug and also killsnoop-bpfcc and execsnoop-bpfcc.

@vladak
Member

vladak commented Jun 30, 2021

The next run ended the same way after some 24 hours (I could not really tell the precise time because of a power outage at night that might have put my laptop to sleep for some time; also, I did not properly record the startup time).

The container went down with:

opengrok-test | INFO: Indexer finished (took 55.313 seconds)
opengrok-test | 
opengrok-test | Received signal 15
opengrok-test | Terminating Tomcat <subprocess.Popen object at 0x7fc1ea2a3780>
opengrok-test | 30-Jun-2021 07:44:00.366 FINE [Thread-14] org.apache.catalina.util.LifecycleBase.setStateInternal Setting state for [StandardServer[8005]] to [STOPPING_PREP]
opengrok-test | 30-Jun-2021 07:44:00.366 FINE [Thread-14] org.apache.catalina.util.LifecycleBase.setStateInternal Setting state for [StandardServer[8005]] to [STOPPING]
opengrok-test | 30-Jun-2021 07:44:00.366 FINE [Thread-14] org.apache.catalina.mbeans.GlobalResourcesLifecycleListener.destroyMBeans Destroying MBeans for Global JNDI Resources
opengrok-test | Sync done
opengrok-test | 30-Jun-2021 07:44:00.367 FINE [Thread-14] org.apache.catalina.util.LifecycleBase.setStateInternal Setting state for [StandardService[Catalina]] to [STOPPING_PREP]
opengrok-test | 30-Jun-2021 07:44:00.367 INFO [Thread-14] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8080"]
opengrok-test | 30-Jun-2021 07:44:00.368 FINE [Thread-14] org.apache.tomcat.util.net.AbstractEndpoint.unlockAccept About to unlock socket for:/172.21.0.2:8080
opengrok-test | 30-Jun-2021 07:44:00.368 FINE [Thread-14] org.apache.tomcat.util.net.AbstractEndpoint.unlockAccept Socket unlock completed for:/172.21.0.2:8080
opengrok-test | 30-Jun-2021 07:44:00.373 INFO [Thread-14] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]

and the signal trace contained this:

07:43:32  567    systemd-udevd    15   1720360 0
07:43:32  567    systemd-udevd    15   1720366 0
07:43:32  567    systemd-udevd    15   1720370 0
07:43:56  1719906 java             9    127991 0
07:43:56  1568456 python3          15   127500 0
07:43:56  1568456 python3          15   127501 0
07:43:56  1568456 python3          15   127502 0
07:43:56  1568456 python3          15   127503 0
07:43:56  1568456 python3          15   127504 0
07:43:56  1568456 python3          15   127505 0
07:43:56  1568456 python3          15   127507 0
07:45:01  531    systemd-journal  0    1024   0
07:45:01  531    systemd-journal  0    1062   0
07:45:01  531    systemd-journal  0    1063   0
07:45:01  531    systemd-journal  0    1913   0
07:45:01  531    systemd-journal  0    1017   0

which did not match the main start.py process; I captured the process tree when the container was started:

1568348 ?        Sl     0:29 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 2b2f27fddb7d32476d79ea1a60179805b6bd4bd5ad5d42ccc4d6b37c0e01949a -address /run/containerd/containerd.sock
1568370 ?        Ssl    0:13  \_ python3 /scripts/start.py                      
1568463 ?        Sl   100:29      \_ /usr/local/openjdk-11/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dorg.apache.catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.apache.catalina.startup.Bootstrap start
1633488 ?        S      0:00      \_ python3 /scripts/start.py                  
1633489 ?        S      0:00      \_ python3 /scripts/start.py                  
1633490 ?        S      0:00      \_ python3 /scripts/start.py                  
1633491 ?        S      0:00      \_ python3 /scripts/start.py                  
1633492 ?        S      0:00      \_ python3 /scripts/start.py                  
1633493 ?        Sl     0:00      \_ python3 /scripts/start.py                  
1633514 ?        Sl     0:00      |   \_ /usr/bin/python3 /usr/local/bin/opengrok-reindex-project --printoutput --jar /opengrok/lib/opengrok.jar -U http://localhost:8080/ -P jdk8-b120 -- -r dirbased -G -m 256 --leadingWildCards on -c /usr/local/bin/ctags -U http://localhost:8080/ -H jdk8-b120
1633526 ?        Sl     1:59      |       \_ /usr/local/openjdk-11/bin/java -Dorg.opengrok.indexer.history.Subversion=/usr/bin/svn -Dorg.opengrok.indexer.history.Mercurial=/usr/bin/hg -jar /opengrok/lib/opengrok.jar -R /tmp/tmplau_snfw -r dirbased -G -m 256 --leadingWildCards on -c /usr/local/bin/ctags -U http://localhost:8080/ -H jdk8-b120

so it is not possible to tell which process actually sent the SIGTERM. Anyhow, the SIGTERM was somehow delivered to the main process. It could be a problem with the tracing or with how the signal was delivered; for instance, it could be delivered by means other than the kill syscall, as noted in https://stackoverflow.com/a/62434615/11582827

Initially I thought there was some way for e.g. dockerd to send the signal to the container; however, I am starting to think there is some weird way in which the synchronization done in do_sync() could lead to the SIGTERM being sent to the main process. The way do_sync() works is that it creates a pool of worker processes to handle the synchronization of multiple projects in parallel. When the work is done, these processes are terminated with SIGTERM, as visible in the killsnoop-bpfcc log above. There were 8 worker processes (corresponding to multiprocessing.cpu_count()) that were SIGTERM'd after the sync was done. This is done automatically by the Pool context manager:

with Pool(processes=numworkers) as pool:
    try:
        cmds_base_results = pool.map(worker, cmds_base, 1)
    except KeyboardInterrupt:
        return FAILURE_EXITVAL
    else:
        for cmds_base in cmds_base_results:
            cmds = CommandSequence(cmds_base)
            cmds.fill(cmds_base.retcodes, cmds_base.outputs,
                      cmds_base.failed)
            r = cmds.check(ignore_errors)
            if r != SUCCESS_EXITVAL:
                retval = FAILURE_EXITVAL
            if print_output and logger:
                cmds.print_outputs(logger, lines=True)

To verify the hypothesis that this is caused by do_sync(), I am going to let a container run with SYNC_PERIOD_MINUTES: '0' for a couple of days. Next, I will probably add some better signal tracing to start.py using signal.sigwaitinfo().
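
For anyone who wants to poke at the Pool/SIGTERM interaction outside of the container, below is a minimal standalone sketch (not OpenGrok code; the handler and worker function are made-up stand-ins) of the setup described above: a SIGTERM handler installed in the parent before the Pool is created, which the forked workers then inherit and run when the pool is torn down:

import multiprocessing
import os
import signal
import sys
import time


def signal_handler(signum, frame):
    # Stand-in for the handler installed by start.py: report the signal and exit.
    print("Received signal {} in PID {}".format(signum, os.getpid()), flush=True)
    sys.exit(0)


def do_work(item):
    # Stand-in for the per-project synchronization work.
    time.sleep(0.1)
    return item


if __name__ == "__main__":
    # Installed before the Pool is created, so with the default 'fork' start
    # method on Linux every worker process inherits this handler.
    signal.signal(signal.SIGTERM, signal_handler)

    with multiprocessing.Pool(processes=4) as pool:
        print(pool.map(do_work, range(8)))
    # Leaving the 'with' block tears the pool down; workers that are still
    # alive receive SIGTERM and each of them runs the inherited handler above.

Running it should print a "Received signal 15 ..." line per worker, mirroring the repeated messages seen in the container log.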

@vladak
Member

vladak commented Jul 1, 2021

Wow, the container running opengrok/docker:master with SYNC_PERIOD_MINUTES: '0' (which means there was just a single sync, when the container was started) ended after some 10 hours with:

opengrok-test | Received signal 15
opengrok-test | Terminating Tomcat <subprocess.Popen object at 0x7f61ee975128>
opengrok-test | 30-Jun-2021 19:34:21.143 INFO [Thread-14] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8080"]
opengrok-test | 30-Jun-2021 19:34:21.149 INFO [Thread-14] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]
opengrok-test | 30-Jun-2021 19:34:21.176 INFO [Thread-14] org.opengrok.indexer.configuration.WatchDogService.stop Watchdog stopped
opengrok-test exited with code 0

which means this problem is not related to the syncing.

In the meantime I ran into https://stackoverflow.com/questions/68197507/why-does-sigterm-handling-mess-up-pool-map-handling while playing with do_sync().

@simranjeetc
Author

@vladak Any luck in figuring out the reason behind the shutdowns, and are there any fixes/workarounds?

@jwehler

jwehler commented Jul 14, 2021

Until a fix is in place, I ended up doing:

docker update --restart=always <opengrok-container>

and that has helped. Had one recovery since my last manual restart two days ago:

docker inspect --format '{{json .RestartCount}}' <opengrok-container>
1

Sadly you can't use the --restart=on-failure option because the exit code appears to be 0 when this happens.

@simranjeetc
Author

Until a fix is in place, I ended up doing:

docker update --restart=always <opengrok-container>

Agreed. I have been using docker-compose with the restart policy set to always for the last couple of weeks. With that, at least OpenGrok serves requests without much of a glitch.

@vladak
Member

vladak commented Jul 15, 2021

I modified start.py to serve basically as a signal waiter, so main() looks like this:

    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)

    logger.info("Waiting for signal on PID {}".format(os.getpid()))
    siginfo = signal.sigwaitinfo({signal.SIGTERM, signal.SIGINT})
    print("got %d from %d by user %d\n" % (siginfo.si_signo,
                                           siginfo.si_pid,
                                           siginfo.si_uid))

There is no sync done. It does not even start Tomcat.

After some time (I cannot tell how much exactly, since the run was interrupted by putting the laptop to sleep over the course of a couple of days, and the signal info print statement above does not give any time), it failed with:

opengrok-dev | 2021-07-07 14:16:13,291     INFO opengrok_tools | Waiting for signal on PID 1
opengrok-dev | got 15 from 0 by user 0

I used the modified killsnoop-bpfcc (iovisor/bcc#3520) to also capture the signals sent with sigqueue(); however, there is nothing in the log pertaining to the processes running in the container.

Now, if I Ctrl-C the container running in the foreground, it will also produce the "got 15 from 0 by user 0" message (just not on the container console; this probably has to do with shutting down the logging. In order to get the message I had to use strace on start.py.) However, in that case I can see related events in the BPF log:

11:17:20  253725 runc             15   253625 0
11:17:20  1894   dockerd          2    253574 0
11:17:20  1894   dockerd          2    253588 0

The lack of information about the signal sender can mean one of these things:

  • it was sent by the kernel
  • it was sent by other means (there are a couple of system calls in Linux that can also send signals, and those are not instrumented in my modified version of killsnoop-bpfcc)
  • the signal was not sent at all and it is just some Python artifact/bug

One of the things I'd like to try next (heh, this is getting a bit like a bug update in the corporate world, which I would like to avoid in this free-time project where no one should expect anything from the project owners) is to run the tomcat:10-jdk11 image (which the OpenGrok image is based on) with a minimal Python program (like the one above) to see if the behavior can be reproduced there as well. I am also thinking about a bisection over the Docker layers.

vladak changed the title from "Disable sync and reindex as project causing opengrok shutdown" to "Docker start.py receives spurious SIGTERM signal" on Jul 20, 2021
@WSUFan

WSUFan commented Aug 12, 2021

When you run the image, maybe you can add this option: "--log-driver=journald".

@mes85

mes85 commented Aug 26, 2021

I just found this issue report after investigating stopped OpenGrok Docker containers for 2-3 days.
On my setup the issue occurs quite often, by which I mean every 1-2 hours.

Is there anything I can try to investigate?

@vladak
Member

vladak commented Sep 21, 2021

When I tried to debug this the last time, the tracepoints I used did not generate any events, and siginfo lacked data about what generated the signal. I'd like to retry with the idea from iovisor/bcc#3520 (comment).

Maybe even publish a barebones Docker image that can reproduce the problem.

@AdityaSriramM

I still see this issue with the latest and master images.
What is the last known version/image that does not have this issue?

@lxylxy123456

@AdityaSriramM For me, version 1.6.9 (opengrok/docker:1.6.9) does not have this issue. You may want to bisect if you want a version >= 1.7.0.

@leonidwang

I can confirm this still exists in the latest image pulled from Docker Hub. An image built from the current master also has this.

The Docker image I built got this error around 4 times in 8 hours at the beginning and then looked stable for 19 hours (restart: always in the docker-compose config).

# docker ps
CONTAINER ID   IMAGE                 COMMAND               CREATED        STATUS        PORTS                    NAMES
f2fd678d93ee   opengrok-dev:latest   "/scripts/start.py"   27 hours ago   Up 19 hours   0.0.0.0:8888->8080/tcp   opengrok-8888
# docker-compose logs | grep -ie "Connection reset by peer"
opengrok-8888    | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
opengrok-8888    | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
opengrok-8888    | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
opengrok-8888    | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

The error is the same signal 15 after the indexer finished.

opengrok-8888    | Jun 08, 2022 8:53:16 AM org.opengrok.indexer.util.Statistics logIt
opengrok-8888    | INFO: Done indexing data of all repositories (took 0:01:38)
opengrok-8888    | Jun 08, 2022 8:53:16 AM org.opengrok.indexer.util.Statistics logIt
opengrok-8888    | INFO: Indexer finished (took 0:01:38)
opengrok-8888    |
opengrok-8888    | Received signal 15
opengrok-8888    | Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
opengrok-8888    | Sync done
opengrok-8888    | 08-Jun-2022 08:53:17.081 INFO [Thread-27] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8080"]
opengrok-8888    | 08-Jun-2022 08:53:17.084 INFO [Thread-27] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]
...
opengrok-8888    | 08-Jun-2022 08:53:17.296 INFO [Thread-27] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["http-nio-8080"]
opengrok-8888    | 08-Jun-2022 08:53:17.326 INFO [Thread-27] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["http-nio-8080"]
opengrok-8888    | Waiting for reindex to be triggered
opengrok-8888    | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

@leonidwang

leonidwang commented Jun 11, 2022

[Edit] I stand corrected. After running v1.7.14 longer, I hit the issue again, but it looks like this time there is no 'signal 15' in the log; the issue started from 'org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler' and then connection reset by peer.
v1.6.9 is stable for days.
I'll try other versions.


My deployment hadn't been upgraded for around a year, but I removed the old image immediately after pulling the latest one, so I can't remember the last good version.
It was the version from about a year ago. Checking the release history, it should be around 1.7.1x, so I tried different versions on several similar hosts.

I hit the error at 1.7.15, and it has been stable at 1.7.14 for 17 hours.
1.6.9 is also stable for over 24 hours.

# docker ps
CONTAINER ID   IMAGE                    COMMAND               CREATED        STATUS        PORTS                    NAMES
9fa1ad9664db   opengrok/docker:1.7.14   "/scripts/start.py"   17 hours ago   Up 17 hours   0.0.0.0:8888->8080/tcp   opengrok-8888


# docker ps
CONTAINER ID   IMAGE                COMMAND               CREATED        STATUS        PORTS                    NAMES
6bdaacef97bd   opengrok-dev:1.6.9   "/scripts/start.py"   27 hours ago   Up 24 hours   0.0.0.0:8888->8080/tcp   opengrok-8888

@Keinkil

Keinkil commented Nov 11, 2022

Is there any update on this issue?
We noticed it as well, but didn't have the bandwidth to put much more time into it and had to stay on 1.6.9 because of that; 1.7 has so many nice features that we would like to use.

@jkjha

jkjha commented Dec 6, 2022

I am also facing this issue. Is it possible to bump the priority of this issue, please?

Near the end of indexing, I see the log below and the container gets restarted...

Dec 06, 2022 11:28:13 AM org.opengrok.suggest.Suggester shutdownAndAwaitTermination
INFO: Suggesters for [censored-project-name] were successfully rebuilt (took 8 minutes 8 seconds)
Received signal 15
Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
Received signal 15
Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
Dec 06, 2022 11:35:35 AM org.apache.coyote.AbstractProtocol pause
INFO: Pausing ProtocolHandler ["http-nio-8080"]
Sync done
Dec 06, 2022 11:35:35 AM org.apache.catalina.core.StandardService stopInternal
INFO: Stopping service [Catalina]
Dec 06, 2022 11:35:36 AM org.opengrok.indexer.configuration.WatchDogService stop
INFO: Watchdog stopped
Dec 06, 2022 11:35:36 AM org.apache.coyote.AbstractProtocol stop
INFO: Stopping ProtocolHandler ["http-nio-8080"]
Dec 06, 2022 11:35:36 AM org.apache.coyote.AbstractProtocol destroy
INFO: Destroying ProtocolHandler ["http-nio-8080"]
could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Waiting for reindex to be triggered

Unfortunately, I do not see any suggested workaround for this except downgrading to 1.6.9, which I really want to avoid. Is there any other workaround that has worked for anyone?

@jkjha

jkjha commented Dec 7, 2022

Isn't this issue a blocker for using the Docker version of OpenGrok?
Unfortunately, we found this issue after releasing it to production and don't see a possible workaround. It seems it will cost too much for us to revert everything. :(
Please either put this as a NOTE in the Docker wiki of OpenGrok or bump the priority of this issue.

@vladak
Member

vladak commented Dec 7, 2022

Isn't this issue a blocker for using the Docker version of OpenGrok? Unfortunately, we found this issue after releasing it to production and don't see a possible workaround. It seems it will cost too much for us to revert everything. :( Please either put this as a NOTE in the Docker wiki of OpenGrok or bump the priority of this issue.

Some open-source projects should probably have some sort of warning in this regard (something along the lines of "do not use this unless you commit to dedicating resources to the project"), especially those that depend on a limited set of contributors with very limited resources and have no commercial support.

That said, I took another look at the problem, which seems to happen around the time the synchronization is done, at least in my case.

What I see is this:

opengrok-latest | Received signal 15
opengrok-latest | Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
opengrok-latest | Received signal 15
opengrok-latest | Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
opengrok-latest | Received signal 15
opengrok-latest | Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
opengrok-latest | Received signal 15
opengrok-latest | Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
opengrok-latest | Received signal 15
opengrok-latest | Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
opengrok-latest | Received signal 15
opengrok-latest | Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
opengrok-latest | Received signal 15
opengrok-latest | Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
opengrok-latest | Received signal 15
opengrok-latest | Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
opengrok-latest | 2022-12-07 16:55:41,502     INFO opengrok_tools | Sync done

The SIGTERM was reported 8 times, suspiciously matching the number of worker Python processes spawned by start.py. It happened when the overall synchronization was completing.

Running my killsnoop-bpfcc-with-sigqueue BPF trace program, I can see this:

TIME      PID    COMM             SIG  TPID   RESULT
...
17:55:41  906476 python3          15   201    0
17:55:41  906476 python3          15   202    0
17:55:41  906476 python3          15   203    0
17:55:41  906476 python3          15   204    0
17:55:41  906476 python3          15   205    0
17:55:41  906476 python3          15   206    0
17:55:41  906476 python3          15   207    0
17:55:41  906476 python3          15   208    0
17:55:41  906732 python3          15   11     0
17:55:41  906729 python3          15   11     0
17:55:41  906730 python3          15   11     0
17:55:41  906733 python3          15   11     0
17:55:41  906728 python3          15   11     0
17:55:41  906731 python3          15   11     0
17:55:41  906726 python3          15   11     0
17:55:41  906727 python3          15   11     0

Modulo the UTC offset, it matches the Docker container logs. Now, the output is clouded by the fact that the numbers in the PID column report PIDs from the global namespace while the TPID (target PID) is using the container PID namespace.

When the sync was still running, I captured the process listing from within the container:

    PID TTY      STAT   TIME COMMAND
      1 ?        Ssl    0:02 python3 /scripts/start.py
     11 ?        Sl     8:17 /opt/java/openjdk/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dorg.apache.catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.apache.catalina.startup.Bootstrap start
    201 ?        Sl     0:00 python3 /scripts/start.py
    202 ?        S      0:00 python3 /scripts/start.py
    203 ?        S      0:00 python3 /scripts/start.py
    204 ?        S      0:00 python3 /scripts/start.py
    205 ?        S      0:00 python3 /scripts/start.py
    206 ?        S      0:00 python3 /scripts/start.py
    207 ?        S      0:00 python3 /scripts/start.py
    208 ?        S      0:00 python3 /scripts/start.py

So, local PID 11 was the Tomcat process and PIDs 201-208 were the sync workers. PID 1 was the container entry program, start.py.

start.py imports do_sync() and calls it. I think this is an important fact; it used to be that do_sync() was executed via the external opengrok-sync command.

do_sync() uses a Python process Pool to distribute the tasks among the workers. Given that the "Received signal 15" message was displayed 8 times, it most probably means that the installed signal handler was invoked in each worker.

The exact mechanics of how this happens escape me. My hypothesis is that SIGTERM is used to communicate the end of the work for the pool workers. In the BPF output above, we can see that it is the global PID 906476 that is sending the signal to all the workers. While the sync was still running, I also captured the global process listing and it looked like this:

|-containerd-shim(906364)-+-python3(906385)-+-java(906482)-+-{java}(906502)    # this is Tomcat
...
           |- python3(906726)-+-opengrok-reinde(907304)-+-java(907377)
                  ... # indexer processes
           |-python3(906727)
           |-python3(906728)
           |-python3(906729)
           |-python3(906730)
           |-python3(906731)
           |-python3(906732)
           |-python3(906733)

I believe the global PIDs 906726-906733 correspond to PIDs 201-208 in the container namespace.

PID 906726 was the last worker that had some work. I think that when the worker queue is depleted, the last worker sends SIGTERM to the remaining workers, as visible in the BPF output.

However, start.py installs the SIGTERM signal handler before running main, i.e. before the workers are created:

opengrok/docker/start.py

Lines 594 to 598 in 92ce674

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)
    main()

which I think means that each worker has the handler installed. Given the teardown of the worker pool using SIGTERM, and given the existence of the signal handler, the observed problem happens.
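
If that is indeed the mechanism, one generic way around it (just a sketch of the technique under that assumption, not necessarily what the eventual fix commit does) is to reset the inherited handler back to the default disposition in a Pool worker initializer, so that only the parent process reacts to SIGTERM:

import multiprocessing
import signal


def worker_init():
    # Undo the handler inherited from the parent; the SIGTERM used during the
    # pool teardown then simply terminates the worker without running the
    # parent's shutdown logic.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)


def do_work(item):
    return item * 2


def parent_handler(signum, frame):
    # Stand-in for the container entry point's handler (e.g. stopping Tomcat);
    # only the main process should run this.
    print("parent got signal {}".format(signum), flush=True)


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, parent_handler)

    with multiprocessing.Pool(processes=4, initializer=worker_init) as pool:
        print(pool.map(do_work, range(8)))
    # The workers now die with the default SIGTERM disposition, so the pool
    # teardown no longer triggers the parent's handler in every worker.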

vladak added a commit to vladak/OpenGrok that referenced this issue Dec 7, 2022
vladak closed this as completed in 5acf3a4 on Dec 8, 2022
@one-ideaed

@vladak thank you so much for solving this issue!

@ChristopheBordieu

For people interested in the Python multiprocessing module, the Pool class, and how to close a pool of processes, this link (and the related pages) is interesting:
https://superfastpython.com/multiprocessing-pool-context-manager/

@ChristopheBordieu

Wouldn't it be worth adding a pool.close() and pool.join() in the with scope to ensure all tasks have a chance to finish cleanly?
Or is the issue strictly related to the signal handlers available in the forked processes?

@vladak
Member

vladak commented Dec 8, 2022

Wouldn't it be worth adding a pool.close() and pool.join() in the with scope to ensure all tasks have a chance to finish cleanly? Or is the issue strictly related to the signal handlers available in the forked processes?

Pool is used as a context manager:

with Pool(processes=numworkers) as pool:
so I believe the cleanup is implicit there.

The problem was with the inheritance of the signal handlers by the workers vs. the signals used for the Pool's internal workings, at least in this particular CPython implementation.
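
For completeness, here is a generic sketch (not the do_sync() code) of the explicit close()/join() variant mentioned above. Note that per the CPython documentation the Pool context manager's __exit__() calls terminate(), which is what sends SIGTERM to any still-running workers:

import multiprocessing


def do_work(item):
    return item + 1


if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)
    try:
        print(pool.map(do_work, range(8)))
    finally:
        # close() tells the workers there is no more work so they can exit on
        # their own; join() then waits for them. The context manager's
        # __exit__() uses terminate() instead, which SIGTERMs live workers.
        pool.close()
        pool.join()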

@jkjha

jkjha commented Dec 12, 2022

Thank you very much @vladak for fixing this. It will surely help a lot of users who love opengrok.
