Docker start.py receives spurious SIGTERM signal #3650
Comments
What do you mean by 'shutdown' exactly?
In my docker-compose file (below) you can see that I had to use `restart: always`. Let me know if you need any other info. Please note that I have also done some manual restarts, so a few of those would be present in the attached logs as well.

```yaml
services:
  opengrok-1-7-11:
    container_name: opengrok-1.7.11
    image: opengrok/docker:1.7.11
    restart: always
    volumes:
      - 'opengrok_data1_7_11:/opengrok/data'
      - './src/:/opengrok/src/' # source code
      - './etc/:/opengrok/etc/' # folder contains configuration.xml
      - '/etc/localtime:/etc/localtime:ro'
      - './logs/:/usr/local/tomcat/logs/'
    ports:
      - "9090:8080/tcp"
      - "5001:5000/tcp"
    environment:
      SYNC_PERIOD_MINUTES: '30'
      INDEXER_OPT: '-H -P -G -R /opengrok/etc/read_only.xml'

# Volumes store your data between container upgrades
volumes:
  opengrok_data1_7_11:

networks:
  opengrok-1-7-11:
```
It seems all the occurrences in the log when Tomcat is going down are preceded by `Received signal 15`, which means something sent the SIGTERM signal to the container.
For the record, I noticed the same thing when setting up http://demo.opengrok.org (#1740), however I was not sure what was causing it.
On running the commands below:

```shell
docker container logs 2ef 2>&1 | grep -B 100 -i "Starting tomcat" | grep -i "Received signal 15" | uniq -c
```

Output: `7 Received signal 15`

You can ignore these as they would refer to docker-compose restarts issued by me.

```shell
docker container logs 2ef 2>&1 | grep -B 100 -i "Starting tomcat" | grep -i "Exception in thread Sync thread" | uniq -c
```

Output: `5 Exception in thread Sync thread:`

It is these instances that I am more concerned about. In the attached logs the count would be 4; it is 5 for me since Tomcat shut down again after I shared those logs. All of these have a suggester exception preceding them. As can be seen in the logs, it is happening for different projects, not the same one every time.
I don't see the …
Used this to narrow down to the snippets around the shutdowns:

```shell
cat opengrok-v2.log | grep -i -B 1000 "Starting tomcat" | grep -i -A 50 -B 100 "exception in thread sync thread" | less
```
The exception in the Sync thread was fixed in #3651 and the fix is part of 1.7.13; however, this is only a symptom of the problem. The exception while saving the configuration to disk happened because it was not possible to retrieve the configuration from the web app, which in turn failed because Tomcat was going down for some reason.
This is the latest Tomcat shutdown. Let me know if you want me to try something to get you more information.
Is there anything relevant before these log entries? The two lines:
…
basically say that Tomcat is stopping. There is no info on why it is stopping. The first line actually comes from the … The rest of the log is just fallout, I think: the suggester failing because the web app is yanked from underneath it, the …

The next step would be raising the log level of Tomcat itself. Something like this (assumes the container is running):
…
This will produce lots of output. Hopefully the next time Tomcat is stopping there will be some relevant log messages.
A couple of observations/infos:
…
Also attaching the logging.properties for your reference:

```properties
handlers = java.util.logging.ConsoleHandler
.handlers = java.util.logging.ConsoleHandler

############################################################
# Handler specific properties.
# Describes specific configuration info for Handlers.
############################################################
java.util.logging.ConsoleHandler.level = FINEST
java.util.logging.ConsoleHandler.formatter = org.apache.juli.OneLineFormatter

############################################################
# Facility specific properties.
# Provides extra control for each logger.
############################################################
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].level = ALL
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].handlers = java.util.logging.ConsoleHandler
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].level = ALL
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].handlers = java.util.logging.ConsoleHandler
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].level = ALL
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].handlers = java.util.logging.ConsoleHandler
```
The log level was increased, however there is no information about why the shutdown is happening.

Maybe you can use the big hammer and add … to the …
I tried something else too. I noticed that whenever there is a restart, there is a Lucene exception before it.

This is being caused in the `jdk8-b120` project. I only need this project to be indexed once and after that it can be ignored. I tried doing this using the `mirror.yml`. Meanwhile, I will change the …
I think the … Trying https://github.com/openjdk/jdk/tree/jdk8-b120 with …
Jumping into this issue with a 'me too' on the shutdown. I have seen it on a semi-regular basis over the past three releases at minimum. Maybe once every two days I'll find that the OpenGrok Docker container isn't running (I need to learn the auto-restart tricks). It always happens near the end of a periodic indexing operation, seemingly around the suggester rebuild time. I just had it happen now while on release 1.7.13. My log levels are not high, so I do not think there is anything new in my information that isn't already listed above.

(I replaced my project names with generic names in the log output.) It seems like the severe errors are an artifact of things shutting down out from under it, but I don't see anything in the logs leading up to it that indicates what happened to trigger it.
Seems like I reproduced this on my laptop with the latest …
It took more than 24 hours (the indexer period was set to 5 minutes). Next I will run the container with full Tomcat debug and also …
The next run ended with the same result after some 24 hours (I could not really tell the precise time because of a power outage at night that might have put my laptop to sleep for some time; also, I did not properly record the time of the startup). The container went down with:
…
and the signal trace contained this:
…
which did not match the main … so it is not possible to tell which process actually sent the signal.

Initially I thought there is some way how e.g. the … (opengrok/tools/src/main/python/opengrok_tools/sync.py, lines 103 to 118 in 552e80c) …

To verify the hypothesis that this is caused by …
Wow, the container running … which means this problem is not related to the syncing.

In the meantime I ran into https://stackoverflow.com/questions/68197507/why-does-sigterm-handling-mess-up-pool-map-handling while playing with …
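For context, here is a minimal sketch (not OpenGrok code; the names are illustrative) of the behaviour underlying that Stack Overflow question: with the default `fork` start method on Linux, a Python-level SIGTERM handler installed before the `Pool` is created is also active inside every worker process.

```python
import multiprocessing
import os
import signal


def report(signum, frame):
    # Handler intended for the parent process only.
    print("handler ran in PID {} for signal {}".format(os.getpid(), signum))


def check(_):
    # Runs inside a worker; True if the parent's handler was inherited.
    return os.getpid(), signal.getsignal(signal.SIGTERM) is report


if __name__ == '__main__':
    signal.signal(signal.SIGTERM, report)   # installed only in the parent
    pool = multiprocessing.Pool(2)
    for pid, inherited in pool.map(check, range(4)):
        print("worker {} inherited the SIGTERM handler: {}".format(pid, inherited))
    pool.close()
    pool.join()
```

On Linux this prints `True` for every worker, which is why a handler that is supposed to react only to signals sent to the main process can end up running inside the pool workers as well.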
@vladak Any luck in figuring out the reason behind the shutdowns and any fixes/workarounds for the same?
Until a fix is in place, I ended up doing:
…
and that has helped. Had one recovery since my last manual restart two days ago:
…
Sadly you can't use the …
Agreed. I have been using docker-compose with the restart policy set to `always`.
I modified `start.py` like this:

```python
signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)
logger.info("Waiting for signal on PID {}".format(os.getpid()))
siginfo = signal.sigwaitinfo({signal.SIGTERM, signal.SIGINT})
print("got %d from %d by user %d\n" % (siginfo.si_signo,
                                       siginfo.si_pid,
                                       siginfo.si_uid))
```

There is no sync done. It does not even start Tomcat. After some time (I cannot tell how much exactly, since the run was interrupted by putting the laptop to sleep over the course of a couple of days and the signal info print statement above does not give any time) it failed with:
…
I used the modified … Now, if I Ctrl-C the container running in the foreground, it will also produce the …

The lack of information about the signal sender can mean one of these things:
…

One of the things I'd like to try next (heh, this is getting a bit like a bug update in the corporate world, which I would like to avoid in this free-time project where no one should expect anything from the project owners) is to run …
When you run the image, maybe you can add this option: `--log-driver=journald`.
I just found this issue report after investigating stopped OpenGrok Docker containers for 2-3 days. Is there anything I can try to investigate?
When I tried to debug this last time, the tracepoints I used did not generate any events and the siginfo lacked data about what generated the signal. I'd like to retry with the idea from iovisor/bcc#3520 (comment). Maybe even publish a barebones Docker image that can reproduce the problem.
I still see this issue with the latest and master images.
@AdityaSriramM For me, version 1.6.9 (…) …
I can confirm this still exists in the latest image pulled from DockerHub. An image built from the current master version also got this. The Docker image I built got this error around 4 times in 8 hours in the beginning and then looks stable for 19 hours (`restart: always` in the docker-compose config).

```
# docker ps
CONTAINER ID   IMAGE                 COMMAND               CREATED        STATUS        PORTS                    NAMES
f2fd678d93ee   opengrok-dev:latest   "/scripts/start.py"   27 hours ago   Up 19 hours   0.0.0.0:8888->8080/tcp   opengrok-8888
# docker-compose logs | grep -ie "Connection reset by peer"
opengrok-8888 | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
opengrok-8888 | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
opengrok-8888 | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
opengrok-8888 | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
```

The error is the same signal 15 after the indexer finished:

```
opengrok-8888 | Jun 08, 2022 8:53:16 AM org.opengrok.indexer.util.Statistics logIt
opengrok-8888 | INFO: Done indexing data of all repositories (took 0:01:38)
opengrok-8888 | Jun 08, 2022 8:53:16 AM org.opengrok.indexer.util.Statistics logIt
opengrok-8888 | INFO: Indexer finished (took 0:01:38)
opengrok-8888 |
opengrok-8888 | Received signal 15
opengrok-8888 | Terminating Tomcat <Popen: returncode: None args: ['/usr/local/tomcat/bin/catalina.sh', 'run']>
opengrok-8888 | Sync done
opengrok-8888 | 08-Jun-2022 08:53:17.081 INFO [Thread-27] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8080"]
opengrok-8888 | 08-Jun-2022 08:53:17.084 INFO [Thread-27] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]
...
opengrok-8888 | 08-Jun-2022 08:53:17.296 INFO [Thread-27] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["http-nio-8080"]
opengrok-8888 | 08-Jun-2022 08:53:17.326 INFO [Thread-27] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["http-nio-8080"]
opengrok-8888 | Waiting for reindex to be triggered
opengrok-8888 | could not get configuration from web application: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
```
[Edit] I stand corrected. After running v1.7.14 longer, I hit the issue again, but it looks like this time there is no 'signal 15' in the log; the issue started from 'org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler' and then connection reset by peer.

My deployment hadn't been upgraded for around a year, but I removed the old image immediately after pulling the latest one, so I can't remember the last good version. I hit the error at 1.7.15, and it is stable at 1.7.14 for 17 hours.

```
# docker ps
CONTAINER ID   IMAGE                    COMMAND               CREATED        STATUS        PORTS                    NAMES
9fa1ad9664db   opengrok/docker:1.7.14   "/scripts/start.py"   17 hours ago   Up 17 hours   0.0.0.0:8888->8080/tcp   opengrok-8888
# docker ps
CONTAINER ID   IMAGE                COMMAND               CREATED        STATUS        PORTS                    NAMES
6bdaacef97bd   opengrok-dev:1.6.9   "/scripts/start.py"   27 hours ago   Up 24 hours   0.0.0.0:8888->8080/tcp   opengrok-8888
```
Is there any update on this issue?
I am also facing this issue. Is it possible to bump the priority of this issue, please? Almost at the end of indexing, I see the below log and the container gets restarted:
…
Unfortunately, I do not see any suggested workaround for this except downgrading to 1.6.9, which I really want to avoid. Is there any other workaround that has worked for anyone?
Isn't this issue a blocker for using the Docker version of OpenGrok?
Some open-source projects should probably have some sort of warning in this regard (something along the lines of "do not use this unless you commit to dedicating resources to the project"), esp. those that depend on a limited set of contributors with very limited resources and have no commercial support.

That said, I took another look at the problem, which seems to happen around the time the synchronization is done, at least in my case. What I see is this:
…
The SIGTERM was reported 8 times, suspiciously matching the number of worker Python processes spawned by … Running my …
…
Modulo the UTC offset, it matches the Docker container logs. Now, the output is clouded by the fact that the numbers in the PID column report PIDs from the global namespace while the TPID (target PID) is using the container PID namespace. When the sync was still running, I captured the process listing from within the container:
…
So, local PID 11 was the Tomcat process and 201-208 were the sync workers. PID 1 was the container entry program, `start.py`.

The exact mechanics of how this happens escape me. My hypothesis is that SIGTERM is used to communicate the end of the work for the pool workers. In the BPF output above, we can see that it is the global PID 906476 that is sending the signal to all the workers. While the sync was still running, I also captured the global process listing and it looked like this:
…
I believe the global PIDs 906726-906733 correspond to PIDs 201-208 in the container namespace. 906726 was the last worker that had some work. I think when the worker queue is depleted, the last worker sends SIGTERM to the remaining workers, as visible in the BPF output. However, … (lines 594 to 598 in 92ce674), which I think means that each worker has the handler installed. Given the teardown of the worker pool using SIGTERM and given the existence of the signal handler, the observed problem happens.
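To illustrate the mechanism described above, here is a minimal sketch (not the actual start.py code; the names are illustrative): the parent installs a cleanup handler, the forked pool workers inherit it, and the pool teardown, which terminates still-alive workers with SIGTERM, then runs that cleanup handler inside each worker.

```python
import multiprocessing
import os
import signal
import sys
import time


def cleanup(signum, frame):
    # In start.py this would be the place where Tomcat gets shut down;
    # here we only log which process the handler ran in.
    print("Received signal {} in PID {}".format(signum, os.getpid()), flush=True)
    sys.exit(0)


def work(item):
    time.sleep(0.2)
    return item


if __name__ == '__main__':
    # Intended for the main process only, but inherited by forked workers.
    signal.signal(signal.SIGTERM, cleanup)

    with multiprocessing.Pool(4) as pool:
        pool.map(work, range(4))
    # Leaving the 'with' block calls pool.terminate(), which sends SIGTERM
    # to the workers -- each of them then runs the inherited cleanup handler.
```

Running this prints a "Received signal 15 in PID …" line for each worker even though nothing outside the process ever sent a SIGTERM, which matches the spurious shutdowns observed in the container.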
@vladak thank you very much for solving this issue!
For people interested in the Python multiprocessing module, the Pool class, and how to close a pool of processes, this link (and the related pages) is interesting: …
Wouldn't it be worth adding a `pool.close()` and `pool.join()` in the `with` scope to ensure all tasks have a chance to finish cleanly?
The problem was with the signal handler being inherited by the workers vs. the signals used for the Pool's internal workings, at least in the particular CPython implementation.
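For anyone hitting a similar pattern in their own tooling, one common way to keep a parent-only handler out of the pool workers (a sketch under that assumption, not necessarily the exact fix that went into start.py) is to reset the signal disposition in the pool initializer:

```python
import multiprocessing
import signal


def worker_init():
    # Give the workers the default SIGTERM behaviour again so the parent's
    # cleanup handler never runs inside a worker during pool teardown.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)


def work(item):
    return item * 2


if __name__ == '__main__':
    # Parent-only cleanup handler.
    signal.signal(signal.SIGTERM, lambda signum, frame: print("parent cleanup"))

    with multiprocessing.Pool(4, initializer=worker_init) as pool:
        print(pool.map(work, range(8)))
```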
Thank you very much @vladak for fixing this. It will surely help a lot of users who love OpenGrok.
Describe the bug
I have a project, `jdk8-b120`, which is causing the OpenGrok server to shut down. I want to disable the sync and reindex for this project. Please find my `mirror.yml` below. Despite doing this, I still see reindexing getting triggered (screenshot below) for this project. What am I missing here?
Also, how can I find what is causing the shutdown? Can the suggester be the reason for the same?
I have used 1.4 for quite some time and never had any such issues.
Screenshots
