FTL v54b4ad93 crashes sometimes (several times today) for unknown reasons #2112
Could you please run the debugging procedure? https://deploy-preview-338--pihole-docs.netlify.app/ftldns/debugging/ has all the details, including a special section on how to do it inside a Docker container.
I can see the same happening for both the nightly and development tags. This is what is shown after the mentioned debugging procedure inside the Docker container:
It doesn't look very helpful.
Seems you need to add
I have another with
Okay, so
This issue has been mentioned on Pi-hole Userspace. There might be relevant details there: https://discourse.pi-hole.net/t/pi-hole-stopped-responding/73931/4
Thanks for your patience. However, when also adding
This means there was no crash. Why FTL exited with exit code 01 will be found in
This is what the log contains:
Note that this happens on the current development image as well as on yesterday's and today's nightly images. The log above is from the latest nightly because, while it crashes, FTL recovers more quickly there; development does not and sometimes takes down the container.
E.g., the first line says the crash happened in the process with PID 3270, which is a fork of process 52. Unfortunately,
The first crash you have reported here happened in a normal thread of the main process. According to the link above, Linux supports setting
allowing
edit: Clarified above and suggested a different command to run.
might work, too, but I am not sure a system with a suspended main process is still healthy. It may have other consequences. Easiest would probably be waiting until we get a crash that is not in a fork, so
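For readers following along, the PID pattern discussed above comes from the usual fork-per-TCP-connection model of dnsmasq-based resolvers: UDP queries are answered in the main process, while each TCP query is handled by a short-lived fork. A minimal sketch of the pattern (illustrative only, not FTL's actual code):

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sketch: the main process answers UDP queries itself but forks a
 * short-lived child for every incoming TCP query. A crash inside such
 * a child is reported under the child's PID (e.g. 3270), not the main
 * process's PID (e.g. 52), and the child may be gone before a debugger
 * attached to the parent gets a chance to look at it. */
static void serve_tcp(int listen_fd)
{
    const int conn_fd = accept(listen_fd, NULL, NULL);
    if (conn_fd < 0)
        return;

    const pid_t pid = fork();
    if (pid == 0) {
        /* Child: handle exactly one TCP query, then exit. */
        /* ... read the query, assemble and send the reply ... */
        close(conn_fd);
        _exit(0);
    }

    /* Parent: close its copy of the connection and keep serving.
     * The child's later exit raises SIGCHLD in the parent. */
    close(conn_fd);
    if (pid > 0)
        waitpid(pid, NULL, WNOHANG);
}
```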
With
With
Maybe it's a good idea to create a development build that includes debug symbols just to have a valid stack trace output?
Here is "my" crash:
We already have the full set of debug symbols in the release builds; otherwise, you would not have seen the function names and related code lines in:
But this also shows that following forks/children won't work for us here. The location where you entered
@schuettecarsten Could you attach the debugger as well? This latest crash was in a fork, too; however, the first one (the very first post in this issue ticket) was in a normal thread of the main process (
Maybe here:
or
or
I'm not experienced with gdb. Thank you for your patience. If these backtraces are not helpful, I apologize.
Signal 17 is SIGCHLD, which simply means a child process (such as one of the TCP forks) exited. It is not a crash by itself.
@bermeitinger-b I don't think there is anything wrong with the network table, TBH; my current assumption is that the string handling somehow got broken and your database got populated with garbage data. This would likely explain the crash you are seeing, too. But I am still not sure how/where it happens. I have meanwhile manually inspected the entire string-processing code paths twice and did not find anything odd. Please also enable
Here is an excerpt after flushing the network table:
Yes, this brings us a lot closer. Please restart
you can try to search for the first occurrence of this strange string like
It'd also be helpful if you could run
and see if any errors regarding hash tables are reported in
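As background, a hash-table consistency check of this kind typically recomputes each stored string's hash and compares it against the value recorded at insertion time; a mismatch means the string's bytes changed after insertion, i.e. corruption rather than an ordinary collision. A hypothetical sketch (djb2 chosen arbitrarily; FTL's real check may differ):

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: recompute every stored string's hash and compare it with the
 * hash recorded when it was inserted. A mismatch means the string's
 * bytes were modified afterwards - corruption, not a collision. */
static uint32_t hash_djb2(const char *s)
{
    uint32_t h = 5381;
    while (*s != '\0')
        h = (h << 5) + h + (unsigned char)*s++;
    return h;
}

static unsigned verify_string_hashes(const char *const *strings,
                                     const uint32_t *recorded_hashes,
                                     size_t count)
{
    unsigned errors = 0;
    for (size_t i = 0; i < count; i++) {
        if (hash_djb2(strings[i]) != recorded_hashes[i]) {
            fprintf(stderr, "Hash error in string %zu\n", i);
            errors++;
        }
    }
    return errors;
}
```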
Thanks. I've restarted and it prints a lot of
Typically, this is an indicator that it's not a literal E. After many of these lines, it looks like the output changes:
About the hash collisions:
Seems okay; however, I'm not sure why it scans 100 clients. There should be only 3 (localhost, DoH, DoT).
Looking at the logs, we are in a bit of a chicken-and-egg problem, as the database already contains some broken data which then "contaminates" FTL's strings during the history import at startup. I'd suggest disabling database importing for the moment,
and then restarting again. You may also want to clean out the log file first to get rid of the binary stuff that is in there right now.
edit: The 100 clients come about because FTL detects clients based on their (string) IP address, and when the IP address is garbage, a new client is added for each new garbage string (see the sketch below).
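To make that edit concrete: when clients are keyed by their IP address as a string, every distinct garbage string necessarily shows up as a brand-new client. A simplified sketch of that effect (hypothetical names, not FTL's actual implementation):

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: clients keyed by their IP address string. Any string that
 * does not match an existing entry - including corrupted/garbage
 * strings - is treated as a brand-new client, which is how a 3-client
 * network can end up with ~100 client entries. */
struct client { char *ip; };

static struct client *clients = NULL;
static size_t num_clients = 0;

static size_t find_or_add_client(const char *ip)
{
    for (size_t i = 0; i < num_clients; i++)
        if (strcmp(clients[i].ip, ip) == 0)
            return i; /* known client */

    /* Unknown string -> new client, even if it is garbage. */
    clients = realloc(clients, (num_clients + 1) * sizeof(*clients));
    clients[num_clients].ip = strdup(ip);
    return num_clients++;
}
```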
I've deleted the old database and started fresh. It generated the following log:
The rest is the same as in the first logs above. As above, it did not restart automatically. The log file is not binary anymore, so the network table is looking correct.
Did you have the debugger attached?
Maybe:
and right after:
Then, the whole container dies.
Edit: After some time, the glyphs reappear in the network table. I don't have a backtrace for those but will check the log.
I don't think you're in a position to apologize ;) Thank you for your efforts. I've started valgrind. Within Docker, that's a bit complicated because stopping pihole-FTL stops the container. I've adjusted the pihole service definition:

```yaml
image: pihole:local
# ...
entrypoint: /bin/bash
tty: true
```

Then, I could start valgrind. The error log is here:
It indicates something at
pihole-FTL also dies with many
I hope this helps.
No, this is actually fine. There will be a lot of output, the log (or
I hope this leads somewhere. I can't really reproduce the state. This time, it crashed after 58 minutes and detached with error 33. I've removed the nostop settings for 33 (and 37).
And here is the valgrind.log: https://0x0.st/Xk7C.log. There is one binary line at line 55, which I've removed. (If you need it, it's here: https://0x0.st/Xk7F.bin)
Thanks. The important bit is
I did just push an update to the same debug branch (commit 26d1dd3), please re-build the container and 🤞
Thanks, the new commit resulted in probably the same message:
Full log: https://0x0.st/XdHX.log
This time, I couldn't
I'm sure it's the correct version. From FTL.log:
Too bad, so back to the drawing board. I did just push another update to the branch adding further debug output. It'd be good if you could update and try again. Once it crashes, we'd need the address the algorithm tried to access, like what you quoted from
Please ensure you still have
If you can attach both
edit: The expected commit would be b4f194b
This time, gdb was spammed with "Thread 4 event 33" messages, and gdb and valgrind kept the CPU so busy that I had to restart the server. So I could not run these gdb commands. These are the last 7 minutes in FTL.log: https://0x0.st/Xdbb.txt And this is the valgrind.log: https://0x0.st/XdbZ.log
Thanks. Unfortunately, this is another crash in another location without the "Invalid read of size 4" section, so we don't get the address it tried to look at. I pushed another commit (59c22db) that adds some extra out-of-bounds checking. Let's hope I'm not going down the wrong track here...
edit 1: It'll be interesting to see:
edit 2: I added another error message and updated the expected commit hash above
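A sketch of what such out-of-bounds checking can look like, assuming strings are referenced by offset into one large buffer (names are illustrative, not FTL's actual code):

```c
#include <stddef.h>
#include <stdio.h>

/* Sketch: strings live in one big buffer and are referenced by offset.
 * Validating the offset against the buffer's current size turns a
 * would-be SEGV_MAPERR into a loggable error. */
static const char *get_string_checked(const char *buffer,
                                      size_t buffer_size,
                                      size_t offset)
{
    if (offset >= buffer_size) {
        fprintf(stderr, "String offset %zu out of bounds (size %zu)\n",
                offset, buffer_size);
        return ""; /* fail soft instead of crashing */
    }
    return buffer + offset;
}
```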
Thanks. I think it didn't crash where it could be traced. I've removed the
(Then, it's dead.) valgrind.log: https://0x0.st/Xdbd.log These are the last lines in FTL.log: https://0x0.st/Xdbn.txt
Yeah, it shouldn't stop at SIG33. When
Okay, I think I have something new. With red, very dry eyes, I have been staring at your previous log and found the following interesting lines:
Okay ... so it tried to access memory at
I am actually quite confident that adding the missing remapping instructions (commit 341464f) fixes the bug we are seeing here. It also explains why it mostly happens in forks and why I haven't seen it myself before (my network is simply too quiet/not enough queries).
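For readers following along: the failure mode is that one process grows a shared-memory object, while another process (often a TCP fork) still holds the old, smaller mapping, so any access past the old size lands on unmapped pages (SEGV_MAPERR). A minimal sketch of the detect-and-remap idea under those assumptions (illustrative only, not FTL's exact code):

```c
#include <stddef.h>
#include <sys/mman.h>

/* Sketch: a shared-memory object that any process may grow. A version
 * counter stored in the object's header lets every other process
 * notice the resize and refresh its own mapping before touching the
 * data. Missing this remap step in even one code path leaves that
 * process with a stale, too-small mapping. */
struct shm_header {
    size_t size;       /* current size of the object */
    unsigned version;  /* bumped on every resize */
};

struct local_view {
    int fd;                  /* from shm_open() */
    void *addr;              /* this process's mapping */
    size_t mapped_size;      /* size of our mapping */
    unsigned mapped_version; /* version we last mapped */
};

/* Call under the shared lock before dereferencing anything. The header
 * sits in the first bytes of the object, so even a stale (smaller)
 * mapping can still read it safely. */
static int remap_if_resized(struct local_view *v)
{
    const struct shm_header *hdr = v->addr;
    if (hdr->version == v->mapped_version)
        return 0; /* mapping is up to date */

    const size_t new_size = hdr->size;
    const unsigned new_version = hdr->version;
    munmap(v->addr, v->mapped_size);
    v->addr = mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, v->fd, 0);
    if (v->addr == MAP_FAILED)
        return -1;
    v->mapped_size = new_size;
    v->mapped_version = new_version;
    return 0;
}
```

Under this reading, the fix amounts to making sure a check like this runs in every code path (forks included) before shared data is dereferenced.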
I take my hat off to you while bowing with respect. You did it. 👍 With this branch, it has been running for 10 hours without a crash. Thank you very much.
I just looked at your code. Is this code thread-safe? Or is thread safety not an issue because of a global "lock" somewhere?
Thank you for confirming the fix! I have now split this into two branches: the one you are currently on (
I did that to ease review, as these are basically two independent changes which should result in two independent pull requests (#2120 and #2121). I also removed a few of the commits adding more debug output which we didn't need in the end. No need to keep "dead" code around.
We take care of only accessing these global memory sections when a global (shared-memory) lock is held (
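For anyone curious how such a cross-process "global lock" is commonly built: a POSIX mutex can live inside the shared memory itself if it is initialized as process-shared. A minimal sketch (assuming POSIX threads; names are illustrative, not FTL's actual API):

```c
#include <pthread.h>

/* Sketch: a mutex placed inside the shared-memory region and marked
 * PTHREAD_PROCESS_SHARED serializes access across *processes* (and
 * their forks), not merely across threads of a single process. */
static void shared_lock_init(pthread_mutex_t *lock /* lives in shm */)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

/* Every access to the shared tables is bracketed like this; the
 * resize-and-remap check from the earlier sketch also belongs inside
 * the locked section. */
static void with_shared_lock(pthread_mutex_t *lock, void (*work)(void))
{
    pthread_mutex_lock(lock);
    work();
    pthread_mutex_unlock(lock);
}
```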
The bug should be fixed in
I can confirm no crash after the patch. Thank you!
Hello, I appreciate the effort that's gone into Pi-hole, but I am facing ongoing issues with this crashing problem that seems unresolved despite your statement that it has been resolved. Unfortunately, the frequent crashes persist, occurring every 30-60 minutes with the same error: Segmentation Fault (SEGV_MAPERR).

What I Think Might Be Happening

Based on my observations and investigations, primarily examining different logs, I believe the crashes are related to how Pi-hole handles large DNS responses. When a DNS reply is too large to fit into a single UDP packet, it gets truncated, resulting in a "reply is truncated" log entry. Immediately after this "reply is truncated", Pi-hole crashes.

My educated technical guess is that there's a bug in Pi-hole's handling of these truncated responses. It might not be correctly checking the size of incoming data, or it is trying to access memory that isn't allocated, leading to a segmentation fault with the SEGV_MAPERR error (address not mapped to object). Essentially, Pi-hole could be attempting to process these truncated replies incorrectly, causing the crashes.

As mentioned, it's quite possible that when this happens, Pi-hole doesn't necessarily need to crash. Instead, it could gracefully handle the DNS error/truncated response and either provide the client with what it has and drop the rest of the data, or at worst drop the affected packets entirely to avoid crashing. If it's a decision between crashing and dropping the DNS response, then dropping it would be the better option. This doesn't seem to be a valid reason for Pi-hole to crash.

With improved error handling in critical areas, we could avoid spending so much time debugging issues and tracking down crash sources. While I don't use the same programming language as you, I always consider potential error scenarios in my projects, particularly where I don't control the responses (like DNS responses). In such cases, I sometimes place try-catch blocks around entire functions or code sections, and I decide on a course of action that incorporates one or more of the following: discard the data when an exception is caught or something is null or out of bounds, retry x times, assume a default value, log the issue, and move on. The internet might go down, a cable might get unplugged, or a process might die due to Docker resource limits. Instead of crashing, there must be a more graceful way to handle critical Pi-hole functions, such as retrying a DNS request if there's an error processing it (which I believe is the issue here) or providing the client only what it has.

I don't know the circumstances, nor am I familiar with the code behind Pi-hole, so maybe my understanding of how it handles crashes is incomplete; I only offer my professional opinion here. The error reports are attached at the bottom of my comment.

While I'm here, I also wanted to address some other concerns about my experience with Pi-hole:
Thank you for your understanding and continued efforts to improve Pi-hole. I hope these points can be addressed. Here are the relevant error logs for this particular issue:
5 MINUTES LATER:
50 MINUTES LATER:
As you said, you already have it (but did not include it): please provide the backtrace generated by
What I actually said, or meant to say if my meaning was unclear, is that due to time constraints I haven't gone through the above instructions for installing and setting up the debug tools yet. This is why I haven't included the gdb log. If I had the logs, I would have attached them to my comment; I certainly wouldn't leave them out on purpose and waste your time.

Related, Or Are Dev Images Just Really Unstable?

I'm not sure if you're suggesting that the dev image includes these tools, but I haven't found any like that on this Alpine image. It's frustrating that the dev images of Pi-hole lack any symbols, backtracing, debug tools, or troubleshooting scripts. Also, right out of the box, the gravity DB was corrupted, or corrupted itself within the first hour, without any block lists added on the first day (error -2 on the home screen). I had to follow some commands from a forum post to delete the DB and recreate it. The pihole repair commands did not work either. Just throwing it all out there.

There isn't even a way to restart the Pi-hole service/process without restarting the entire container, and the "reboot" command is missing as well, which I thought came standard with BusyBox. Many Docker containers have the reboot command.

I also completely understand that, naturally, because these are dev images, there may be things broken. I'm not expecting it to be perfect, but given it already went through a beta-testing phase, I'm quite surprised by the number of issues I've encountered with Pi-hole within the first few days. That's actually the reason for my posts: to share information about what I'm encountering, in the hope that these bugs can be fixed soon.

I'll need to set aside some time tomorrow or the next day to take down the Docker container and follow the above instructions for installing and configuring the debug features and tools. Unfortunately, I don't have that time at the moment. I can type at lightning speed though; it's a talent of mine. If the information and logs I've provided aren't sufficient to pinpoint the problem, I'll try to get the debug environment set up this weekend.

Based on what I've found, you shouldn't have to go through much code. This issue occurs when the log entry "response is truncated" is written to the DNS log, right after receiving notification from the DNS server that its response to Pi-hole has been truncated. Presumably the DNS reply is too large for a UDP packet. Wherever that event is handled, that's where the crash occurs. The code block handling the "response is truncated" event can't be more than 100 lines? Which even seems substantial for handling a DNS response and making one log entry. But perhaps I don't fully understand the complexity of Pi-hole's event handling. Which leads me to:

Reiterated: Consider Adding Backtracing, Symbols, & Debugging to Pi-hole Dev

Here is where I'd like to reiterate: please consider including the debug tools, symbols, backtracing, etc., in the development-branch builds of the Docker images, or provide an easy way to enable debugging options. An option like "pihole-ftl --enable-debugging" that either enables backtracing, symbols, and tools, or fetches and sets them up, would be much preferable to manually downloading and installing them. Having to take down the container to add permissions and commands and then rebuild it makes me nervous, as I risk losing everything in it.

Taking down or editing containers has sometimes taken me whole days, or even multiple days when things have gone wrong, and in the past I've lost everything I set up. Since then, I've learned you can save images of your changes. I admit that while I am very technically inclined, with over a decade of software-repair experience, Docker is still very new to me, so my understanding and handling of it may not be optimal.

Reiterated: My Plea, Please Consider Bringing Back the Update Command in Pi-hole Docker Images

Finally, I'm going to reiterate my stance and plea here. Please hear me out. The decision to disable the update functionality in pihole-FTL just because it's Docker, and re-pulling an image is "the Docker way", makes updating Pi-hole incredibly frustrating. That works for some things, but it doesn't have to be "the way". Reconfiguring, reinstalling, and setting up my environment again is a significant time investment. For example, on Pi-hole I have many tools and scripts I install: ssh, network and IP troubleshooting tools, Cloudflared, dnscrypt, log-rules automation. All that in addition to custom scripts for things such as IPv6 static addresses and static routes. I have my own static /48 block of IPv6 from my ISP, but it's not SLAAC-compatible; I have to set up the routes myself, and Docker doesn't support that. I know some might argue that a Docker container should only contain Pi-hole, but that's not practical. Even your official docs describe installing and setting up cloudflared, which isn't included in the images. Enabling the update functionality would save users considerable time, effort, and frustration. If you must, you could put the update feature in Docker images behind a disclaimer or a labs option; that would certainly be far better than the current option of having to tear down the container.

Request for Feedback Reply

I hope the Pi-hole devs take the time to fully read my comments and feedback and consider my opinions/suggestions. Let me know if you need me to provide any additional information. If whoever is in charge of these decisions could answer whether there are specific reasons these actually can't/won't be done, that would be helpful to me. I will still set up the debugging tools, hopefully this weekend, if the provided error information and logs aren't sufficient to find the bug. However, as of right now, the Pi-hole dev image is not usable for me. It crashes far too much, 30+ times a day, sometimes within 20 minutes of pulling the Docker image. This is the build from Dec 22, 2024. I'm surprised it's working for anyone else, but perhaps it has to do with using Cloudflared as the DNS provider. Maybe it's the way Cloudflared is providing its responses, but a DNS response packet doesn't change drastically from one server to the next, so that doesn't make a lot of sense either.
You do not need to rebuild the container or anything. The debug symbols are already included. The only bits that are missing are
Everything you need is described briefly in https://deploy-preview-338--pihole-docs.netlify.app/ftldns/gdb
It would have really taken you much less time to follow the few steps than to write such a long reply. We will make sure we actually discuss adding
As you said, it is probably due to your particular installation. Looking only at our Discourse forum, it seems that many are using it, yet this is the first time we have heard about this. It may be an issue in Pi-hole, but it may also very well be an issue in the embedded
I'll defer most of your other questions to our Docker experts; this is not my primary area of expertise. @pi-hole/docker-maintainers
These instructions are not meant for Docker installations but for "bare metal" installs on the machine itself.
I've referenced the response in a new issue on the docker repo so as not to spam the participants of this closed thread with discussion irrelevant to them: pi-hole/docker-pi-hole#1676
Is there anything you can provide to help us possibly duplicate your setup? Or a packet capture of the 'large UDP packets' you mention in the linked discussion?
FTL crashed for unknown reasons several times today.
Here is a full log: