[version3] Deadlock on node closing #417
Comments
I don't get this, our CI builds use 1.57. |
Building with a gcc version later than 5.x produces boost warnings that are treated as errors (a known boost bug, corrected in later versions). A workaround is to select gcc-5, for example: CC=gcc-5 CXX=g++-5 ./install.sh [args] |
Seems like a good reason to bump our boost minimum. |
The root cause appears to be that |
The problem is that the guard implementation is flawed. The |
@fpelliccioni Thanks for troubleshooting and posting this! Please re-verify once this is merged. |
On a related note, is there a max version of boost that is acceptable for upgrade? Specifically, what restrictions are there keeping us from the latest version (1.67.0 as of now) assuming proper build and smoke testing for now? |
There’s no strong reason for any particular version. The rationale has been that a lower version reduces the need to install. But the reality of boost install variation means this isn’t so important. As long as there is a NuGet package and all builds are good I don’t see a problem with a higher version. |
Boost has been upgraded in master #420 |
Status? |
I am on vacation right now. I will check it again in 3 weeks and
let you know.
|
Ok, have fun! |
Status? |
I will test it again and I will let you know.
Give me a couple of hours.
Regards!
|
The deadlock is still happening. @garceri (Gerardo) told me that he also experiences these deadlocks every day when he works on libbitcoin community servers. |
Odd that it is apparently common for you but that @thecodefactory has not seen it. |
What does this mean? |
Gerardo is our DevOps engineer. He regularly configures environments and has reported experiencing the deadlock many times. |
It is easy to reproduce with the instructions I gave. |
Ok, thanks. To eliminate the possibility that it's script related (to ensure python I/O deadlocks are not introduced), you have verified that this is easily reproducible by manually starting/stopping a version3 server? |
Gerardo has verified it manually. (I have verified it using the script). |
How is it that he is working so frequently with libbitcoin server vs. bitprim server? |
I have been doing some tests with libbitcoin recently and found that if you increase the number of connections (both incoming and outgoing) that seems to start causing deadlocks. |
I’ve run thousands of connections and have never seen it. If I can’t trap it in a debugger it’s unlikely to be found. |
Which values are you using? I'll try the same. |
You can use my script and then attach to the process. |
I could do it, but not today... |
The backtrace can help, but if run from the script, it's not guaranteed to be a deadlock in the server. Once attached, it's stopped and inspecting the state might not tell much. I'm not saying it can't happen, but I'm trying to find a sane way to verify this. I've been mind-numbingly starting/stopping a latest v3 bn server manually over and over while looking at other things. No shutdown errors so far. |
It is possible that running in Debug mode the error does not manifest. |
Do you work only in Windows? |
I use GNU/Linux. |
It’s unlikely to differ in debug vs. release builds if a debugger is not attached. But I run both and have never seen it. |
@fpelliccioni @garceri If you have access to GNU/Linux, see if you can reproduce it with this script (which is hacky, but more or less the same idea as yours, but eliminates the possibility of python pipe I/O deadlocks):
I've had this running the past ~30 minutes or so. |
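For readers following along, here is a minimal sketch of the kind of start/stop loop being discussed. It is not the script posted in this thread: `run_cycle` and `stress` are hypothetical names, and the path to `bn` and the timings are assumptions. The child's output goes to `DEVNULL` instead of a pipe, so a full pipe buffer cannot be mistaken for a shutdown hang.

```python
import signal
import subprocess
import sys
import time


def run_cycle(cmd, run_seconds=5.0, exit_timeout=60.0, kill_on_hang=True):
    """Start cmd, let it run, send SIGINT, and report whether it exited in time."""
    # No pipes are used: output is discarded, so the child can never
    # block on a full stdout/stderr buffer.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    time.sleep(run_seconds)
    proc.send_signal(signal.SIGINT)  # the same signal Ctrl-C delivers
    try:
        proc.wait(timeout=exit_timeout)
        return True
    except subprocess.TimeoutExpired:
        print(f"pid {proc.pid} did not exit after SIGINT", file=sys.stderr)
        if kill_on_hang:
            proc.kill()
            proc.wait()
        # With kill_on_hang=False the process is left alive so that a
        # debugger can be attached to the reported pid.
        return False


def stress(cmd, cycles=100, **kwargs):
    """Repeat run_cycle; return the 1-based cycle that hung, or None."""
    for i in range(1, cycles + 1):
        if not run_cycle(cmd, **kwargs):
            return i
    return None


# Hypothetical usage (the path and values are assumptions):
# stress(["./bn"], cycles=1000, run_seconds=10, kill_on_hang=False)
```

A timeout only shows that the process had not exited within `exit_timeout`. Whether that is a true deadlock or just a slow shutdown still needs backtraces to confirm.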
I am running it. |
Deadlock using the bash script.
|
Interesting, leave it running to see if it resolves, but that's definitely a longer time for exit than I've seen by far. I've been running mine since I said earlier (~6+ hours ago), still without any issue. If it runs longer and you're able to attach to it, the backtraces could help. |
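One way to collect the backtraces mentioned above, sketched under the assumption that gdb is installed and the pid of the hung process is known. The helper names are hypothetical. The gdb options used (`-p`, `-batch`, `-ex`) and the `thread apply all bt` command are standard gdb.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that prints every thread's backtrace."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid, timeout=120):
    """Attach gdb to pid in batch mode and return its output as text."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


# Hypothetical usage, with the pid of a bn process that did not exit:
# print(dump_backtraces(12345))
```

On Ubuntu, attaching to a process that is not a child of the debugger may require root because of the default ptrace restrictions. Threads blocked on a mutex or condition variable in the output are the ones to compare across machines.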
Can we say that this is a deadlock? |
It appears we are reproducing what we have seen anecdotally. It happens on bitprim builds/machines but not others. |
Ok, but it is a normal machine with Ubuntu Linux, and the build is done using I will talk with @jujumax and @garceri to get you access to one of those machines. |
Send a |
The fact that it's ubuntu does concern me. There was a reported bug (on slack) that causes hangs if you don't build boost statically, specifically on ubuntu. I was able to reproduce it on a VM, but haven't followed up on that yet. I also haven't re-tested since we've upgraded boost versions. Can you confirm that you've built your version statically? EDIT: And yes, the hangs are on shutdown (reported as when ctrl-c was pressed). |
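A quick way to answer the static-linking question is to look at what the binary loads dynamically. The sketch below is an assumption about how one might check, not a procedure from this thread: the helper names are hypothetical, and it relies only on the standard Linux `ldd` tool.

```python
import subprocess


def shared_boost_libs(ldd_output):
    """Return the boost shared libraries named in ldd output."""
    libs = []
    for line in ldd_output.splitlines():
        fields = line.split()
        # The first field of each ldd line is the library name.
        if fields and fields[0].startswith("libboost_"):
            libs.append(fields[0])
    return libs


def check_binary(path):
    """Run ldd on the binary at path and list its boost shared libraries."""
    out = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=True
    ).stdout
    return shared_boost_libs(out)


# Hypothetical usage (the path is an assumption):
# print(check_binary("./bn"))
```

An empty list is consistent with boost having been linked statically. Any `libboost_*` entries mean at least part of boost is loaded as a shared library.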
I ran the
I will try using |
@thecodefactory do you have a repro? |
No, I don't (at least not yet, using the script). I stopped running it after a couple of days, but could try again for a bit longer. It might be environment related, so it might be better to try it on my Ubuntu VM to see if it's more likely there. |
No luck reproducing so far, running all day on my Ubuntu 16.04.4 LTS VM.
It's possible something is manifesting from gcc version differences? |
Several days into running the script on the Ubuntu VM, I got a hang on shutdown. @fpelliccioni Were you ever able to attach so that we can compare notes here?
|
I still have not ever seen this. Note also: libbitcoin/libbitcoin-server#360 |
@thecodefactory How many peers were configured for the above trace? |
Good question, it's been a while so I don't recall. I had archived the VM it was running on, but will try to restore it to find out. |
Ok, it was a v3 run using the above shell script with a testnet configuration. 0 inbound, 2 outbound, 1 manual peer. I'll update and rebuild the v3 code and see if I can get this test re-started with latest. |
Great, 3 was the right answer. |
(Posted here at Eric's suggestion.)
Hello.
I am experiencing a deadlock when I try to close the node (bn) with Ctrl-C (SIGINT, SIGTERM).
To get the bn executable I follow:
Using the following configuration file:
... and I have the following Python script to run the node multiple times:
Node chain initialization and script running:
(The Python script has to be in the same directory as the bn executable.)
Thanks and regards,
Fernando.