-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JRuby and TLS causing odd timeouts/reconnections? #142
Comments
Apologies if much of this reply gets duplicated. Given the multi-threading capabilities of Ruby, this gem surely does not see much multi-thread use. And it also does not get very much testing with JRuby. In theory I see nothing wrong with what you describe - including ACK from a thread other than the receiver. I know of no reason that should not work. I will assume you have checked AMQ logs for possible problems, yes? You use the term 'deadlock'. Do you mean just a 'hang' ? Other Questions: What version of jruby? I will help if I can. However, I do not think this is a stomp gem problem. Here is why: I just used Ruby 2.4.2p131 (2017-05-10 revision 58637) [x86_64-linux] (I build my own Ruby versions) and your sample code. I pumped 5000 messages through AMQ at 10 messages / second, message size about 100 bytes. Zero problems. So I think the gem is solid. JRuby is a different story. Using jruby 9.1.2.0 (2.3.0) 2016-05-26 7357c8f OpenJDK 64-Bit Server VM 25.131-b11 on 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 +jit [linux-x86_64] I cannot get that code to run at all. I get different results depending on if the connection is marked reliable or not. I will investigate this further, because I am now curious ....... JRuby support in the stomp gem has always been painful. Have you considered trying to use Ruby fibers for this effort? But frankly, if I wanted rich multi-threading/concurrency I would have gone with a client in Java, or go, or maybe even C++. |
I repeated the above 5000 message experiment with: jruby 1.7.25 (1.9.3p551) 2016-04-13 867cb81 on OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 +jit [linux-amd64] Zero problems. The latest JRuby 9.x version: jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 OpenJDK 64-Bit Server VM 25.131-b11 on 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 +jit [linux-x86_64] has the same erratic behavior as the 9.1.2.0 version. |
I now have your example code running fine with JRuby 9.x versions. Not your problem, but JRuby support is a pain. JRuby socket I/O does not behave like normal Ruby socket I/O (or even like C socket I/O). This gem has been changed multiple times to work around that. In any case, I see nothing wrong with the stomp gem at this point. And cannot recreate what you describe. If you will document further, I will explore further. But right now I am thinking there is something ..... unique about your particular environment that triggers this behavior. |
Thanks for checking into this. I'll see if I can't answer all the questions. tldr; I think my issue might be due to using TLS to connect to the broker. I will assume you have checked AMQ logs for possible problems, yes? You use the term 'deadlock'. Do you mean just a 'hang' ? What version of jruby? What version of stomp gem? What version of concurrent-ruby gem? What version of AMQ? What STOMP Protocol level? Also, I failed to mention that I am using TLS1.2 to connect to the broker with client certificates for auth. I wrote a little consumer similar to that posted above, and it is not behaving in the same way as the entire application (full hang), however it does have some issues which might or might not be related. First of all, I use the unless connread || @ssl
raise Stomp::Error::ReceiveTimeout unless IO.select([read_socket], nil, nil, @iosto)
end From my rudimentary understanding of the code it looks like the client is getting a message (or at least the start of one) and hanging on the The second issue with the simple client is similar to the above, but in a separate spot in the code. When I increase the prefetch size to 2 (or more), the client starts receiving (and acking) messages. Eventually it stops, however, and the last message that I see is unless connread || @ssl
raise Stomp::Error::ReceiveTimeout unless IO.select([read_socket], nil, nil, @iosto)
end The Ruby core docs indicate some differences with IO.select and IO-like objects, and specifically reference OpenSSL sockets with a bunch of caveats. I have no idea how JRuby implements this under the hood. In the full application (albeit a bit stripped down to help diagnose the STOMP issue), it seems to be hanging in a similar place (i.e. the Because all of the issues seem to be centered around the low-level socket stuffs, and I'm not using a "real" socket (OpenSSL), I'm inclined to believe that is the core problem. Let me know if I can do anything more to help debug. |
After using As for the issue with the simple client (i.e. various "socket" select timeouts), I don't have any more information from |
Great reports above. Many reports do not have detail like you provide. Question regarding TLS: There are four basic use cases. See the examples subdirectory. Which use case are you using? (Does not sound like dual-auth). The gem is full of JRuby hacks added over the years. I admit that. You can see that in the code. Need some refactoring efforts in that area I think. My problem. The real difficulty with basic socket I/O is not JRuby per se. It is Java implementation of sockets, which does not match normal C behavior. You can see all of those gem logic changes to try to attempt to support JRuby if you look at the code. This issue is complicated by SSL/TLS implementations. Even in C, SSL is a hack upon basic sockets. Hence the 'ssl' checks in the IO routines. And all is even more complicated by the fairly recent C libs use of epoll rather than select. The _is_ready kind of logic is a reflection of all that. If this turns out to be a real problem with your logging package .... please let me know. An aside: I had to laugh at your lack of access to AMQ logs. Management loves to say: build me a swimming pool deck. Only use wood and screws. And oh, by the way, you can only use a hammer to do this. I will continue to look at and think about this and experiment as time permits. Because something seems not quite right. I will not support broker specific headers like AMQ prefetch, I cannot. There are too many RabbitMQ users and these days Artemis (which has it own thorns). And IMO it is regrettable that Apollo pretty much died. It was the best of the open source brokers. |
Your above analysis of the code in the _transmit method is excellent. Knowing that you are using SSL was crucial for me to understand (and be able to reproduce) some of what you are reporting. As a consequence I do not think all you are seeing is a result of problems with the logger you are using. Because I can certainly recreate some of it when SSL sockets are used. Thanks for all this info. I will keep you posted on any upcoming changes. And please let me know further information about the logger. Have you opened an issue with them? |
Sorry I left out the SSL part in my initial report. I thought that it was the separate acking thread, but apparently there was more to it. Perhaps I should change the title of the issue to reflect our current understanding (i.e. Problems with JRuby, threading, logging, and SSL)? And I'm using SSL similar to what is in ssl_uc3.rb. My STOMP options look like the following: options = {
:hosts => [{
:login => 'me',
:host => server,
:port => 7230,
:ssl => ::Stomp::SSLParams.new({ :key_file => "private_key.pem", :cert_file => "cert.pem" }),
:parse_timeout => 10,
:sslctx_newparm => :TLSv1_2,
:connect_headers => { "accept-version" => "1.1", "host" => server }
}],
}
client = Stomp::Client.new(options) The "simple" client that I referenced above was stripped of the my entire app and limited to just requiring "concurrent" and "stomp". It is very similar to the code that I originally typed, with the inclusion of the above options hash. I have not yet opened an issue with the logging gem as I'm trying to get a small (retypeable) script to demonstrate the issue and narrow it down to the logging gem and not my stupidity. Thanks for your help, and I totally understand not supporting broker-specific extensions. |
Updating the title of this issue would be good I think. Please do so at your convenience. But - Use the term TLS, not SSL. Old names for old functionality are difficult to unlearn. And also ..... I would say not to mention logging unless you cross-ref this issue with one opened for the logging package. At this point, I do not think the SSL use case matters, just the fact of SSL yes/no. |
Unfortunately, I might be coming back around to my initial hypothesis (acks and receives in different threads causing problems). If you take the simple client (above) with TLS and I started analyzing the jstack trace after the deadlock occurs, and it says that their are two offending threads that are deadlocked: pool-1-thread-3 (apparently one of the thread pool processing/acking threads) and Ruby-0-Thread-3 (which looks like the STOMP receive thread). pool-1-thread-3 (process/ack thread) is deadlocked at sock.puts, when it is writing the ack message. It is waiting on a
Ruby-0-Thread-3 (STOMP recieve thread) deadlocks at a call to s.ready?, specifically in the
All that said, since just |
A couple of notes. There is no question that the gem has one or more problems in the ares you are attempting to use. Those problems may or may not be related to threading. I am not comfortable with that conclusion yet (but do not rule it out). The problems most certainly are related to low level socket I/O and or ssl and/or JRuby. I have a plan of attack here, but have not started writing code on that line yet. It is time for me to do some major rework in the receive routines to isolate IO#select, ssl, and jruby logic. At present there are so many small hacks in that area (good intentions over the years), that it makes the code very difficult to reason about. That effort will take a bit of time, and require extensive testing here. Changes like that have been known to have negative side effects in areas like heart beats and fail over. I cannot break fail over. The gem as a couple of corporate users with literally hundreds of brokers that use fail over extensively. You may see this triggered by a logger require, but that actually may not be anything but the trigger. I suppose they could be doing something fancy during load but will repeat: I can recreate some (not all) of what you describe with no logger. I can recreate the fast timeout failures with ease. I cannot recreate a hang/block (and your recent analysis would seem to indicate a true deadlock). However, I will attempt to run some experiments with that logger required, and observe the effects. For a 'real' Ruby, as best I recall, the check of @io.ready? is a check of ..... an ssl instance variable (@io), that instance variable being defined in Ruby's C code. I will go back and look at that a bit more / again. To see if it gives me any ideas. I have no idea how JRuby would 'emulate' this in Java. I would almost bet that checks of that instance variable are not thread safe. The Ruby developers have traditionally not had to worry much about multi-threading. Hmmmm ..... just looking at some C and Ruby code in the openssl extension. @io is the underlying 'IO Like Object' for the socket:
and I am thinking the #ready? method is provided by class IO. Notice that SSLObject is a wrapper. It does not inherit from any class. An interesting experiment you could run is something like this. In netio.rb:
One other odd question: your original example shows the use of a 'topic'. Do you really expect and want message semantics for topics (as opposed to queues)? (This is a curiosity question. I have observed that many developers ...... use topics when ..... they really want a queue). In any case, more to come on this. I will report back later with any news. I hope you did not need your code in production tomorrow. |
I ran the following experiment with my deadlock case (i.e. def ssl.ready?
STDOUT.puts "ssl.ready?"
@ssl_ready_lock ||= Mutex.new
@ssl_ready_lock.synchronize do
STDOUT.puts "ssl.ready? sync"
! @rbuffer.empty? || @io.ready?
end
end The deadlock happens right after the It took me a while to realize why a bare Given JRuby's marketing about "true" threading parallelism, I'd hope that their implementations would consider multi-threading use cases, but perhaps something was missed. Thanks for all the great work, and let me know if I can help any further. I'm fine running as we currently are (on MRI with limited parallelism), but would love to experiment more with JRuby when given the chance. We've also got some workarounds to limit the work done in our processing/acking thread which have been working decently. So yeah, no hurry. As for my use of topics vs queues, I have both. Production has a series of queues that are bridged to various topics (usually a 1:1 topic:queue relationship) to allow multiple consumers and all the other good things that come along with persistent queues (buffering when the app is down being the most useful). Development configuration just uses the topics, however, since we are just exercising the app consumption and don't require the benefits of queues. Also, the corporate ActiveMQ team gets really mad when the queues back up (even with queue limits), as they say that it stresses their hubs. We can deal with this in production, but I really don't want to have to deal with it every day in development as the queues would almost always be backed up. |
Thanks. Is it possible to show me your latest deadlock case code? I'd like to make sure I really understand what you are doing, that I am missing nothing. And code tells me that. |
Sure, although its re-typed from our internal network and a couple of things (server/login/key/topic names) are obviously changed: require 'concurrent'
require 'logging'
require 'stomp'
options = {
:hosts => [{
:login => 'me',
:host => 'myamqserver',
:port => 7230,
:ssl => ::Stomp::SSLParams.new({ :key_file => 'private_key.pem', :cert_file => 'cert.pem' }),
:parse_timeout => 10,
:sslctx_newparm => :TLSv1_2,
:connect_headers => { "accept-version" => "1.1", "host" => 'myamqserver' }
}],
}
thread_pool = Concurrent::FixedThreadPool.new(5)
client = Stomp.new(options)
client.subscribe('/topic/mytopic', :id => '1', :ack => 'client-individual') do |msg|
thread_pool.post do
begin
sleep(0.01) # simulate "work"
ensure
client.ack(msg)
end
end
end
puts 'subscribed'
loop do
sleep 10
puts '.'
end I'm using RVM to install JRuby, and I execute the code like
|
Thanks. I am current with your code now. I will do a few things different in testing. I do not use RVM. And will not. I cannot imagine who would care about your internal topic names. I will start to work on this over the weekend. If at some point I wanted to give you a gem build for smoke testing, would you be able to receive and test that? That code does not show AMQ 'prefetch'. Are you using that or not? It does change broker behavior. |
I don't think this makes any difference, but I would like to know: is your broker configured with needClientAuth=true or needClientAuth=false? |
I am somewhat at a loss as to how to proceed with your exact problem. I cannot recreate deadlock/hang/block of the environment. Even with the logger required. And many other odd experiments. I have to believe that something in your environment is triggering the behavior you see. If you have any suggestions, I am all ears. |
I have finally been able to recreate this on a sporadic basis. Recreation is not consistent. Maybe 20% of my tests will hang. All of the following seem to be need to be present for me to ever recreate:
When I can recreate, if I repeatedly press ^C, I see a number of messages that contain:
Which is an interesting message. What mutex? I am pretty much back to thinking that this is not a stomp gem problem. I will continue probing as time permits. |
Thanks for all your research into this. I can probably import and test a pre-release gem, but it won't be a very fast process due to corporate restrictions sigh. Due to the same restrictions, I am not positive how the broker is configured, but I am pretty sure that they have And I get a similar message when I ^C a deadlocked app, and was able to confirm the deadlock with jstack, namely running I am running by app on a decently provisioned "cloud" (OpenStack) VM, and the brokers seem to be adequately provisioned, however it is virtual, so I can't be sure what's actually going on underneath the VM. That said, I am currently processing over 250 messages/second with the MRI implementation and 10 threads on a similarly provisioned machine, and am hoping that I can increase that a bit with JRuby due to better parallelism (i.e. no GIL). In my test setup, I usually latch onto a topic that gives me about 5-10 messages/sec, and that seems to be enough to replicate the deadlock within a few (<20) seconds. Also, as another data point, I was able to get into the deadlock situation without I don't disagree that this sounds like a non-STOMP problem, although I'm not sure where to look further. JRuby? Unfortunately, the latest JRuby that I have internally is 9.0.3.0, but I'll see if I can't request a newer on in the next couple of days. |
Odds and ends here ........ Wow. Cool that you saw a reproduce without the logger. Even if it took some time. That is good information. I do not think a newer jruby will help. I can recreate the hangs with:
However, I do not see that 'waiting for mutex' message with 9.1.12.0. No information is produced with the use of ^C for that version. In the interests of full disclosure of what I have looked at:
Thinking about all of those issue reports at a high level, I more-or-less conclude:
Fortunately I do have access to my test broker. And to the logs produced. One interesting note (not surprising) is that: when the hang/block occurs, all of the ACKs have not made it to the broker. I very much wish that Ruby had a 'deadlock detection' mechanism as robust as the implementation in go (golang). This would be much easier to analyze. On average, how large are your messages (bytes)? |
I will add some comments before I get side tracked and/or just forget. After additional research and experiments I conclude that this deadlock/hang behavior does not exist for:
To experience this behavior the required environment is:
I can recreate the behavior without using either the discussed logger gem or the concurrency gem. That code is here: https://gist.github.com/gmallard/18108cac927b0d0704ca1c4cb95cc762 I had hoped that study and analysis would reveal a way to 'code around' this apparent JRuby problem. Unfortunately, given what I have observed during testing, I do not see a way the stomp gem can provide a 'code around'. (I remain open to suggestions of course). What follows is a high level description.
The '_transmit' method wraps a critical section using the 'transmit semaphore mutex'. The basic job of this critical section is to put a single STOMP message onto the outbound wire. This is performed in phases, basically:
As time proceeds, the threads created by the ACKer program compete for this critical section. At some point during processing an arbitrary thread will:
This eventually and naturally leads to all other threads being waiting for the transmit mutex, i.e. the deadlock. I hope the above description is reasonably clear. To reiterate: I do not see a way to code around this at present. @mlartz - my advice at this point is that you reconsider the use of SSL or JRuby or both. Also, I think this needs to be reported to the JRuby dev team, but my example(s) are too complex. Help me think about that please. |
I do not know why but this issue popped into my head recently, and I thought I would take another look. General observations: The deadlock/hang problem persists with: jruby 9.2.7.0 (2.5.3) 2019-04-09 8a269e3 OpenJDK 64-Bit Server VM 25.212-b03 on 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03 +jit [linux-x86_64] I can now recreate the problem with a normal Ruby environment. Ubuntu build: ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu] The problem appears specific to "topics". I can not recreate it using a 'queue'. I tried setting: OpenSSL::debug = true in hopes of getting more information. Unfortunately, that did not tell me anything new. One other note at this time: I can not recreate this problem with the Apollo broker or with Apache Artemis. And my local RabbitMQ is not set up for SSL. So the relevant environment seems to be:
|
I have updated the program I am using for testing and recreating this issue. That can be found here: https://gist.github.com/gmallard/0b1b6cc27d18c4c71b3649f5a3db409d Testing Procedure:
I have been testing with 5 receiver threads and 10,000 messages. That recreates the problem reliably. I discovered something during recent tests. Specifying an appropriate activemq.prefetchSize makes this problem totally disappear !! I just had success (using JRuby) with:
I can only conclude that this issue is triggered by what is fundamentally an AMQ configuration problem. |
Is it possible/recommended to ack a message from a worker thread? I've been using a STOMP server (corporate ActiveMQ) and client acks (
{ack: 'client-individual'}
) to provide reliable receipt and storage of messages. So I'm receiving a STOMP message, storing it to an external service, and then acking the message. I'd like to not block the receive thread and increase the concurrency of the storage requests. As such, I created a thread pool to do the storage of the message and then ack the message, which should provide the guarantee that all messages that are acked to STOMP are also stored.Unforutnately, I'm developing on a closed network, so the pseudo-code below is all constructed/simplified for this issue:
This seems to work fine in MRI. After a while, I needed to do a bit more work in the
store_to_external_service
, and MRI's parallelism was really lacking, so I attempted to switch to JRuby. When in JRuby, everything seems to chug along just fine to start, however the message receipt deadlocks after a few seconds, right after a@client.ack
. When I eliminate the thread pool (and just do the storing/acking inline), no deadlocks are observed. If I use even one storage thread (@thread_pool = Concurrent::FixedThreadPool.new(1)
), however, I get a deadlock after a few seconds of message receipt.So, ultimate question ... is message acking threadsafe? Is it OK to ack a message from a thread other than the receive thread? Or am I doing something else boneheaded wrt the worker thread(s)?
The text was updated successfully, but these errors were encountered: