Hi all,
I just had an issue in prod where one of our Java servers got permanently stuck with the stack below:
"reconnect" - Thread t@237
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    - locked <10c23f69> (a java.io.BufferedInputStream)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    - locked <21231510> (a java.io.InputStreamReader)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    - locked <21231510> (a java.io.InputStreamReader)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at io.nats.client.ConnectionImpl.readLine(ConnectionImpl.java:1172)
    at io.nats.client.ConnectionImpl.readOp(ConnectionImpl.java:1236)
    at io.nats.client.ConnectionImpl.processExpectedInfo(ConnectionImpl.java:709)
    at io.nats.client.ConnectionImpl.processConnectInit(ConnectionImpl.java:667)
    at io.nats.client.ConnectionImpl.doReconnect(ConnectionImpl.java:1007)
    at io.nats.client.ConnectionImpl$3.run(ConnectionImpl.java:846)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    at io.nats.client.NatsThread.run(NatsThread.java:50)
(Note: this is not the latest version of the Java NATS client.)
I believe the issue was caused by a temporary network partition on our infrastructure; however, the client lib failed to recover. I suspect the NATS server closed the connection after not hearing from the client for a while, but the TCP FIN never arrived. As the client did no further writes, it never received an RST either, so it never got an IOException and just sat forever in the socket read call.
I believe this would normally be addressed with a socket read timeout (e.g. setSoTimeout), but I do not see one in my current client lib source, nor in the later versions on GitHub.
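To illustrate what I mean, here is a minimal standalone sketch of a read timeout. It is not taken from the NATS client: the class `ReadTimeoutSketch` and the helper `readTimesOut` are made-up names, and the loopback `ServerSocket` that never sends anything is just a stand-in for a peer that has gone silent after a partition.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

class ReadTimeoutSketch {

    // Hypothetical helper (not NATS code): connects to a peer that never sends
    // anything and reports whether readLine() gave up after timeoutMillis.
    static boolean readTimesOut(int timeoutMillis) {
        // The ServerSocket never calls accept() or writes. The OS still completes
        // the TCP handshake, so the client sees a connected but silent peer.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort())) {

            // A timeout of 0 (the default) means reads block indefinitely.
            client.setSoTimeout(timeoutMillis);

            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(client.getInputStream(), StandardCharsets.US_ASCII));
            try {
                reader.readLine();   // same kind of blocking call as in the stack above
                return false;        // data or EOF arrived before the timeout
            } catch (SocketTimeoutException e) {
                return true;         // read gave up; a caller could close and reconnect here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With the timeout set, the blocked read throws `SocketTimeoutException` instead of waiting forever, which would presumably give the reconnect logic a chance to close the socket and try again.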
Does anyone have any comments on this? Do we know if such timeouts are present in client libs for other languages? It would seem that without a timeout, a network partition such as this one can permanently stall a client, even after the partition heals.
Producing a repro of this is probably not going to be easy, but it might be possible with some sort of packet filter (e.g. ipfw) tests.
Many thanks!