Socket timeouts? #588

tul · 2022-02-01T18:15:49Z

Hi all,
I just had an issue in prod where one of our Java servers got permanently stuck with the stack below:

"reconnect" - Thread t@237
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <10c23f69> (a java.io.BufferedInputStream)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        - locked <21231510> (a java.io.InputStreamReader)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        - locked <21231510> (a java.io.InputStreamReader)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at io.nats.client.ConnectionImpl.readLine(ConnectionImpl.java:1172)
        at io.nats.client.ConnectionImpl.readOp(ConnectionImpl.java:1236)
        at io.nats.client.ConnectionImpl.processExpectedInfo(ConnectionImpl.java:709)
        at io.nats.client.ConnectionImpl.processConnectInit(ConnectionImpl.java:667)
        at io.nats.client.ConnectionImpl.doReconnect(ConnectionImpl.java:1007)
        at io.nats.client.ConnectionImpl$3.run(ConnectionImpl.java:846)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at io.nats.client.NatsThread.run(NatsThread.java:50)

(Note: this is not using the latest version of the Java NATS client.)

I believe the issue was triggered by a temporary network partition in our infrastructure, and the client library failed to recover from it. I suspect the NATS server closed the connection after not hearing from the client for a while, but the TCP FIN never reached the client. Because the client did no further writes, it never received an RST either, so it never got an IOException and just sat forever in the socket read call.

I believe this would normally be addressed with a socket read timeout (e.g. setSoTimeout), but I do not see one in the source of my current client library, nor in the later versions on GitHub.
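To illustrate what I mean by a read timeout, here is a minimal standalone sketch. It is not taken from the NATS client; the class and method names are made up for illustration. It assumes a local "silent" peer (a listening socket that never sends anything) is a fair stand-in for a server we can no longer hear from. With SO_TIMEOUT set, the blocked readLine gives up with a SocketTimeoutException instead of waiting forever.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any NATS client library.
public class ReadTimeoutSketch {

    /**
     * Connects to a local listener that never sends any data, then tries to
     * read one line with a read timeout set. Returns true if the read gave up
     * with a SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        // Port 0 picks any free port. The TCP handshake completes through the
        // listen backlog even though accept() is never called, so the peer
        // stays connected but silent.
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort())) {

            // The default of 0 means "block forever"; any positive value bounds each read.
            client.setSoTimeout(timeoutMillis);

            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
            try {
                reader.readLine(); // without the timeout this call would never return
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the caller can now close the socket and try another server
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(500));
    }
}
```

In a client library the timeout would presumably need to be longer than the ping interval, so that an idle but healthy connection is not mistaken for a dead one.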

Does anyone have any comments on this? Do we know whether such timeouts are present in the client libraries for other languages? It seems that without a timeout we are vulnerable to network partitions like this one permanently stalling a client, even after the partition heals.

Producing a repro of this is probably not going to be easy, but it might be possible with tests that use some sort of packet filter (e.g. ipfw).
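Short of a real packet-filter test, the symptom itself (a thread parked in a socket read with no timeout) can be approximated locally. The sketch below is only that: an approximation with made-up names, assuming that a peer which stays connected but never sends data looks the same to the reader as a partition where the FIN was lost. It starts a reader thread with no SO_TIMEOUT, checks that the thread is still blocked after a wait, and then closes the socket from another thread to release it.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical repro sketch, not part of any NATS client library.
public class StuckReadRepro {

    /**
     * Returns true if a reader thread with no read timeout is still blocked
     * in read() after waitMillis.
     */
    static boolean readerStillBlockedAfter(long waitMillis) throws Exception {
        // The listener never calls accept() and never writes, so the client
        // has an established connection to a peer that stays silent.
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort())) {

            Thread reader = new Thread(() -> {
                try {
                    client.getInputStream().read(); // SO_TIMEOUT is 0, so this blocks indefinitely
                } catch (IOException ignored) {
                    // Reached once the socket is closed from the other thread.
                }
            }, "reconnect-sketch");
            reader.setDaemon(true);
            reader.start();

            reader.join(waitMillis);            // give the read a chance to return on its own
            boolean blocked = reader.isAlive(); // still alive means still stuck in read()

            client.close();    // closing the socket makes the blocked read throw
            reader.join(2000); // wait for the reader to unwind
            return blocked;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("reader still blocked: " + readerStillBlockedAfter(1000));
    }
}
```

This does not exercise the lost FIN / no RST path, so a packet-filter test would still be needed to confirm the actual cause.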

Many thanks!

Replies: 0 comments
