
Troubleshoot zookeeper connection failures #310

Open
wants to merge 1 commit into base: static-props-some-day

Conversation

@solsson commented Mar 8, 2020

I've experimented with a 3-kafka, 5-zk cluster on a 3-node GKE e2-small cluster, i.e. just enough resources to run the Kafka stack together with some basic infra. During tests I occasionally get kafka pods restarting due to failures to connect to zookeeper. Clients like the kafka-topics CLI also had trouble connecting. There were no zookeeper restarts and no OOMKilled containers. See the first commit comment for some stack traces.

I'm guessing it could be due to throttling.
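
One way to check whether CPU throttling is actually happening on these shared-core nodes (a sketch; the namespace and pod names are assumptions, and the path assumes cgroup v1):

# sample the CFS throttling counters inside a zookeeper pod, twice with some delay
kubectl -n kafka exec zoo-0 -- cat /sys/fs/cgroup/cpu/cpu.stat
# growing nr_throttled / throttled_time between samples would point at throttling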

Kafka for example reports:
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING

And zookeeper logs include things like:
java.io.EOFException
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:397)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1206)

[2020-03-06 12:35:45,056] WARN Connection broken for id 1, my id = 4, error =  (org.apache.zookeeper.server.quorum.QuorumCnxManager)
java.net.SocketException: Socket closed
        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
        at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1206)

I've been able to resolve it twice with this memory limit increase, but of course it could also be that the restart itself resolves the issue.
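
As a quick way to try a similar bump without editing manifests, something like the following could be used (StatefulSet name, namespace and value are placeholders; the actual change is in the commit of this PR):

kubectl -n kafka patch statefulset zoo --type json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "512Mi"}]'
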
solsson added a commit that referenced this pull request Apr 23, 2020
through the service. This helps troubleshoot issues like #310 by pointing out (by podIP) which actual zookeeper connection failed. Also I like the simplification.
@solsson commented Apr 25, 2020

The issue is present in clusters with preemptible nodes as well. I just found https://issues.apache.org/jira/browse/ZOOKEEPER-2938 with fix https://issues.apache.org/jira/browse/ZOOKEEPER-2164, so on the next failure I'll try deleting only the leader. Until now I've scaled down and up again.
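
A sketch of how that could be done (namespace and pod names are assumptions, and it assumes nc is available in the image and that srvr is allowed by 4lw.commands.whitelist):

# report each server's role, then delete only the one that says "Mode: leader"
for p in zoo-0 zoo-1 zoo-2 zoo-3 zoo-4; do
  echo -n "$p: "
  kubectl -n kafka exec "$p" -- sh -c 'echo srvr | nc localhost 2181' | grep Mode
done
kubectl -n kafka delete pod <the-pod-that-reported-leader>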

@solsson commented May 2, 2020

https://zookeeper.apache.org/doc/r3.6.1/releasenotes.html contains:

ZOOKEEPER-2164 - fast leader election keeps failing
ZOOKEEPER-3769 - fast leader election does not end if leader is taken down
ZOOKEEPER-3776 - Cluster stuck not forming up quorum

They might not be the cause of the timeouts, but they definitely make stabilization difficult on ephemeral/shared-core nodes.

@solsson commented May 7, 2020

After testing both Zookeeper 3.5.7 and 3.6.1 with a lot of different configs, the essence of the failure remains: a kafka pod fails to start up, with timeouts like:

[2020-05-03 16:24:29,532] INFO Opening socket connection to server zoo/10.28.2.7:2181 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:29,533] INFO Socket connection established, initiating session, client: /10.28.2.3:45490, server: zoo/10.28.2.7:2181 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:35,536] WARN Client session timed out, have not heard from server in 6002ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:35,536] INFO Client session timed out, have not heard from server in 6002ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:36,256] INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient)
[2020-05-03 16:24:37,241] INFO Opening socket connection to server zoo/10.28.0.2:2181 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:37,243] INFO Socket connection established, initiating session, client: /10.28.2.3:50054, server: zoo/10.28.0.2:2181 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:43,251] WARN Client session timed out, have not heard from server in 6007ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:43,265] ERROR Timeout error occurred for the packet 'clientPath:null serverPath:null finished:false header:: 0,-11  replyHeader:: 0,0,0  request:: null response:: null'. (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:43,368] INFO EventThread shut down for session: 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:43,369] INFO Session: 0x0 closed (org.apache.zookeeper.ZooKeeper)
[2020-05-03 16:24:43,377] INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient)
[2020-05-03 16:24:43,380] ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
	at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:262)
	at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
	at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:119)
	at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1863)
	at kafka.server.KafkaServer.createZkClient$1(KafkaServer.scala:378)
	at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:403)
	at kafka.server.KafkaServer.startup(KafkaServer.scala:210)
	at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44)
	at kafka.Kafka$.main(Kafka.scala:82)
	at kafka.Kafka.main(Kafka.scala)

The number of retries depends on the combination of zookeeper.session.timeout.ms and zookeeper.connection.timeout.ms. For example, with session timeout 6s and connection timeout 60s you get ~10 retries. The Kafka 2.2.0 changelog says:

ZooKeeper hosts are now re-resolved if connection attempt fails. But if your ZooKeeper host names resolve to multiple addresses and some of them are not reachable, then you may need to increase the connection timeout zookeeper.connection.timeout.ms.

And Kafka 2.5.0 increases the default session timeout to 18s, based on reasoning around transient instability vs genuine failures.
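
For reference, these are plain broker properties; the combination from the example above would look like this in server.properties (values are illustrative, not a recommendation):

zookeeper.session.timeout.ms=6000
zookeeper.connection.timeout.ms=60000
# Kafka 2.5.0 raises the default zookeeper.session.timeout.ms to 18000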

I haven't found any effect of varying these timeouts on the actual cause of the crashloop.
With debug logging on zookeeper you can see that Kafka (pod IP 10.28.2.4) succeeds in connecting, but that the connection closes (normally?). In this log snippet there's a successful connection from a different broker (10.28.0.11) happening in between.

[2020-05-05 04:10:32,348] DEBUG Client attempting to establish new session: session = 0x3000001d47a0013, zxid = 0x0, timeout = 6000, address = /10.28.2.4:47514 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-05-05 04:10:32,351] DEBUG Processing request:: sessionid:0x3000001d47a0013 type:createSession cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a (org.apache.zookeeper.server.quorum.CommitProcessor)
[2020-05-05 04:10:40,450] DEBUG Accepted socket connection from /10.28.0.11:52104 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2020-05-05 04:10:40,451] DEBUG Session establishment request from client /10.28.0.11:52104 client's lastZxid is 0x0 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-05-05 04:10:40,451] DEBUG Client attempting to establish new session: session = 0x3000001d47a0014, zxid = 0x0, timeout = 6000, address = /10.28.0.11:52104 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-05-05 04:10:40,451] DEBUG Processing request:: sessionid:0x3000001d47a0014 type:createSession cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a (org.apache.zookeeper.server.quorum.CommitProcessor)
[2020-05-05 04:10:40,593] DEBUG Closed socket connection for client /10.28.2.4:47514 which had sessionid 0x3000001d47a0013 (org.apache.zookeeper.server.NIOServerCnxn)
[2020-05-05 04:10:40,594] DEBUG Closed socket connection for client /10.28.2.4:47500 which had sessionid 0x3000001d47a0012 (org.apache.zookeeper.server.NIOServerCnxn)
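
For reproducing this kind of trace: the DEBUG lines above come from raising the zookeeper root log level, for example like this (assuming the stock log4j.properties is what the servers use; the exact mechanism depends on the image):

# conf/log4j.properties on the zookeeper servers
zookeeper.root.logger=DEBUG,CONSOLE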

solsson added a commit that referenced this pull request Jul 12, 2020
to simplify troubleshooting in cases like #310 with ephemeral pods

Because of the relation between replica count and container args we're only doing this on fixed-scale variants, not on bases.
solsson added a commit to solsson/dockerfiles that referenced this pull request Aug 16, 2020
Yolean/kubernetes-kafka#310, because when I investigated the timeouts there I found reports that similar errors had been fixed by cleaning up (or trashing) persisted state.

> Detected a direct/mapped ByteBuffer in the image heap.
> A direct ByteBuffer has a pointer to unmanaged C memory,
> and C memory from the image generator is not available at image run time.
> A mapped ByteBuffer references a file descriptor,
> which is no longer open and mapped at run time.

Error: com.oracle.graal.pointsto.constraints.UnsupportedFeatureException: Detected a direct/mapped ByteBuffer in the image heap. A direct ByteBuffer has a pointer to unmanaged C memory, and C memory from the image generator is not available at image run time. A mapped ByteBuffer references a file descriptor, which is no longer open and mapped at run time. To see how this object got instantiated use -H:+TraceClassInitialization. The object was probably created by a class initializer and is reachable from a static field. You can request class initialization at image run time by using the option --initialize-at-run-time=<class-name>. Or you can write your own initialization methods and call them explicitly from your main entry point.
Detailed message:
Trace:
	at parsing org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:78)
Call path from entry point to org.apache.zookeeper.server.persistence.FilePadding.padFile(FileChannel):
	at org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76)
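
The message itself points at the option; a sketch of acting on it (the rest of the native-image command line is omitted, and whether this flag alone is enough for the zookeeper image is untested here):

native-image \
  --initialize-at-run-time=org.apache.zookeeper.server.persistence.FilePadding \
  <remaining build arguments unchanged>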