
Troubleshoot zookeeper connection failures #310

Open
wants to merge 1 commit into base: static-props-some-day

Conversation

@solsson commented Mar 8, 2020

I've experimented with a 3-kafka, 5-zk cluster on a 3-node GKE e2-small cluster, i.e. just enough resources to run the Kafka stack together with some basic infra. During tests I occasionally get kafka pods restarting due to failures to connect to zookeeper. Clients like the kafka-topics CLI also had trouble connecting. There were no zookeeper restarts and no OOMKilled containers. See the first commit comment for some stack traces.

I'm guessing it could be due to throttling.
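
One way to check whether CPU throttling is actually happening on these shared-core nodes (a sketch; the namespace and pod names are assumptions, and the path assumes cgroup v1):

# sample the CFS throttling counters inside a zookeeper pod, twice with some delay
kubectl -n kafka exec zoo-0 -- cat /sys/fs/cgroup/cpu/cpu.stat
# growing nr_throttled / throttled_time between samples would point at throttling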

Kafka for example reports:
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING

And zookeeper logs include things like:
java.io.EOFException
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:397)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1206)

[2020-03-06 12:35:45,056] WARN Connection broken for id 1, my id = 4, error =  (org.apache.zookeeper.server.quorum.QuorumCnxManager)
java.net.SocketException: Socket closed
        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
        at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1206)

I've been able to resolve it twice with this memory limit increase, but of course it could also be that the restart itself resolves the issue.
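
As a quick way to try a similar bump without editing manifests, something like the following could be used (StatefulSet name, namespace and value are placeholders; the actual change is in the commit of this PR):

kubectl -n kafka patch statefulset zoo --type json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "512Mi"}]'
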
solsson added a commit that referenced this pull request Apr 23, 2020
through the service. This helps troubleshoot issues like #310 by pointing out (by podIP) which actual zookeeper connection failed. Also I like the simplification.
@solsson commented Apr 25, 2020

The issue is present in clusters with preemptible nodes as well. I just found https://issues.apache.org/jira/browse/ZOOKEEPER-2938 with fix https://issues.apache.org/jira/browse/ZOOKEEPER-2164, so on the next failure I'll try deleting only the leader. Until now I've scaled down and up again.
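
A sketch of how that could be done (namespace and pod names are assumptions, and it assumes nc is available in the image and that srvr is allowed by 4lw.commands.whitelist):

# report each server's role, then delete only the one that says "Mode: leader"
for p in zoo-0 zoo-1 zoo-2 zoo-3 zoo-4; do
  echo -n "$p: "
  kubectl -n kafka exec "$p" -- sh -c 'echo srvr | nc localhost 2181' | grep Mode
done
kubectl -n kafka delete pod <the-pod-that-reported-leader>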

@solsson commented May 2, 2020

https://zookeeper.apache.org/doc/r3.6.1/releasenotes.html contains:

ZOOKEEPER-2164 - fast leader election keeps failing
ZOOKEEPER-3769 - fast leader election does not end if leader is taken down
ZOOKEEPER-3776 - Cluster stuck not forming up quorum

They might not be the cause of the timeouts, but they definitely make stabilization difficult on ephemeral/shared-core nodes.

@solsson commented May 7, 2020

After testing both Zookeeper 3.5.7 and 3.6.1 with a lot of different configs, the essence of the failure remains: a kafka pod fails to start up, with timeouts like:

[2020-05-03 16:24:29,532] INFO Opening socket connection to server zoo/10.28.2.7:2181 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:29,533] INFO Socket connection established, initiating session, client: /10.28.2.3:45490, server: zoo/10.28.2.7:2181 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:35,536] WARN Client session timed out, have not heard from server in 6002ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:35,536] INFO Client session timed out, have not heard from server in 6002ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:36,256] INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient)
[2020-05-03 16:24:37,241] INFO Opening socket connection to server zoo/10.28.0.2:2181 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:37,243] INFO Socket connection established, initiating session, client: /10.28.2.3:50054, server: zoo/10.28.0.2:2181 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:43,251] WARN Client session timed out, have not heard from server in 6007ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:43,265] ERROR Timeout error occurred for the packet 'clientPath:null serverPath:null finished:false header:: 0,-11  replyHeader:: 0,0,0  request:: null response:: null'. (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:43,368] INFO EventThread shut down for session: 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-05-03 16:24:43,369] INFO Session: 0x0 closed (org.apache.zookeeper.ZooKeeper)
[2020-05-03 16:24:43,377] INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient)
[2020-05-03 16:24:43,380] ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
	at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:262)
	at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
	at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:119)
	at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1863)
	at kafka.server.KafkaServer.createZkClient$1(KafkaServer.scala:378)
	at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:403)
	at kafka.server.KafkaServer.startup(KafkaServer.scala:210)
	at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44)
	at kafka.Kafka$.main(Kafka.scala:82)
	at kafka.Kafka.main(Kafka.scala)

The number of retries depends on the combination of zookeeper.session.timeout.ms and zookeeper.connection.timeout.ms. For example, with session timeout 6s and connection timeout 60s you get ~10 retries. The Kafka 2.2.0 changelog says:

ZooKeeper hosts are now re-resolved if connection attempt fails. But if your ZooKeeper host names resolve to multiple addresses and some of them are not reachable, then you may need to increase the connection timeout zookeeper.connection.timeout.ms.

And Kafka 2.5.0 increases the default session timeout to 18s, based on reasoning around transient instability vs genuine failures.
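
For reference, these are plain broker properties; the combination from the example above would look like this in server.properties (values are illustrative, not a recommendation):

zookeeper.session.timeout.ms=6000
zookeeper.connection.timeout.ms=60000
# Kafka 2.5.0 raises the default zookeeper.session.timeout.ms to 18000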

I haven't found any effect of varying these timeouts on the actual cause of the crashloop.
With debug logging on zookeeper you can see that Kafka (pod IP 10.28.2.4) succeeds in connecting, but that the connection closes (normally?). In this log snippet there's a successful connection from a different broker (10.28.0.11) happening in between.

[2020-05-05 04:10:32,348] DEBUG Client attempting to establish new session: session = 0x3000001d47a0013, zxid = 0x0, timeout = 6000, address = /10.28.2.4:47514 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-05-05 04:10:32,351] DEBUG Processing request:: sessionid:0x3000001d47a0013 type:createSession cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a (org.apache.zookeeper.server.quorum.CommitProcessor)
[2020-05-05 04:10:40,450] DEBUG Accepted socket connection from /10.28.0.11:52104 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2020-05-05 04:10:40,451] DEBUG Session establishment request from client /10.28.0.11:52104 client's lastZxid is 0x0 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-05-05 04:10:40,451] DEBUG Client attempting to establish new session: session = 0x3000001d47a0014, zxid = 0x0, timeout = 6000, address = /10.28.0.11:52104 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-05-05 04:10:40,451] DEBUG Processing request:: sessionid:0x3000001d47a0014 type:createSession cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a (org.apache.zookeeper.server.quorum.CommitProcessor)
[2020-05-05 04:10:40,593] DEBUG Closed socket connection for client /10.28.2.4:47514 which had sessionid 0x3000001d47a0013 (org.apache.zookeeper.server.NIOServerCnxn)
[2020-05-05 04:10:40,594] DEBUG Closed socket connection for client /10.28.2.4:47500 which had sessionid 0x3000001d47a0012 (org.apache.zookeeper.server.NIOServerCnxn)
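
For reproducing this kind of trace: the DEBUG lines above come from raising the zookeeper root log level, for example like this (assuming the stock log4j.properties is what the servers use; the exact mechanism depends on the image):

# conf/log4j.properties on the zookeeper servers
zookeeper.root.logger=DEBUG,CONSOLE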

solsson added a commit that referenced this pull request Jul 12, 2020
to simplify troubleshooting in cases like #310 with ephemeral pods

Because of the relation between replica count and container args we're only doing this on fixed-scale variants, not on bases.
solsson added a commit to solsson/dockerfiles that referenced this pull request Aug 16, 2020
Yolean/kubernetes-kafka#310, because when I investigated the timeouts there I found reports that similar errors had been fixed by cleaning up (or trashing) persisted state.

> Detected a direct/mapped ByteBuffer in the image heap.
> A direct ByteBuffer has a pointer to unmanaged C memory,
> and C memory from the image generator is not available at image run time.
> A mapped ByteBuffer references a file descriptor,
> which is no longer open and mapped at run time.

Error: com.oracle.graal.pointsto.constraints.UnsupportedFeatureException: Detected a direct/mapped ByteBuffer in the image heap. A direct ByteBuffer has a pointer to unmanaged C memory, and C memory from the image generator is not available at image run time. A mapped ByteBuffer references a file descriptor, which is no longer open and mapped at run time. To see how this object got instantiated use -H:+TraceClassInitialization. The object was probably created by a class initializer and is reachable from a static field. You can request class initialization at image run time by using the option --initialize-at-run-time=<class-name>. Or you can write your own initialization methods and call them explicitly from your main entry point.
Detailed message:
Trace:
	at parsing org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:78)
Call path from entry point to org.apache.zookeeper.server.persistence.FilePadding.padFile(FileChannel):
	at org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76)
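
The message itself points at the option; a sketch of acting on it (the rest of the native-image command line is omitted, and whether this flag alone is enough for the zookeeper image is untested here):

native-image \
  --initialize-at-run-time=org.apache.zookeeper.server.persistence.FilePadding \
  <remaining build arguments unchanged>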