Questions on functionality #18

lconnell · 2017-11-03T10:40:55Z

Thanks for building this check! I have some questions about behavior I've been seeing while using this tool, and I want to make sure I'm using it properly.

The replication factor and partition count are both set to 1 on the broker-replication-check topic. This doesn't seem right, since with a single replica the check couldn't determine whether all brokers are in the ISR.

I recently had a Kafka outage. After I recovered the cluster, I had to restart the Kafka health check service on ALL of my brokers before it would detect that the brokers were healthy again. This is most likely due to losing the connection to the cluster/ZooKeeper. However, in one of my environments, where I'm having trouble getting kafka-health-check to report a healthy cluster, I can see that it does make reconnect attempts:

INFO[0037] closing connection and reconnecting
INFO[0042] found partition id 1 for broker 0 in topic "broker-0-health-check"
INFO[0042] found partition id 2 for broker 0 in topic "broker-replication-check"
INFO[0042] reconnected

I still can't figure out why kafka-health-check won't report green on this cluster. I recompiled the check with an increased timeout, but that made no difference. This is a fresh Kafka cluster whose only topic is __consumer_offsets. The check just reports NOOK and keeps looping through the reconnect cycle shown above.

Thank you!


andreas-schroeder · 2017-11-13T19:29:03Z

Hi @lconnell,

thanks for trying kafka-health-check :)
Concerning the broker-replication-check topic: its replica set is expanded when a broker's health check starts and shrunk when it shuts down, so each running health check adds its own broker to the replica set.
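The expand/shrink bookkeeping can be pictured with a small Go sketch. It assumes replica assignments are plain slices of broker IDs. expandReplicas, shrinkReplicas and allInSync are hypothetical helpers for illustration only, not functions from kafka-health-check's source. The sketch also shows why a one-replica topic says little about the cluster: allInSync can only ever look at that single broker.

```go
package main

import "sort"

// expandReplicas adds brokerID to the replica list, as happens (per the
// comment above) when a broker's health check starts. Hypothetical helper.
// If the broker is already a replica, the input slice is returned unchanged.
func expandReplicas(replicas []int32, brokerID int32) []int32 {
	for _, r := range replicas {
		if r == brokerID {
			return replicas
		}
	}
	out := append(append([]int32(nil), replicas...), brokerID)
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// shrinkReplicas removes brokerID from the replica list, as happens when
// that broker's health check shuts down. Hypothetical helper.
func shrinkReplicas(replicas []int32, brokerID int32) []int32 {
	out := make([]int32, 0, len(replicas))
	for _, r := range replicas {
		if r != brokerID {
			out = append(out, r)
		}
	}
	return out
}

// allInSync reports whether every assigned replica also appears in the ISR.
// With a single replica, it only ever checks that one broker.
func allInSync(replicas, isr []int32) bool {
	inISR := make(map[int32]bool, len(isr))
	for _, r := range isr {
		inISR[r] = true
	}
	for _, r := range replicas {
		if !inISR[r] {
			return false
		}
	}
	return true
}
```

Starting from []int32{0} and calling expandReplicas for brokers 2 and 1 yields [0 1 2]. After that, allInSync only passes if all three brokers are in the ISR.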

Having to restart the health check on each and every node is of course sub-par. Can you provide details on the health check output? Maybe it's because the health check topics are only auto-created on initial startup: if they vanish during runtime, the health check will not re-create them.
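If vanished topics are the cause, one option would be to verify the topics on every check rather than only at startup. Below is a hedged Go sketch of that idea. topicAdmin is a hypothetical interface, not kafka-health-check's or any Kafka client library's real API. fakeAdmin is an in-memory stand-in used only to exercise the logic; a real version would back the interface with an actual Kafka admin client.

```go
package main

import "errors"

// topicAdmin is a hypothetical abstraction over a Kafka admin client.
type topicAdmin interface {
	TopicExists(name string) (bool, error)
	CreateTopic(name string, partitions int32, replicationFactor int16) error
}

// ensureTopic re-creates a health check topic if it has disappeared.
// It returns created=true only when it actually had to create the topic.
func ensureTopic(admin topicAdmin, name string, partitions int32, rf int16) (created bool, err error) {
	exists, err := admin.TopicExists(name)
	if err != nil {
		return false, err
	}
	if exists {
		return false, nil
	}
	if err := admin.CreateTopic(name, partitions, rf); err != nil {
		return false, err
	}
	return true, nil
}

// fakeAdmin keeps topics in a map; for illustration and testing only.
type fakeAdmin struct{ topics map[string]bool }

func (f *fakeAdmin) TopicExists(name string) (bool, error) { return f.topics[name], nil }

func (f *fakeAdmin) CreateTopic(name string, _ int32, _ int16) error {
	if f.topics[name] {
		return errors.New("topic already exists")
	}
	f.topics[name] = true
	return nil
}
```

Calling ensureTopic at the start of each check cycle would let the health check recover from topics being deleted while it runs. This avoids restarting the service on every broker.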

As for why the cluster isn't reported as healthy, I'd assume there is still some leftover state in ZooKeeper, since the check finds partition id 2 in broker-replication-check. Could you share the JSON returned by the health check's / endpoint and its /cluster endpoint? That could give more insight into what is bothering the health check.
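To collect the requested JSON, a short Go sketch like the following could fetch both endpoints. The base URL used below (http://localhost:8000) is an assumption; substitute whatever host and port your health check listens on. The body is pretty-printed as generic JSON, so no particular response schema is assumed.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
)

// indentJSON pretty-prints an arbitrary JSON document without assuming its schema.
func indentJSON(raw []byte) (string, error) {
	var buf bytes.Buffer
	if err := json.Indent(&buf, raw, "", "  "); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// dumpEndpoint GETs baseURL+path and returns the HTTP status code and the
// body, indented if it is valid JSON and returned verbatim otherwise.
func dumpEndpoint(client *http.Client, baseURL, path string) (int, string, error) {
	resp, err := client.Get(baseURL + path)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	pretty, err := indentJSON(raw)
	if err != nil {
		return resp.StatusCode, string(raw), nil // not JSON: return the raw body
	}
	return resp.StatusCode, pretty, nil
}
```

For example, with client := &http.Client{Timeout: 5 * time.Second}, you would call dumpEndpoint(client, "http://localhost:8000", "/") and then dumpEndpoint(client, "http://localhost:8000", "/cluster"), and paste both outputs into the issue.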
