- Download the dashboard JSON file from the GitHub release page
- Go to domain:3000/dashboard/import
- Upload the JSON file
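
The upload step can also be scripted. The following is a minimal sketch, not the project's own tooling: it assumes the downloaded file is a plain dashboard model, that Grafana's HTTP API is reachable at the same host and port as the UI, and that an API token is available. The helper names `build_import_payload` and `build_import_request` are hypothetical. Exports that contain datasource placeholders may need the UI import instead.

```python
import json
import urllib.request


def build_import_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap a dashboard model in the body sent to POST /api/dashboards/db."""
    dashboard = dict(dashboard)  # shallow copy so the caller's dict is untouched
    dashboard["id"] = None       # a null id asks Grafana to create a new dashboard
    return {"dashboard": dashboard, "overwrite": overwrite}


def build_import_request(base_url: str, token: str, dashboard: dict) -> urllib.request.Request:
    """Build (but do not send) the HTTP request that uploads the dashboard."""
    body = json.dumps(build_import_payload(dashboard)).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/dashboards/db",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # token value is an assumption
        },
    )
```

To send the request, pass the result to `urllib.request.urlopen`, for example with `base_url="http://domain:3000"` and the parsed contents of the downloaded JSON file.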
Total number of active controllers
sum(kafka_controller_kafkacontroller_activecontrollercount)
Unhealthy: any sum other than 1 is abnormal.
Number of under-replicated partitions (partitions whose replicas are still being copied).
sum(kafka_cluster_partition_underreplicated)
Unhealthy: any value > 0. Note that the value is also > 0 while the Kafka cluster is reassigning partitions.
Total number of offline partitions. (Partitions in this state are neither readable nor writable.)
sum(kafka_controller_kafkacontroller_offlinepartitionscount)
Unhealthy: any value > 0 is abnormal.
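
The three health rules above can be combined into a single check. This is a minimal sketch that assumes the three query results have already been fetched as numbers. The function name `kafka_health_issues` and the `reassigning` flag are hypothetical and are not part of the dashboard.

```python
from typing import List


def kafka_health_issues(
    active_controllers: float,
    under_replicated: float,
    offline_partitions: float,
    reassigning: bool = False,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means healthy."""
    issues = []
    # Exactly one active controller is expected across the cluster.
    if active_controllers != 1:
        issues.append("active controller count is %g, expected 1" % active_controllers)
    # Under-replicated partitions are tolerated only during a reassignment.
    if under_replicated > 0 and not reassigning:
        issues.append("%g under-replicated partitions" % under_replicated)
    # Offline partitions are neither readable nor writable, so any is a problem.
    if offline_partitions > 0:
        issues.append("%g offline partitions" % offline_partitions)
    return issues
```

Passing `reassigning=True` suppresses only the under-replication warning, which mirrors the note that the value is also > 0 during partition reassignment.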
Expansion rate of in-sync replicas
sum(rate(kafka_server_replicamanager_isrexpands_total[5m]))
Shrinkage rate of in-sync replicas
sum(rate(kafka_server_replicamanager_isrshrinks_total[5m]))
Time a Produce request waits for the follower
avg(kafka_network_requestmetrics_remotetimems)
Time the request waits in the response queue
avg(kafka_network_requestmetrics_responsequeuetimems)
Top 5 topics by output bytes
topk(5, (sum by(topic) ((rate(kafka_server_brokertopicmetrics_bytesout_total[5m]))
or (irate(kafka_server_brokertopicmetrics_bytesout_total[5m])))))
Top 5 topics by input bytes
topk(5, (sum by(topic) (rate(kafka_server_brokertopicmetrics_bytesin_total[5m])
or irate(kafka_server_brokertopicmetrics_bytesin_total[5m]))))
Top 5 topics by messages received
topk(5, sum by(topic)(rate(kafka_server_brokertopicmetrics_messagesin_total[5m])))
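
The ranking queries above all follow the same pattern: sum the per-series rates by topic, then keep the largest few. A minimal sketch of that aggregation outside Prometheus, assuming each sample is a `(labels, value)` pair. The function name `top_topics` and the sample layout are hypothetical.

```python
from collections import defaultdict


def top_topics(samples, k=5):
    """Sum values per topic label, then return the k largest as (topic, total)."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels["topic"]] += value  # equivalent of sum by(topic)
    # Equivalent of topk(k, ...): sort descending by total and keep the first k.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]
```

For example, two samples for the same topic on different pods are added together before ranking, just as `sum by(topic)` collapses the `pod` label.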
count(kafka_server_kafkaserver_brokerstate{pod=~"$Pod"})
kafka_server_replicamanager_partitioncount{pod=~"$Pod"}
kafka_server_replicamanager_leadercount{pod=~"$Pod"}
### Network Processor NOT idle
Average fraction of time the network processor threads are not idle (1 minus the idle fraction)
1 - avg(kafka_network_processor_idle_percent{pod=~"$Pod"}) by (pod)
sum by (pod)(irate(kafka_network_requestmetrics_errors_total{pod=~"$Pod"}[5m]))
sum(kafka_server_socket_server_metrics_connection_count{pod=~"$Pod"}) by (pod)
PurgatorySize: number of requests waiting in the producer purgatory and in the fetch purgatory
sum(kafka_server_delayedoperationpurgatory_purgatorysize{pod=~"$Pod"}) by (delayedOperation)
The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.
avg(kafka_server_socket_server_metrics_io_wait_time_ns_avg) by (pod)
Monitor the records-lag-max metric from the Java consumer
max(kafka_server_replicafetchermanager_maxlag)
Consumer message rate
rate(kafka_consumergroup_current_offset[1m])
avg(kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms)
avg(kafka_server_socket_server_metrics_request_rate)
avg(rate(kafka_server_socket_server_metrics_response_rate[5m])) * 100
avg(kafka_server_socket_server_metrics_io_wait_time_ns_avg)
CPU usage in Kafka (this calculation has not been verified)
sum(rate(container_cpu_user_seconds_total{namespace=~".*kafka.*"}[30s])) * 100
process_open_fds{container="kafka"}
Open file descriptors as a percentage of the limit
(process_open_fds{container="kafka"} / process_max_fds{container="kafka"}) * 100
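
The percentage above is plain arithmetic on the two gauges. A minimal sketch with made-up numbers; the function name `fd_usage_percent` is hypothetical.

```python
def fd_usage_percent(open_fds: float, max_fds: float) -> float:
    """Return open file descriptors as a percentage of the process limit."""
    if max_fds <= 0:
        # Guard against a missing or zero limit instead of dividing by zero.
        raise ValueError("max_fds must be positive")
    return open_fds / max_fds * 100


# Illustrative values only: 512 open descriptors against a limit of 1024.
print(fd_usage_percent(512, 1024))  # prints 50.0
```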
Top 10 topics by log size (only topics with >= 1 KiB are shown)
topk(10, sum by(topic) (kafka_log_log_size{container="kafka"}) >= 1024)
Top 10 log sizes
topk(10, kafka_log_log_size{container="kafka"})