Is it possible to expose more ClickHouse Kafka connector metrics, like p99 latency of ingesting to ClickHouse Cloud, # of errors/retriable errors, etc.? #441

Open

georgeli-roblox opened this issue Sep 18, 2024 · 4 comments
Labels: enhancement (New feature or request)

@georgeli-roblox commented Sep 18, 2024

Is your feature request related to a problem? Please describe.

@Paultagoras Thanks for adding the ClickHouse connector metrics in #209:

```
# HELP clickhouse_kafka_connect_total ClickHouseKafkaConnector metric ReceivedRecords
# TYPE clickhouse_kafka_connect_total counter
clickhouse_kafka_connect{attribute="ReceivedRecords",sinktask="33",} 0.0
clickhouse_kafka_connect{attribute="RecordProcessingTime",sinktask="33",} 0.0
clickhouse_kafka_connect{attribute="TaskProcessingTime",sinktask="33",} 0.0
clickhouse_kafka_connect{attribute="ReceivedRecords",sinktask="23",} 362.0
clickhouse_kafka_connect{attribute="RecordProcessingTime",sinktask="23",} 8774846.0
clickhouse_kafka_connect{attribute="TaskProcessingTime",sinktask="23",} 7.5087340575E10
...
```

Also see the graph below.

Some questions:

  1. Is there any more detailed documentation on what exactly TaskProcessingTime/RecordProcessingTime/ReceivedRecords are? They seem intuitive, but we could also take a look at the code.
  2. These metrics have the sinktask tag. Is it possible to link these sinktask numbers with the connector name, since one Kafka Connect cluster can host multiple ClickHouse connectors for different tables?
  3. I am trying to find a way to easily alert on, identify, and troubleshoot whether an issue is on the ClickHouse Cloud side or the Roblox internal Kafka side. Do you think adding more metrics could help, e.g. the # of errors/retries and a ClickHouse ingest latency histogram (p99, p95, p50...)?

Thanks,
George

Describe the solution you'd like

  1. More detailed documentation on the TaskProcessingTime/RecordProcessingTime/ReceivedRecords metrics.
  2. More tags on the metrics besides the sinktask number, e.g. a tag mapping to the name of the ClickHouse connector for a topic.
  3. New metrics that show the health of ingestion to ClickHouse Cloud.


Additional context

[Screenshot 2024-09-16 at 9:23 PM: graph of the metrics above]
@georgeli-roblox georgeli-roblox added the enhancement New feature or request label Sep 18, 2024
@Paultagoras Paultagoras self-assigned this Sep 19, 2024
@Paultagoras (Contributor) commented:

@georgeli-roblox Hi! For Question 1 - do these details clear it up or did you have further questions about it?

@georgeli-roblox (Author) commented:

> @georgeli-roblox Hi! For Question 1 - do these details clear it up or did you have further questions about it?

Ah, I didn't see this documentation before. Thanks. So could TaskProcessingTime be used as the e2e (end-to-end) latency of ingesting to ClickHouse Cloud?

Also, if it needs to wait for the ack from ClickHouse Cloud, it could take longer. And if this metric is the real-time value for processing one record or a batch of records, the Prometheus scraper might have a larger interval in between and miss values. So maybe it would be better to convert this counter/gauge into histograms for p99/p95/p50 latency?

@mzitnik (Collaborator) commented Sep 24, 2024

Hi @georgeli-roblox
After discussing the feature with @Paultagoras, I want to make sure we are aligned and not missing anything.
Currently, our bean name is `com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask`; we are planning to add the topic name for context: `com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask-{topic}`. I do not think we should also break down by partition scope, or add the ability to extract the connector name, since I'm not sure we can retrieve it inside the sink itself.
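
For illustration, a minimal sketch of how a per-topic bean could be registered (the `SinkTaskMetricsMBean` interface and registrar here are hypothetical, not the connector's actual code):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public final class MetricsRegistrar {

    // Hypothetical MBean interface for illustration only; the connector's
    // real metrics class may look different.
    public interface SinkTaskMetricsMBean {
        long getReceivedRecords();
        long getRecordProcessingTime();
        long getTaskProcessingTime();
    }

    // Register one bean per topic so the topic appears in the JMX object name
    // (and becomes a label once a JMX exporter maps it to Prometheus):
    //   com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask-{topic}
    public static void register(SinkTaskMetricsMBean metrics, String topic) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(
                "com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask-" + topic);
        // The StandardMBean wrapper avoids the strict Foo/FooMBean naming convention.
        server.registerMBean(new StandardMBean(metrics, SinkTaskMetricsMBean.class), name);
    }
}
```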

For errors/retries, we will collect everything that we can in the scope of the connector, but if the connector restarts, we will lose that context. I think in that case we would need to somehow collect metrics from the Confluent platform itself (we will investigate this once we are done with everything related to the sink itself; if you have prior knowledge here, we would appreciate your feedback).

Regarding latency, you want us to expose it as a histogram, in the same form Prometheus expects?

@georgeli-roblox (Author) commented:

> Currently, our bean name is `com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask`; we are planning to add the topic name for context: `com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask-{topic}`. I do not think we should also break down by partition scope, or add the ability to extract the connector name, since I'm not sure we can retrieve it inside the sink itself.

Having the topic in the metric tag instead of the sinkTask # would help a lot. It's possible for multiple connectors to share the same topic, but I think that would be rare(?).

> For errors/retries, we will collect everything that we can in the scope of the connector, but if the connector restarts, we will lose that context. I think in that case we would need to somehow collect metrics from the Confluent platform itself (we will investigate this once we are done with everything related to the sink itself; if you have prior knowledge here, we would appreciate your feedback).

I think it is fine if the connector restarts and the (errors/retries) metrics get reset, e.g. with the COUNTER type, whose value increases monotonically and needs rate() over an interval. Right now, if there are retryable exceptions (defined in a list in the ClickHouse connector), we have to check the service logs.
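
For example, counters along these lines would be enough (the names are illustrative, not the connector's actual code), since rate() treats the post-restart reset to zero as an ordinary counter reset:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: monotonically increasing counters that reset to zero
// when the task restarts; Prometheus rate() handles that reset gracefully.
public final class ErrorCounters {
    private final AtomicLong retriableErrors = new AtomicLong();
    private final AtomicLong fatalErrors = new AtomicLong();

    public void onRetriableError() { retriableErrors.incrementAndGet(); }
    public void onFatalError()     { fatalErrors.incrementAndGet(); }

    // Exposed (e.g. via JMX) as COUNTER-type attributes.
    public long getRetriableErrors() { return retriableErrors.get(); }
    public long getFatalErrors()     { return fatalErrors.get(); }
}
```

An alert on something like `rate(clickhouse_kafka_connect{attribute="RetriableErrors"}[5m]) > 0` (attribute name assumed) would then replace digging through service logs.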

> Regarding latency, you want us to expose it as a histogram, in the same form Prometheus expects?

Yes. I think it would work to use histogram_quantile() over the buckets, or to expose quantiles directly with tags like p999, p99, p95, p75, p50. e.g. see this Kafka metric:

[Screenshot 2024-09-24 at 11:16 AM: example Kafka latency metric with percentile tags]
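
For illustration, a hand-rolled Prometheus-style cumulative-bucket histogram on the connector side could look like this (bucket bounds and names are made up, not the connector's actual code):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative cumulative-bucket histogram in the Prometheus style: each
// observation increments every bucket whose upper bound it fits under, so
// histogram_quantile() can be computed over the exported _bucket series.
public final class LatencyHistogram {
    // Upper bounds in milliseconds; chosen arbitrarily for this sketch.
    private static final long[] BOUNDS_MS = {10, 50, 100, 500, 1000, 5000};

    private final AtomicLongArray buckets = new AtomicLongArray(BOUNDS_MS.length + 1); // last = +Inf
    private final AtomicLong sumMs = new AtomicLong();
    private final AtomicLong count = new AtomicLong();

    public void observe(long latencyMs) {
        for (int i = 0; i < BOUNDS_MS.length; i++) {
            if (latencyMs <= BOUNDS_MS[i]) {
                buckets.incrementAndGet(i);
            }
        }
        buckets.incrementAndGet(BOUNDS_MS.length); // +Inf bucket counts every observation
        sumMs.addAndGet(latencyMs);
        count.incrementAndGet();
    }

    // Accessors for the exporter: one _bucket series per bound, plus _sum/_count.
    public long bucketCount(int i) { return buckets.get(i); }
    public long sum()   { return sumMs.get(); }
    public long count() { return count.get(); }
}
```

With the `_bucket` series exported, `histogram_quantile(0.99, rate(<metric>_bucket[5m]))` gives the p99 directly.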

Thanks @mzitnik @Paultagoras
