Is it possible to expose more ClickHouse Kafka connector metrics, like p99 latency of ingesting to ClickHouse Cloud, # of errors/retriable errors, etc.? #441

Open

georgeli-roblox opened this issue Sep 18, 2024 · 4 comments
Labels: enhancement (New feature or request)

@georgeli-roblox commented Sep 18, 2024

Is your feature request related to a problem? Please describe.

@Paultagoras Thanks for adding the ClickHouse connector metrics in #209:

```
# HELP clickhouse_kafka_connect_total ClickHouseKafkaConnector metric ReceivedRecords
# TYPE clickhouse_kafka_connect_total counter
clickhouse_kafka_connect{attribute="ReceivedRecords",sinktask="33",} 0.0
clickhouse_kafka_connect{attribute="RecordProcessingTime",sinktask="33",} 0.0
clickhouse_kafka_connect{attribute="TaskProcessingTime",sinktask="33",} 0.0
clickhouse_kafka_connect{attribute="ReceivedRecords",sinktask="23",} 362.0
clickhouse_kafka_connect{attribute="RecordProcessingTime",sinktask="23",} 8774846.0
clickhouse_kafka_connect{attribute="TaskProcessingTime",sinktask="23",} 7.5087340575E10
...
```

Also see the graph below.

Some questions:

  1. Is there any more detailed documentation on what exactly TaskProcessingTime/RecordProcessingTime/ReceivedRecords are? They seem intuitive, but we could also take a look at the code.
  2. These metrics have the sinktask tag. Is it possible to link these sinktask numbers with the connector name, since one Kafka Connect cluster can host multiple ClickHouse connectors for different tables?
  3. I am trying to find a way to easily alert on, identify, and troubleshoot whether an issue is on the ClickHouse Cloud side or the Roblox internal Kafka side. Do you think adding more metrics could help, e.g. the # of errors/retries and a ClickHouse ingest latency histogram (p99, p95, p50...)?

Thanks,
George

Describe the solution you'd like

  1. More detailed documentation on the TaskProcessingTime/RecordProcessingTime/ReceivedRecords metrics.
  2. More tags on the metrics besides the sinktask number, e.g. a tag mapping to the name of the ClickHouse connector for a topic.
  3. New metrics that show the health of ingestion to ClickHouse Cloud.


Additional context

[Screenshot 2024-09-16 at 9:23 PM: graph of the metrics above]
@georgeli-roblox georgeli-roblox added the enhancement New feature or request label Sep 18, 2024
@Paultagoras Paultagoras self-assigned this Sep 19, 2024
@Paultagoras (Contributor) commented:

@georgeli-roblox Hi! For Question 1 - do these details clear it up or did you have further questions about it?

@georgeli-roblox (Author) commented:

> @georgeli-roblox Hi! For Question 1 - do these details clear it up or did you have further questions about it?

Ah, I didn't see this documentation before. Thanks. So could TaskProcessingTime be used as the e2e (end-to-end) latency of ingesting to ClickHouse Cloud?

Also, if it needs to wait for the ack from ClickHouse Cloud, it could take longer. And if this metric is the real-time value for processing one record or a batch of records, the Prometheus scraper might have a larger interval in between and miss values. So maybe it would be better to convert this counter/gauge into histograms for p99/p95/p50 latency?

@mzitnik (Collaborator) commented Sep 24, 2024

Hi @georgeli-roblox
After discussing the feature with @Paultagoras, I want to make sure we are aligned and not missing anything.
Currently, our bean name is `com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask`; we are planning to add the topic name for context: `com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask-{topic}`. I do not think we should also break down by partition scope, or add the ability to extract the connector name, since I'm not sure we can retrieve it inside the sink itself.
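
For illustration, a minimal sketch of how a per-topic bean could be registered (the `SinkTaskMetricsMBean` interface and registrar here are hypothetical, not the connector's actual code):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public final class MetricsRegistrar {

    // Hypothetical MBean interface for illustration only; the connector's
    // real metrics class may look different.
    public interface SinkTaskMetricsMBean {
        long getReceivedRecords();
        long getRecordProcessingTime();
        long getTaskProcessingTime();
    }

    // Register one bean per topic so the topic appears in the JMX object name
    // (and becomes a label once a JMX exporter maps it to Prometheus):
    //   com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask-{topic}
    public static void register(SinkTaskMetricsMBean metrics, String topic) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(
                "com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask-" + topic);
        // The StandardMBean wrapper avoids the strict Foo/FooMBean naming convention.
        server.registerMBean(new StandardMBean(metrics, SinkTaskMetricsMBean.class), name);
    }
}
```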

For errors/retries, we will collect everything that we can in the scope of the connector, but if the connector restarts, we will lose that context. I think in that case we would need to somehow collect metrics from the Confluent platform itself (we will investigate this once we are done with everything related to the sink itself; if you have prior knowledge here, we would appreciate your feedback).

Regarding latency, you want us to expose it as a histogram, in the same form Prometheus expects?

@georgeli-roblox (Author) commented:

> Currently, our bean name is `com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask`; we are planning to add the topic name for context: `com.clickhouse:type=ClickHouseKafkaConnector,name=SinkTask-{topic}`. I do not think we should also break down by partition scope, or add the ability to extract the connector name, since I'm not sure we can retrieve it inside the sink itself.

Having the topic in the metric tag instead of the sinkTask # would help a lot. It's possible for multiple connectors to share the same topic, but I think that would be rare(?).

> For errors/retries, we will collect everything that we can in the scope of the connector, but if the connector restarts, we will lose that context. I think in that case we would need to somehow collect metrics from the Confluent platform itself (we will investigate this once we are done with everything related to the sink itself; if you have prior knowledge here, we would appreciate your feedback).

I think it is fine if the connector restarts and the (errors/retries) metrics get reset, e.g. with the COUNTER type, whose value increases monotonically and needs rate() over an interval. Right now, if there are retryable exceptions (defined in a list in the ClickHouse connector), we have to check the service logs.
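
For example, counters along these lines would be enough (the names are illustrative, not the connector's actual code), since rate() treats the post-restart reset to zero as an ordinary counter reset:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: monotonically increasing counters that reset to zero
// when the task restarts; Prometheus rate() handles that reset gracefully.
public final class ErrorCounters {
    private final AtomicLong retriableErrors = new AtomicLong();
    private final AtomicLong fatalErrors = new AtomicLong();

    public void onRetriableError() { retriableErrors.incrementAndGet(); }
    public void onFatalError()     { fatalErrors.incrementAndGet(); }

    // Exposed (e.g. via JMX) as COUNTER-type attributes.
    public long getRetriableErrors() { return retriableErrors.get(); }
    public long getFatalErrors()     { return fatalErrors.get(); }
}
```

An alert on something like `rate(clickhouse_kafka_connect{attribute="RetriableErrors"}[5m]) > 0` (attribute name assumed) would then replace digging through service logs.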

> Regarding latency, you want us to expose it as a histogram, in the same form Prometheus expects?

Yes. I think it would work to use histogram_quantile() over the buckets, or to expose quantiles directly with tags like p999, p99, p95, p75, p50. e.g. see this Kafka metric:

[Screenshot 2024-09-24 at 11:16 AM: example Kafka latency metric with percentile tags]
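
For illustration, a hand-rolled Prometheus-style cumulative-bucket histogram on the connector side could look like this (bucket bounds and names are made up, not the connector's actual code):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative cumulative-bucket histogram in the Prometheus style: each
// observation increments every bucket whose upper bound it fits under, so
// histogram_quantile() can be computed over the exported _bucket series.
public final class LatencyHistogram {
    // Upper bounds in milliseconds; chosen arbitrarily for this sketch.
    private static final long[] BOUNDS_MS = {10, 50, 100, 500, 1000, 5000};

    private final AtomicLongArray buckets = new AtomicLongArray(BOUNDS_MS.length + 1); // last = +Inf
    private final AtomicLong sumMs = new AtomicLong();
    private final AtomicLong count = new AtomicLong();

    public void observe(long latencyMs) {
        for (int i = 0; i < BOUNDS_MS.length; i++) {
            if (latencyMs <= BOUNDS_MS[i]) {
                buckets.incrementAndGet(i);
            }
        }
        buckets.incrementAndGet(BOUNDS_MS.length); // +Inf bucket counts every observation
        sumMs.addAndGet(latencyMs);
        count.incrementAndGet();
    }

    // Accessors for the exporter: one _bucket series per bound, plus _sum/_count.
    public long bucketCount(int i) { return buckets.get(i); }
    public long sum()   { return sumMs.get(); }
    public long count() { return count.get(); }
}
```

With the `_bucket` series exported, `histogram_quantile(0.99, rate(<metric>_bucket[5m]))` gives the p99 directly.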

Thanks @mzitnik @Paultagoras
