Description
Overview
As part of the work to roll out replicated ClickHouse, we'll need some long-running tests to ensure the stability of the replicated cluster. Specifically, we want to know how stable the cluster is when left alone under load for an extended period (i.e., days or weeks).
We'll need to monitor the system and periodically answer the following questions over a long period (a month or so?):
- Is data consistent across all replicas?
- Is query performance acceptable under load?
- Are the queue lengths acceptable under load?
- Do queue lengths grow over time, or do they stay consistent for a given load?
- <more?>
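As a rough illustration of how two of these questions could be turned into automated checks, here is a minimal sketch. It assumes we already have periodic samples to work with: a per-replica checksum (or row count) for a table, and a series of queue lengths taken at a fixed interval. The function names and the input shapes are hypothetical and not part of any existing tooling.

```python
from typing import Mapping, Sequence


def replicas_agree(checksums: Mapping[str, str]) -> bool:
    """Return True when every replica reports the same checksum.

    `checksums` maps a (hypothetical) replica name to whatever digest or
    row count was sampled from that replica.
    """
    return len(set(checksums.values())) <= 1


def queue_growth_per_sample(lengths: Sequence[float]) -> float:
    """Least-squares slope of queue length against sample index.

    A slope near zero suggests the queue is stable for the current load;
    a persistently positive slope suggests it is growing over time.
    """
    n = len(lengths)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(lengths) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(lengths))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance
```

What counts as an acceptable slope, and how often to sample, would still need to be decided as part of the testing work.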
Implementation
We'll probably want to use clickhouse-admin to extract information from the system. There is a clickhouse-admin binary already installed in each of the clickhouse-{server|keeper} nodes.
To retrieve the information we need, we can leverage the following native ClickHouse tooling:
- ClickHouse creates trace spans for each query. This tracing information can be retrieved from system.opentelemetry_span_log
- Queue information can be retrieved from system.replication_queue and system.distributed_ddl_queue
- To verify data consistency, we can use some of the tools described in the ClickHouse knowledge base.
- Query performance information can be retrieved from system.user_processes
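To make the queue-related items above a little more concrete, the following sketch shows one way the queue tables could be sampled through the ClickHouse HTTP interface. It is an assumption-laden illustration only: the helper names are hypothetical, port 8123 is the ClickHouse default HTTP port and may differ in our deployment, and the exact queries (including the filter on the status column) would need to be checked against the ClickHouse version we run.

```python
import json
from urllib.parse import urlencode

# Number of pending replication entries per table (assumed query shape).
REPLICATION_QUEUE_SQL = (
    "SELECT database, table, count() AS entries "
    "FROM system.replication_queue "
    "GROUP BY database, table "
    "FORMAT JSONEachRow"
)

# Number of distributed DDL entries that have not finished (assumed filter).
DDL_QUEUE_SQL = (
    "SELECT count() AS entries "
    "FROM system.distributed_ddl_queue "
    "WHERE status != 'Finished' "
    "FORMAT JSONEachRow"
)


def query_url(host: str, sql: str, port: int = 8123) -> str:
    """Build a URL that runs `sql` via the ClickHouse HTTP interface."""
    return f"http://{host}:{port}/?{urlencode({'query': sql})}"


def total_entries(body: str) -> int:
    """Sum the `entries` column of a JSONEachRow response body.

    64-bit integers may arrive quoted or unquoted depending on server
    settings, so each value is passed through int().
    """
    return sum(
        int(json.loads(line)["entries"])
        for line in body.splitlines()
        if line.strip()
    )
```

The response body for a URL built with query_url could be fetched with any HTTP client and passed to total_entries to get a single queue length per sample.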
Altinity has a pretty cool ClickHouse stress test suite. We can probably use it for running stress tests, or take inspiration from it to create our own stress tests.
Relevant links
- https://clickhouse.com/docs/en/operations/system-tables
- https://clickhouse.com/docs/en/operations/settings/settings#opentelemetry_start_trace_probability
- https://clickhouse.com/docs/en/operations/startup-scripts
- https://clickhouse.com/docs/en/operations/system-tables/opentelemetry_span_log
- https://clickhouse.com/blog/storing-traces-and-spans-open-telemetry-in-clickhouse
- https://clickhouse.com/docs/en/operations/system-tables/query_thread_log
- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings#query-log
- https://clickhouse.com/docs/en/operations/settings/settings#log_queries
- https://clickhouse.com/docs/en/operations/monitoring
- https://altinity.com/blog/unraveling-the-mystery-of-idle-threads-in-clickhouse
- https://chistadata.com/knowledge-base/troubleshooting-clickhouse-performance/
Tasks
- Distributed DDL queue info retrieval [clickhouse] Retrieve distributed ddl queue info #6986
- Replication queue info retrieval
- OTEL traces info ?
- Query thread log
- Query metric info
- Enable async metric log and metric log system tables [clickhouse] Enable system.metric_log and system.asynchronous_metric_log tables #7100
- Clickhouse-admin endpoint to query system tables [clickhouse] Internal monitoring endpoint #7114
- Monitoring tool to visualise internal metrics [clickhouse] Clickana monitoring dashboard tool #7207