# prometheus-alert-rules-clickhouse.yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
prometheus: prometheus
role: alert-rules
name: prometheus-clickhouse-operator-rules
spec:
groups:
- name: ClickHouseOperatorRules
rules:
- alert: ClickHouseMetricsExporterDown
expr: up{app='clickhouse-operator'} == 0
labels:
severity: critical
annotations:
identifier: "{{ $labels.pod_name }}"
summary: "metrics-exporter possible down"
description: |-
`metrics-exporter` has not sent data for more than 1 minute.
Please check the instance status
```kubectl logs -n {{ $labels.namespace }} {{ $labels.pod_name }} -c metrics-exporter -f```
- alert: ClickHouseServerDown
expr: chi_clickhouse_metric_fetch_errors{fetch_type='system.metrics'} > 0
labels:
severity: critical
annotations:
identifier: "{{ $labels.hostname }}"
summary: "clickhouse-server possible down"
description: |-
`metrics-exporter` failed to fetch `{{ $labels.fetch_type }}` metrics.
Please check the instance status
```kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1)```
- alert: ClickHouseMetricsExporterFetchErrors
expr: chi_clickhouse_metric_fetch_errors{fetch_type!='system.metrics'} > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "clickhouse-server possible down"
description: |-
`metrics-exporter` failed to fetch `{{ $labels.fetch_type }}` metrics.
Please check the instance status
```kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1)```
- alert: ClickHouseServerRestartRecently
expr: chi_clickhouse_metric_Uptime > 1 < 180
labels:
severity: warning
annotations:
identifier: "{{ $labels.hostname }}"
summary: "clickhouse-server started recently"
description: |-
`chi_clickhouse_metric_Uptime` = {{ with printf "chi_clickhouse_metric_Uptime{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }} seconds {{ end }}
The `clickhouse-server` process was started less than 3 minutes ago.
Look at the previous ClickHouse pod log to investigate the restart reason
```kubectl logs -n {{ $labels.exported_namespace }} $( echo {{ $labels.hostname }} | cut -d '.' -f 1)-0 --previous```
- alert: ClickHouseDNSErrors
expr: increase(chi_clickhouse_event_DNSError[1m]) > 0 or increase(chi_clickhouse_event_NetworkErrors[1m]) > 0
labels:
severity: warning
annotations:
identifier: "{{ $labels.hostname }}"
summary: "DNS errors occurred"
description: |-
`increase(chi_clickhouse_event_DNSError[1m])` = {{ with printf "increase(chi_clickhouse_event_DNSError{hostname='%s',exported_namespace='%s'}[1m]) or increase(chi_clickhouse_event_NetworkErrors{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }} errors{{ end }}
Please check the DNS settings in `/etc/resolv.conf` and the `<remote_servers>` section of `/etc/clickhouse-server/`
See documentation:
- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server-settings-remote-servers
- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server-settings-disable-internal-dns-cache
- https://clickhouse.com/docs/en/sql-reference/statements/system/
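If a stale internal DNS cache is suspected, one possible mitigation (a sketch, assuming the default `clickhouse-client` is available inside the pod) is to drop it:
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SYSTEM DROP DNS CACHE"```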
- alert: ClickHouseDistributedFilesToInsertHigh
expr: chi_clickhouse_metric_DistributedFilesToInsert > 50
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "clickhouse-server have Distributed Files to Insert > 50"
description: |-
`chi_clickhouse_metric_DistributedFilesToInsert` = {{ with printf "chi_clickhouse_metric_DistributedFilesToInsert{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value }} files{{ end }}
`clickhouse-server` has too many files that have not yet been inserted into `*MergeTree` tables via the `Distributed` table engine.
Check unsynced .bin files via ```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- ls -la /var/lib/clickhouse/data/*/*/*/*.bin```
Also, check documentation:
https://clickhouse.com/docs/en/engines/table-engines/special/distributed/
When you insert data into a `Distributed` table,
the data is written to the target `*MergeTree` tables asynchronously.
On insert, the data block is first written to the local file system.
The data is then sent to the remote servers in the background as soon as possible.
The period for sending data is managed by the `distributed_directory_monitor_sleep_time_ms` and `distributed_directory_monitor_max_sleep_time_ms` settings.
The Distributed engine sends each file with inserted data separately, but you can enable batch sending of files with the `distributed_directory_monitor_batch_inserts` setting.
Also, you can manage distributed tables:
https://clickhouse.com/docs/en/sql-reference/statements/system/#managing-distributed-tables
- alert: ClickHouseDistributedConnectionExceptions
expr: increase(chi_clickhouse_event_DistributedConnectionFailTry[1m]) > 0 or increase(chi_clickhouse_event_DistributedConnectionFailAtAll[1m]) > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Distributed connections fails occurred"
description: |-
`increase(chi_clickhouse_event_DistributedConnectionFailTry[1m])` = {{ with printf "increase(chi_clickhouse_event_DistributedConnectionFailTry{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }} errors{{ end }}
`increase(chi_clickhouse_event_DistributedConnectionFailAtAll[1m])` = {{ with printf "increase(chi_clickhouse_event_DistributedConnectionFailAtAll{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }} errors{{ end }}
Please check connectivity between the clickhouse server and the hosts in `remote_servers` in `/etc/clickhouse-server/`
https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server-settings-remote-servers
Also, you can check logs:
```kubectl logs -n {{ $labels.exported_namespace }} $( echo {{ $labels.hostname }} | cut -d '.' -f 1)-0 -f```
- alert: ClickHouseRejectedInsert
expr: increase(chi_clickhouse_event_RejectedInserts[1m]) > 0
labels:
severity: critical
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Rejected INSERT queries occurred"
description: |-
`increase(chi_clickhouse_event_RejectedInserts[1m])` = {{ with printf "increase(chi_clickhouse_event_RejectedInserts{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }} queries{{ end }}
`clickhouse-server` has INSERT queries that are rejected due to a high number of active data parts per partition in a MergeTree table; please decrease the INSERT frequency
MergeTreeArchitecture
https://clickhouse.com/docs/en/development/architecture/#merge-tree
system.part_log
https://clickhouse.com/docs/en/operations/system-tables/part_log
system.merge_tree_settings
https://clickhouse.com/docs/en/operations/system-tables/merge_tree_settings
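As a sketch (assuming `clickhouse-client` is available inside the pod), you can check the current per-partition part thresholds:
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT name, value FROM system.merge_tree_settings WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert')"```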
- alert: ClickHouseDelayedInsertThrottling
expr: increase(chi_clickhouse_event_DelayedInserts[1m]) > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Delayed INSERT queries occurred"
description: |-
`increase(chi_clickhouse_event_DelayedInserts[1m])` = {{ with printf "increase(chi_clickhouse_event_DelayedInserts{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }} queries{{ end }}
`clickhouse-server` has INSERT queries that are throttled due to a high number of active data parts per partition in a MergeTree table; please decrease the INSERT frequency
https://clickhouse.com/docs/en/development/architecture/#merge-tree
- alert: ClickHouseMaxPartCountForPartition
expr: chi_clickhouse_metric_MaxPartCountForPartition > 100
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Max parts per partition > 100"
description: |-
`chi_clickhouse_metric_MaxPartCountForPartition` = {{ with printf "chi_clickhouse_metric_MaxPartCountForPartition{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value }} parts{{ end }}
`clickhouse-server` has too many parts in one partition.
The ClickHouse MergeTree table engine splits each INSERT query across partitions (by the PARTITION BY expression)
and adds one or more parts per INSERT inside each partition; a background merge process then runs.
When a partition has too many unmerged parts,
SELECT query performance can degrade significantly, so ClickHouse delays or rejects INSERTs
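To find the affected partitions, one possible query (a sketch, assuming `clickhouse-client` is available inside the pod):
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT database, table, partition_id, count() AS parts FROM system.parts WHERE active GROUP BY database, table, partition_id ORDER BY parts DESC LIMIT 10 FORMAT PrettyCompactMonoBlock"```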
- alert: ClickHouseLowInsertedRowsPerQuery
expr: increase(chi_clickhouse_event_InsertQuery[1m]) > 0 and (increase(chi_clickhouse_event_InsertedRows[1m]) / increase(chi_clickhouse_event_InsertQuery[1m]) <= 1000)
labels:
severity: warning
annotations:
identifier: "{{ $labels.hostname }}"
summary: "please increase inserted rows per INSERT query"
description: |-
`increase(chi_clickhouse_event_InsertedRows[1m]) / increase(chi_clickhouse_event_InsertQuery[1m])` = {{ with printf "increase(chi_clickhouse_event_InsertedRows{hostname='%s',exported_namespace='%s'}[1m]) / increase(chi_clickhouse_event_InsertQuery{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }} rows per query{{ end }}
`clickhouse-server` has a low insert rate.
https://clickhouse.com/docs/en/about-us/performance/#performance-when-inserting-data
The ClickHouse team recommends inserting data in batches of at least 1000 rows, or no more than a single request per second.
Please use a Buffer table
https://clickhouse.com/docs/en/engines/table-engines/special/buffer/
or
https://clickhouse.com/docs/en/operations/settings/settings/#async-insert
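As a sketch (assuming a ClickHouse version that supports asynchronous inserts and `clickhouse-client` inside the pod), you can check whether async inserts are enabled:
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT name, value, changed FROM system.settings WHERE name LIKE 'async_insert%'"```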
- alert: ClickHouseLongestRunningQuery
expr: chi_clickhouse_metric_LongestRunningQuery > 600
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Long running queries occurred"
description: |-
`clickhouse-server` has queries that have been running for more than `chi_clickhouse_metric_LongestRunningQuery` = {{ with printf "chi_clickhouse_metric_LongestRunningQuery{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }} seconds{{ end }}
Look at system.processes for long-running queries
https://clickhouse.com/docs/en/operations/system-tables/processes
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT * FROM system.processes WHERE elapsed >= 600 FORMAT Vertical" | less```
- alert: ClickHouseQueryPreempted
expr: chi_clickhouse_metric_QueryPreempted > 0
labels:
severity: warning
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Preempted queries occurred"
description: |-
`clickhouse-server` has `chi_clickhouse_metric_QueryPreempted` = {{ with printf "chi_clickhouse_metric_QueryPreempted{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value }} queries{{ end }}
These are queries that are stopped and waiting due to the 'priority' setting.
Look at system.processes
https://clickhouse.com/docs/en/operations/system-tables/processes
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT * FROM system.processes FORMAT Vertical" | less```
- alert: ClickHouseReadonlyReplica
expr: chi_clickhouse_metric_ReadonlyReplica > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "ReadOnly replica occurred"
description: |-
`chi_clickhouse_metric_ReadonlyReplica` = {{ with printf "chi_clickhouse_metric_ReadonlyReplica{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value }} replicas{{ end }}
`clickhouse-server` has ReplicatedMergeTree tables that are currently in a readonly state due to re-initialization after a ZooKeeper session loss or due to startup without ZooKeeper configured.
Please check the following things:
- kubernetes nodes have enough free RAM and disk via `kubectl top node`
- status of clickhouse-server pods ```kubectl describe -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1)```
- connection between clickhouse-server pods and zookeeper ```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT * FROM system.zookeeper WHERE path='/' FORMAT Vertical"```
- connection between clickhouse-server pods via kubernetes services ```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT host_name, errors_count FROM system.clusters WHERE errors_count > 0 FORMAT PrettyCompactMonoBlock"```
- status of PersistentVolumeClaims for pods ```kubectl get pvc -n {{ $labels.exported_namespace }}```
Also read documentation:
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication/#recovery-after-failures
- alert: ClickHouseReplicasMaxAbsoluteDelay
expr: chi_clickhouse_metric_ReplicasMaxAbsoluteDelay > 300
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Replication Lag more 300s occurred"
description: |-
`clickhouse-server` has a replication lag of `chi_clickhouse_metric_ReplicasMaxAbsoluteDelay` = {{ with printf "chi_clickhouse_metric_ReplicasMaxAbsoluteDelay{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }} seconds{{ end }}.
When a replica lags too much, it can be skipped from Distributed SELECT queries without errors and you will get wrong query results.
Check system.replicas, system.replication_queue, free disk space, and the network connection between the clickhouse pod and zookeeper on the monitored clickhouse-server pods
Also read documentation:
- https://clickhouse.com/docs/en/operations/system-tables/replicas/
- https://clickhouse.com/docs/en/operations/system-tables/replication_queue/
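To locate the lagging tables, one possible query (a sketch, assuming `clickhouse-client` is available inside the pod):
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT database, table, is_readonly, absolute_delay, queue_size FROM system.replicas WHERE absolute_delay > 300 FORMAT Vertical"```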
- alert: ClickHouseTooManyConnections
expr: chi_clickhouse_metric_HTTPConnection + chi_clickhouse_metric_TCPConnection + chi_clickhouse_metric_MySQLConnection > 100
labels:
severity: warning
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Total connections > 100"
description: |-
`chi_clickhouse_metric_HTTPConnection{hostname='{{ .Labels.hostname }}'}` = {{ with printf "chi_clickhouse_metric_HTTPConnection{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value }} connections{{ end }}
`chi_clickhouse_metric_TCPConnection{hostname='{{ .Labels.hostname }}'}` = {{ with printf "chi_clickhouse_metric_TCPConnection{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value }} connections{{ end }}
`chi_clickhouse_metric_MySQLConnection{hostname='{{ .Labels.hostname }}'}` = {{ with printf "chi_clickhouse_metric_MySQLConnection{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value }} connections{{ end }}
`clickhouse-server` has many open connections.
ClickHouse is designed to run a moderate number of parallel SQL requests; not every HTTP/TCP(Native)/MySQL protocol connection means a running SQL request, but a large number of open connections can cause a sudden spike in SQL requests, resulting in performance degradation.
Also read documentation:
- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#max-concurrent-queries
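As a sketch (assuming `clickhouse-client` is available inside the pod), you can break down the current connection counts by protocol:
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT metric, value FROM system.metrics WHERE metric LIKE '%Connection%' FORMAT PrettyCompactMonoBlock"```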
- alert: ClickHouseTooManyRunningQueries
expr: ((chi_clickhouse_metric_Query - chi_clickhouse_metric_PendingAsyncInsert) or (chi_clickhouse_metric_Query)) > 80
labels:
severity: warning
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Too much running queries"
description: |-
`chi_clickhouse_metric_Query{hostname='{{ .Labels.hostname }}'}` = {{ with printf "chi_clickhouse_metric_Query{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.0f" }} running queries{{ end }}
`chi_clickhouse_metric_PendingAsyncInsert{hostname='{{ .Labels.hostname }}'}` = {{ with printf "chi_clickhouse_metric_PendingAsyncInsert{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.0f" }} async inserts{{ end }}
Please analyze your workload.
Each concurrent SELECT query uses memory for JOINs, uses CPU for running aggregation functions, and can read a lot of data from disk when scanning parts in partitions, utilizing disk I/O.
Each concurrent INSERT query allocates around 1MB per column of the inserted table and utilizes disk I/O.
Look at following documentation parts:
- https://clickhouse.com/docs/en/operations/settings/query-complexity/
- https://clickhouse.com/docs/en/operations/quotas/
- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#max-concurrent-queries
- https://clickhouse.com/docs/en/operations/system-tables/query_log/
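To see which queries are currently running, one possible query (a sketch, assuming `clickhouse-client` is available inside the pod):
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT query_id, user, elapsed, read_rows, memory_usage FROM system.processes ORDER BY elapsed DESC FORMAT Vertical" | less```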
- alert: ClickHouseSystemSettingsChanged
expr: delta(chi_clickhouse_metric_ChangedSettingsHash[5m]) != 0
labels:
severity: warning
annotations:
identifier: "{{ $labels.hostname }}"
summary: "`system.settings` changed"
description: |-
changed `chi_clickhouse_metric_ChangedSettingsHash{hostname='{{ .Labels.hostname }}'}` = {{ with printf "chi_clickhouse_metric_ChangedSettingsHash{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
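To see which settings differ from their defaults, one possible query (a sketch, assuming `clickhouse-client` is available inside the pod):
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT name, value FROM system.settings WHERE changed FORMAT PrettyCompactMonoBlock"```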
- alert: ClickHouseVersionChanged
expr: delta(chi_clickhouse_metric_VersionInteger[5m]) != 0
labels:
severity: warning
annotations:
identifier: "{{ $labels.hostname }}"
summary: "ClickHouse version changed"
description: |-
changed `chi_clickhouse_metric_VersionInteger{hostname='{{ .Labels.hostname }}'}` = {{ with printf "chi_clickhouse_metric_VersionInteger{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
- alert: ClickHouseZooKeeperHardwareExceptions
expr: increase(chi_clickhouse_event_ZooKeeperHardwareExceptions[1m]) > 0
labels:
severity: critical
annotations:
identifier: "{{ $labels.hostname }}"
summary: "ZooKeeperHardwareExceptions > 1"
description: |-
`increase(chi_clickhouse_event_ZooKeeperHardwareExceptions[1m])` = {{ with printf "increase(chi_clickhouse_event_ZooKeeperHardwareExceptions{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }} exceptions{{ end }}
`clickhouse-server` has unexpected network errors and similar issues in communication with ZooKeeper.
ClickHouse should reinitialize the ZooKeeper session in case of these errors.
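As a sketch (assuming `clickhouse-client` is available inside the pod), you can inspect the ZooKeeper-related error counters:
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT event, value FROM system.events WHERE event LIKE 'ZooKeeper%' FORMAT PrettyCompactMonoBlock"```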
- alert: ClickHouseZooKeeperSession
expr: chi_clickhouse_metric_ZooKeeperSession > 1
labels:
severity: critical
annotations:
identifier: "{{ $labels.hostname }}"
summary: "ZooKeeperSession > 1"
description: |-
`chi_clickhouse_metric_ZooKeeperSession` = {{ with printf "chi_clickhouse_metric_ZooKeeperSession{hostname='%s',exported_namespace='%s'}" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.0f" }} sessions{{ end }}
The number of sessions (connections) from `clickhouse-server` to `ZooKeeper` should be no more than one,
because using more than one connection to ZooKeeper may lead to bugs due to the lack of linearizability (stale reads)
that the ZooKeeper consistency model allows.
- alert: ClickHouseDiskUsage
expr: predict_linear(chi_clickhouse_metric_DiskFreeBytes[1d],86400) < 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}.{{ $labels.disk }}"
summary: "Disk free space enough less 24 hour"
description: |-
`{{ $labels.disk }}` data size: {{ with printf "chi_clickhouse_metric_DiskDataBytes{hostname='%s',exported_namespace='%s',disk='%s'}" .Labels.hostname .Labels.exported_namespace .Labels.disk | query }}{{ . | first | value | humanize1024 }}B {{ end }}
`{{ $labels.disk }}` disk free: {{ with printf "chi_clickhouse_metric_DiskFreeBytes{hostname='%s',exported_namespace='%s',disk='%s'}" .Labels.hostname .Labels.exported_namespace .Labels.disk | query }}{{ . | first | value | humanize1024 }}B {{ end }}
`{{ $labels.disk }}` disk size: {{ with printf "chi_clickhouse_metric_DiskTotalBytes{hostname='%s',exported_namespace='%s',disk='%s'}" .Labels.hostname .Labels.exported_namespace .Labels.disk | query }}{{ . | first | value | humanize1024 }}B {{ end }}
To avoid switching to read-only mode, please scale up storage.
Currently, k8s CSI supports resizing Persistent Volumes; alternatively, you can try adding another volume to the existing pod (this requires a pod restart)
please read documentation:
- https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/
- https://github.com/Altinity/clickhouse-operator/blob/master/docs/storage.md
- https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-multiple-volumes
- https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree-table-ttl
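If the StorageClass has `allowVolumeExpansion: true`, one possible way to request more space (a sketch; `<pvc-name>` and `<new-size>` are placeholders you must fill in):
```kubectl patch pvc <pvc-name> -n {{ $labels.exported_namespace }} --type merge -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'```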
# alerts below are not well tested because they can't be triggered in e2e tests
- alert: ClickHouseReplicatedPartChecksFailed
expr: increase(chi_clickhouse_event_ReplicatedPartChecksFailed[1m]) > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Increased ReplicatedPartCheckFailed"
description: |-
`increase(chi_clickhouse_event_ReplicatedPartChecksFailed[1m])` = {{ with printf "increase(chi_clickhouse_event_ReplicatedPartChecksFailed{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
`clickhouse-server` increased ReplicatedPartChecksFailed in the `system.events` table.
Please check logs on clickhouse-server pods ```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- cat /var/log/clickhouse-server/*.err.log | less```
- alert: ClickHouseReplicatedPartFailedFetches
expr: increase(chi_clickhouse_event_ReplicatedPartFailedFetches[1m]) > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Increased ReplicatedPartFailedFetches"
description: |-
`increase(chi_clickhouse_event_ReplicatedPartFailedFetches[1m])` = {{ with printf "increase(chi_clickhouse_event_ReplicatedPartFailedFetches{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
`clickhouse-server` increased ReplicatedPartFailedFetches in the `system.events` table.
It means the server failed to download a data part from a replica of a ReplicatedMergeTree table.
Please check the following things:
- connections between the clickhouse-server pod and its replicas (see the remote_servers section in /etc/clickhouse-server/)
- logs on clickhouse-server pods ```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- cat /var/log/clickhouse-server/*.err.log | less```
- alert: ClickHouseReplicatedDataLoss
expr: increase(chi_clickhouse_event_ReplicatedDataLoss[1m]) > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Increased ReplicatedDataLoss"
description: |-
`increase(chi_clickhouse_event_ReplicatedDataLoss[1m])` = {{ with printf "increase(chi_clickhouse_event_ReplicatedDataLoss{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
`clickhouse-server` increased ReplicatedDataLoss in the `system.events` table.
It means a data part that the server wanted doesn't exist on any replica (even on replicas that are offline right now).
Those data parts are definitely lost. This can happen with asynchronous replication (if quorum inserts were not enabled),
when the replica on which the data part was written failed and, after coming back online,
no longer contains that data part.
Please check logs on clickhouse-server pods ```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- cat /var/log/clickhouse-server/*.err.log | less```
- alert: ClickHouseStorageBufferErrorOnFlush
expr: increase(chi_clickhouse_event_StorageBufferErrorOnFlush[1m]) > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Increased StorageBufferErrorOnFlush"
description: |-
`increase(chi_clickhouse_event_StorageBufferErrorOnFlush[1m])` = {{ with printf "increase(chi_clickhouse_event_StorageBufferErrorOnFlush{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
`clickhouse-server` increased StorageBufferErrorOnFlush in the `system.events` table.
It means something went wrong when clickhouse-server tried to flush memory buffers to disk.
Please check the following things:
- free disk space and hardware failures
- logs on clickhouse-server pods ```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- cat /var/log/clickhouse-server/*.err.log | less```
- alert: ClickHouseDataAfterMergeDiffersFromReplica
expr: increase(chi_clickhouse_event_DataAfterMergeDiffersFromReplica[1m]) > 0
labels:
severity: warning
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Increased DataAfterMergeDiffersFromReplica"
description: |-
`increase(chi_clickhouse_event_DataAfterMergeDiffersFromReplica[1m])` = {{ with printf "increase(chi_clickhouse_event_DataAfterMergeDiffersFromReplica{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
`clickhouse-server` increased DataAfterMergeDiffersFromReplica in the `system.events` table.
It means data after a merge is not byte-identical to the data on other replicas.
There could be several reasons:
- Using newer version of compression library after server update.
- Using another compression method.
- Non-deterministic compression algorithm (highly unlikely).
- Non-deterministic merge algorithm due to logical error in code.
- Data corruption in memory due to bug in code.
- Data corruption in memory due to hardware issue.
- Manual modification of source data after server startup.
- Manual modification of checksums stored in ZooKeeper.
- Part format related settings like 'enable_mixed_granularity_parts' are different on different replicas.
The server will download the merged part from a replica to force a byte-identical result.
- alert: ClickHouseDistributedSyncInsertionTimeoutExceeded
expr: increase(chi_clickhouse_event_DistributedSyncInsertionTimeoutExceeded[1m]) > 0
labels:
severity: warning
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Increased DistributedSyncInsertionTimeoutExceeded"
description: |-
`increase(chi_clickhouse_event_DistributedSyncInsertionTimeoutExceeded[1m])` = {{ with printf "increase(chi_clickhouse_event_DistributedSyncInsertionTimeoutExceeded{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
`clickhouse-server` increased DistributedSyncInsertionTimeoutExceeded in the `system.events` table.
It means a synchronous distributed insert timed out after a successful distributed connection.
Please check the documentation https://clickhouse.com/docs/en/operations/settings/settings/#insert_distributed_sync
and check the connection between `{{ $labels.hostname }}` and all nodes in the shards from the remote_servers config section
- alert: ClickHouseFileDescriptorBufferReadOrWriteFailed
expr: increase(chi_clickhouse_event_ReadBufferFromFileDescriptorReadFailed[1m]) > 0 or increase(chi_clickhouse_event_WriteBufferFromFileDescriptorWriteFailed[1m]) > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Increased ReadBufferFromFileDescriptorReadFailed or WriteBufferFromFileDescriptorWriteFailed"
description: |-
`increase(chi_clickhouse_event_ReadBufferFromFileDescriptorReadFailed[1m])` = {{ with printf "increase(chi_clickhouse_event_ReadBufferFromFileDescriptorReadFailed{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
`increase(chi_clickhouse_event_WriteBufferFromFileDescriptorWriteFailed[1m])` = {{ with printf "increase(chi_clickhouse_event_WriteBufferFromFileDescriptorWriteFailed{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
`clickhouse-server` increased ReadBufferFromFileDescriptorReadFailed or WriteBufferFromFileDescriptorWriteFailed in the `system.events` table.
It means reads (read/pread) or writes (write/pwrite) to a file descriptor failed. This does not include sockets.
The system can't read from or write to some files.
Please check logs on clickhouse-server pods ```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- bash -c 'cat /var/log/clickhouse-server/*.err.log | grep -E "Cannot write to file|Cannot read from file"'```
- alert: ClickHouseSlowRead
expr: increase(chi_clickhouse_event_SlowRead[1m]) > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Increased SlowRead"
description: |-
`increase(chi_clickhouse_event_SlowRead[1m])` = {{ with printf "increase(chi_clickhouse_event_SlowRead{hostname='%s',exported_namespace='%s'}[1m])" .Labels.hostname .Labels.exported_namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
`clickhouse-server` increased SlowRead in the `system.events` table.
It means reads from files were slow. This indicates system overload. Thresholds are controlled by the settings from `SELECT * FROM system.settings WHERE name LIKE 'read_backoff_%'`.
The system will reduce the number of threads used for processing queries.
Check your disk utilization and hardware failures.
- alert: ClickHouseTooManyMutations
expr: chi_clickhouse_table_mutations > 100
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Too much incomplete system.mutations"
description: |-
`chi_clickhouse_table_mutations` = {{ with printf "chi_clickhouse_table_mutations{hostname='%s',exported_namespace='%s',database='%s',table='%s'}" .Labels.hostname .Labels.exported_namespace .Labels.database .Labels.table | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
`chi_clickhouse_table_mutations_parts_to_do` = {{ with printf "chi_clickhouse_table_mutations_parts_to_do{hostname='%s',exported_namespace='%s',database='%s',table='%s'}" .Labels.hostname .Labels.exported_namespace .Labels.database .Labels.table | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
`system.mutations` shows too many active mutations.
It means something is wrong with ALTER TABLE DELETE/UPDATE queries.
Please check mutation errors ```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT * FROM system.mutations WHERE is_done=0 FORMAT Vertical"```
Read about how to run KILL MUTATION
https://clickhouse.com/docs/en/sql-reference/statements/kill/#kill-mutation
- alert: ClickHouseDetachedParts
expr: chi_clickhouse_metric_DetachedParts > 0
labels:
severity: high
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Detached parts in system.detached_parts"
description: |-
`chi_clickhouse_metric_DetachedParts{hostname="{{ $labels.hostname }}",disk="{{ $labels.disk }}",reason="{{ $labels.reason }}"}` = {{ with printf "chi_clickhouse_metric_DetachedParts{hostname='%s',exported_namespace='%s',database='%s',table='%s',disk='%s',reason='%s'}" .Labels.hostname .Labels.exported_namespace .Labels.database .Labels.table .Labels.disk .Labels.reason | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
`system.detached_parts` shows detached parts.
Please check detached parts in the log ```kubectl logs -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -c clickhouse-pod --since=1h | grep -i detach```
Read about how to ATTACH / DROP detached parts:
https://clickhouse.com/docs/en/operations/system-tables/detached_parts/
https://kb.altinity.com/altinity-kb-useful-queries/detached-parts/
Legend for reason:
`detached_by_user` - part was detached via ALTER TABLE ... DETACH ... query.
`ignored` - part was found during ATTACH/MERGE together with other, bigger parts that cover the same blocks of data, i.e. it was already merged into something else.
`unexpected` - part was found in the local file system and modified less than 5 minutes ago, but not found in ZooKeeper
`clone` - old part was detached during clone replica (see `detach_old_local_parts_when_cloning_replica` system.settings)
`broken` - part was marked as broken during startup or merging (check for disk, memory, or network hardware failures); look in clickhouse-server.log for details
`noquorum` - part was detached because it was created during an insert into a Distributed table with quorum, but the quorum failed.
`covered-by-broken` - Broken part itself either already moved to detached or does not exist.
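To group the detached parts by table and reason, one possible query (a sketch, assuming `clickhouse-client` is available inside the pod):
```kubectl exec -n {{ $labels.exported_namespace }} pod/$(kubectl get pods -n {{ $labels.exported_namespace }} | grep $( echo {{ $labels.hostname }} | cut -d '.' -f 1) | cut -d " " -f 1) -- clickhouse-client -q "SELECT database, table, reason, count() AS parts FROM system.detached_parts GROUP BY database, table, reason FORMAT PrettyCompactMonoBlock"```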
- alert: ClickHouseBackgroundMessageBrokerSchedulePoolUtilizationHigh
expr: |
(chi_clickhouse_metric_BackgroundMessageBrokerSchedulePoolSize - chi_clickhouse_metric_BackgroundMessageBrokerSchedulePoolTask) < 1
for: 10m
labels:
severity: warning
team: ClickHouse
annotations:
identifier: "{{ $labels.hostname }}"
summary: "Background Message Broker Schedule pool utilised high"
description: |-
chi_clickhouse_metric_BackgroundMessageBrokerSchedulePoolTask = {{ with printf "chi_clickhouse_metric_BackgroundMessageBrokerSchedulePoolTask{tenant='%s',chi='%s',hostname='%s'}" .Labels.tenant .Labels.chi .Labels.hostname | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
chi_clickhouse_metric_BackgroundMessageBrokerSchedulePoolSize = {{ with printf "chi_clickhouse_metric_BackgroundMessageBrokerSchedulePoolSize{tenant='%s',chi='%s',hostname='%s'}" .Labels.tenant .Labels.chi .Labels.hostname | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
- https://kb.altinity.com/altinity-kb-integrations/altinity-kb-kafka/background_message_broker_schedule_pool_size/
- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings#background_message_broker_schedule_pool_size
- https://clickhouse.com/docs/en/operations/system-tables/metrics#backgroundmessagebrokerschedulepoolsize
This pool is used for tasks related to message streaming from Apache Kafka or other message brokers.
You need to increase `background_message_broker_schedule_pool_size` to fix the problem.
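As a sketch (assuming `clickhouse-client` access to the affected host), you can compare the pool size with the number of active tasks via `SELECT metric, value FROM system.metrics WHERE metric LIKE 'BackgroundMessageBrokerSchedulePool%'` before raising the setting.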