Data Restoration
The best kind of replication is one that requires no manual intervention, like ClickHouse's ReplicatedMergeTree. Very high-availability systems can be constructed with it to withstand single-node failures. Sooner or later, whole-cluster failures also have to be considered, and these call for some degree of manual data restoration.
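For context, a minimal sketch of what such a table might look like; the schema, ZooKeeper path and macros below are placeholders, not the real deployment.

```bash
# Hypothetical replicated table: replicas coordinate inserts and merges through
# the ZooKeeper/Keeper path on their own, with no manual intervention.
clickhouse-client --query "
    CREATE TABLE x (t DateTime, sensor_id UInt32, value Float64)
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/x', '{replica}')
    PARTITION BY toYYYYMMDD(t)
    ORDER BY (sensor_id, t)"
```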
At the scale of my time-series deployment (several million messages/sec), I have found it useful to keep a small restoration procedure separate from a big one, as described below.
A quick replacement of a small gap in the data, on the order of tens to hundreds of millions of rows (minutes of my time-series stream). Such a gap may be caused by an awkward node restart or a network glitch. Use ClickHouse Copier or simply a manual DELETE-SELECT-INSERT. My typical procedure is:
- `ALTER TABLE x DELETE WHERE t BETWEEN t0 AND t1`
- `SELECT * FROM x WHERE t BETWEEN t0 AND t1` and `INSERT INTO x`, with rsync or netcat for network transport between the two
- repeat for each table, rounding the interval `(t0, t1)` appropriately
The rounding is necessary for those of my tables that are time-resolution aggregates.
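A minimal sketch of the whole patch, assuming hypothetical host names, a DateTime column `t` and an interval already rounded to the table's resolution:

```bash
#!/usr/bin/env bash
# Hypothetical gap patch: hosts, table and interval are assumptions.
SRC=source-host DST=dest-host TABLE=x
T0="2024-01-01 12:00:00"     # (t0, t1) rounded to the table's time resolution
T1="2024-01-01 12:10:00"

# 1. clear the damaged interval on the destination (an asynchronous mutation;
#    parts inserted afterwards are not touched by it)
clickhouse-client --host "$DST" --query \
  "ALTER TABLE $TABLE DELETE WHERE t BETWEEN '$T0' AND '$T1'"

# 2. stream the rows back from the healthy cluster; Native format skips
#    re-parsing on the way in, though the data is still re-encoded and re-merged
clickhouse-client --host "$SRC" --query \
  "SELECT * FROM $TABLE WHERE t BETWEEN '$T0' AND '$T1' FORMAT Native" |
clickhouse-client --host "$DST" --query "INSERT INTO $TABLE FORMAT Native"

# 3. repeat for each affected table
```

The pipe in step 2 can equally be routed through ssh or netcat when the two clusters cannot reach each other directly.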
This is an easy (a bunch of SQL), flexible (the schema can be transformed in the SELECT statement as desired, and arbitrary resharding is possible) and safe (native-protocol command-line client) procedure. It is slowed down by having to re-encode data types and re-merge the inserted data; I might get a million rows/sec on my cluster with magnetic disk drives.
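As an illustration of that flexibility, a hedged sketch of reshaping rows in flight; the destination Distributed table `x_dist`, the `sensor` column and the type change are assumptions.

```bash
# Hypothetical in-flight transform: rename and retype a column in the SELECT,
# then let a Distributed table on the destination reshard by its sharding key.
clickhouse-client --host source-host --query "
    SELECT t, toUInt32(sensor) AS sensor_id, value
    FROM x WHERE t BETWEEN '2024-01-01' AND '2024-01-02'
    FORMAT TSV" |
clickhouse-client --host dest-host --query \
  "INSERT INTO x_dist (t, sensor_id, value) FORMAT TSV"
```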
Repair of major incidents lasting hours to days. Copy and attach entire merged parts. The procedure is (see script example):
- `SELECT path FROM system.parts WHERE table = 'x' AND partition BETWEEN t0 AND t1` to list parts in partitions overlapping the given interval
- copy those parts from their filesystem location to the destination cluster's table `detached` directory (I use netcat over TCP)
- `ALTER TABLE x DELETE WHERE t BETWEEN t0 AND t1`
- `ALTER TABLE x ATTACH PART` on each copied part
- repeat for each table
The interval is rounded to the largest partition size of interest.
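A minimal sketch of the part-copy route, assuming hypothetical hosts, the default data path under /var/lib/clickhouse, and tar over ssh in place of raw netcat; the partition bounds are already rounded to whole partitions.

```bash
#!/usr/bin/env bash
# Hypothetical part copy: hosts, table, data path and partition range are assumptions.
SRC=source-host DST=dest-host TABLE=x
P0="'2024-01-01'" P1="'2024-01-03'"   # rounded to whole partitions

# 1. list the active merged parts whose partition overlaps the interval
PARTS=$(clickhouse-client --host "$SRC" --query "
    SELECT path FROM system.parts
    WHERE table = '$TABLE' AND active AND partition BETWEEN $P0 AND $P1")

# 2. copy each part directory into the destination table's detached/ directory
#    (the path below assumes the default layout; the original uses netcat over TCP)
for p in $PARTS; do
    ssh "$SRC" tar -C "$(dirname "$p")" -cf - "$(basename "$p")" |
        ssh "$DST" tar -C "/var/lib/clickhouse/data/default/$TABLE/detached/" -xf -
done

# 3. clear the interval on the destination, then attach the copied parts
clickhouse-client --host "$DST" --query \
  "ALTER TABLE $TABLE DELETE WHERE t BETWEEN $P0 AND $P1"
for p in $PARTS; do
    clickhouse-client --host "$DST" --query \
      "ALTER TABLE $TABLE ATTACH PART '$(basename "$p")'"
done
```

Attaching moves each part out of `detached`, so nothing is processed twice; if the destination database uses the Atomic engine, the table's data directory may live under `store/` instead, so check the real path first.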
This procedure is faster than the SELECT-INSERT patching; I might get a couple of million rows/sec on my cluster. A decently formulated benchmark would be useful here, and I might add one in due course.
The main disadvantage is the potential for messing something up by working on the filesystem directly. The schemas of the source and destination tables must be identical, with no SQL tweaks possible. Finally, resharding is limited: many-to-one resharding is ruled out wherever part names would collide.