evaluate new ES-ILM backup / redundancy strategy #235

rahulbot · 2024-02-07T15:36:21Z

With the ILM ES index architecture, @philbudne raised a question about reconsidering our redundancy approach. We now know that we can restore 2-3 months from WARC files in ~2 days. What if we roll over via ILM to a new index every 2 months and immediately back up the rolled-over index off-site? Then, if we crash, restoration is 2-ish days of downloading indexes and recreating the latest (not yet backed up) index from WARC files. I think this is an acceptable downtime, and we can always add some kind of "hot" duplicate of the latest index later if we want. The task here is to consider how to design and implement this, whether it would really work, and whether it is a good idea.

Related to #157, #231, #54.

rahulbot · 2024-02-14T14:55:46Z

Notes from mtg: sounds doable via an API call (to test); perhaps use 90 days as a rough time limit; validate how easy/hard it is to restore an archived index; double-check the shard size spec; make sure changes for this don't require re-indexing.

thepsalmist · 2024-02-21T14:04:04Z

We can do the ILM policy update via the ILM put policy API: PUT _ilm/policy/<mc_ILM_policy_id>.
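As a rough sketch of what that call could look like, the snippet below builds (but does not send) the PUT request. The policy id `mc_ilm_policy`, the host URL, and the 50gb shard-size condition are placeholders, not values from our cluster; the 90d age limit is the rough figure from the meeting notes.

```python
import json
import urllib.request


def build_rollover_policy(max_age="90d", max_primary_shard_size="50gb"):
    """Return an ILM policy body with a single hot-phase rollover action.

    Both defaults are assumptions for illustration only.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": max_age,
                            "max_primary_shard_size": max_primary_shard_size,
                        }
                    }
                }
            }
        }
    }


def build_put_policy_request(base_url, policy_id, body):
    """Build the PUT _ilm/policy/<policy_id> request without sending it."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_ilm/policy/{policy_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host and policy id.
req = build_put_policy_request(
    "http://localhost:9200", "mc_ilm_policy", build_rollover_policy()
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually submit the policy.
```

Building the request separately from sending it makes it easy to review the policy body before it touches the cluster.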

When the policy is updated, the changes do not take effect on our current index mc_search-0001; they would only take effect once mc_search-0002 is created. So we'll have to let the current index roll over as per the current ILM rollover definitions, as documented here

As per this image, our current max shard size is 17.5GB, so we should anticipate rollover when we ingest about triple our current data (this should be sooner than the alternative rollover condition of 365 days).
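The "about triple" estimate can be checked with a quick calculation. The 50 GB rollover threshold used here is an assumption (it is not stated in this thread); substitute whatever the current policy actually specifies.

```python
def growth_until_rollover(current_shard_gb, rollover_threshold_gb):
    """How many times the current shard size fits into the rollover threshold."""
    return rollover_threshold_gb / current_shard_gb


# 17.5 GB is the figure quoted above; 50 GB is an assumed threshold.
factor = growth_until_rollover(17.5, 50.0)
print(f"{factor:.2f}x")  # prints 2.86x
```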

rahulbot · 2024-02-28T15:16:00Z

My proposal for setting up backups is:

after the first ILM rollover (early-March?) we take an image of that newly-frozen index and back it up off-site (S3?)
in March we work on automating that process so the next rollover automatically takes a binary snapshot of the index and backs it up off-site
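One possible shape for both steps, assuming we go through the Elasticsearch snapshot API with an S3 repository (which may need the S3 repository plugin depending on the ES version). The repository name, bucket, and snapshot name below are made up for illustration; the functions only build the requests so they can be inspected before anything is sent.

```python
import json


def register_s3_repo(repo, bucket):
    """Request that registers an S3 snapshot repository (PUT _snapshot/<repo>)."""
    body = {"type": "s3", "settings": {"bucket": bucket}}
    return ("PUT", f"/_snapshot/{repo}", body)


def snapshot_index(repo, snapshot, index):
    """Request that snapshots a single rolled-over index into the repository."""
    path = f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true"
    body = {"indices": index, "include_global_state": False}
    return ("PUT", path, body)


# Hypothetical repository, bucket, and snapshot names.
requests_to_send = [
    register_s3_repo("mc_offsite", "example-backup-bucket"),
    snapshot_index("mc_offsite", "mc_search-0001-snap", "mc_search-0001"),
]
for method, path, body in requests_to_send:
    print(method, path, json.dumps(body))
```

The manual first backup would run these two requests by hand; the automated version would issue the second one whenever ILM reports a completed rollover.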

Closing this and I'll set up different issues accordingly to capture these two tasks to be done at different times.

This supports an overall two-pronged strategy for catastrophic index failure recovery:

restore prior frozen indexes from off-site storage
restore "hot" index from WARC files (I guess it is OK if a WARC file overlaps a little with the latest "cold" index because we'll have URL-based deduplication working)
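The two prongs could be laid out as an ordered plan along these lines. This is only a sketch: it assumes the off-site copies are Elasticsearch snapshots restored via the snapshot restore API, the repository and snapshot names are hypothetical, and the WARC re-ingest step is just a placeholder for our existing indexing pipeline.

```python
def recovery_plan(repo, archived, hot_index):
    """Return ordered recovery steps: restore archived indexes, then rebuild the hot one.

    `archived` is a list of (snapshot_name, index_name) pairs.
    """
    steps = []
    for snapshot, index in archived:
        steps.append({
            "action": "restore_snapshot",
            "method": "POST",
            "path": f"/_snapshot/{repo}/{snapshot}/_restore",
            "body": {"indices": index, "include_global_state": False},
        })
    # Placeholder: the hot index is rebuilt by re-ingesting WARC files,
    # relying on URL-based deduplication to absorb any overlap.
    steps.append({"action": "reindex_from_warc", "index": hot_index})
    return steps


plan = recovery_plan(
    "mc_offsite",
    [("mc_search-0001-snap", "mc_search-0001")],
    "mc_search-0002",
)
for step in plan:
    print(step["action"], step.get("path", step.get("index")))
```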

rahulbot added this to the Production Beta 3 milestone Feb 7, 2024

rahulbot assigned thepsalmist Feb 7, 2024

thepsalmist mentioned this issue Feb 26, 2024

Fix/update ilm policy #250

Merged

rahulbot closed this as completed Feb 28, 2024

This was referenced Feb 28, 2024

manually back up first "warm" index after first ILM rollover #253

Closed

automate backing up of "warm" ES indexes created by ILM #254

Closed
