evaluate new ES-ILM backup / redundancy strategy #235

Closed
rahulbot opened this issue Feb 7, 2024 · 3 comments

@rahulbot

rahulbot commented Feb 7, 2024

With the ILM ES index architecture, @philbudne raised a question about reconsidering our redundancy approach. We now know that we can restore 2-3 months from WARC files in ~2 days. What if we roll over via ILM to a new index every 2 months, and immediately back up the rolled-over index off-site? Then if we crash, restoration is 2-ish days of downloading indexes and recreating the latest (not yet backed-up) index from WARC files. I think this is an acceptable downtime, and we can always later add some kind of "hot" duplicate of the latest index if we want. The task here is to consider how to design and implement this, whether it would really work, and whether it is a good idea.

Related to #157, #231, #54.

@rahulbot rahulbot added this to the Production Beta 3 milestone Feb 7, 2024
@rahulbot

Notes from meeting: sounds doable via an API call (to test); perhaps use 90 days as a rough time limit; validate how easy or hard it is to restore an archived index; double-check the shard size spec; make sure changes for this don't require re-indexing.

@thepsalmist

We can do the ILM policy update via the ILM put-policy API: PUT _ilm/policy/<mc_ILM_policy_id>.
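A minimal sketch of what that call could look like from Python, using only the standard library. The 90-day max_age comes from the meeting notes above; the 50gb shard-size condition, the base URL, and the helper names are assumptions for illustration, not our actual policy. Note that a PUT replaces the whole policy, so any other phases the real policy defines would need to be included in the body as well.

```python
import json
import urllib.request


def build_rollover_policy(max_age="90d", max_primary_shard_size="50gb"):
    """Return an ILM policy body whose hot phase rolls over on age or shard size.

    Both default values are assumptions for this sketch.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": max_age,
                            "max_primary_shard_size": max_primary_shard_size,
                        }
                    }
                }
            }
        }
    }


def build_put_policy_request(base_url, policy_id, policy):
    """Build (but do not send) the PUT _ilm/policy/<policy_id> request."""
    url = f"{base_url.rstrip('/')}/_ilm/policy/{policy_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(policy).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # "http://localhost:9200" and "mc_ILM_policy_id" are placeholders.
    req = build_put_policy_request(
        "http://localhost:9200", "mc_ILM_policy_id", build_rollover_policy()
    )
    print(req.get_method(), req.full_url)
    # To actually apply the policy: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the body before touching a live cluster.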

When the policy is updated, the changes won't take effect on our current index mc_search-0001; they would only take effect from mc_search-0002 onward. So we'll have to let the current index roll over under the current ILM rollover definitions, as documented here

As per this image, our current max shard size is 17.5 GB, so we should anticipate rollover once we have ingested about triple our current data (this should happen sooner than the alternate rollover condition of 365 days).
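A quick way to sanity-check the "about triple" estimate, assuming the size-based rollover condition is a 50 GB primary shard size (that threshold is an assumption here; the thread only gives the 17.5 GB figure):

```python
def growth_factor_to_rollover(current_shard_gb, threshold_gb):
    """Return how many times the current shard must grow to reach the threshold."""
    if current_shard_gb <= 0:
        raise ValueError("current_shard_gb must be positive")
    return threshold_gb / current_shard_gb


current_shard_gb = 17.5      # from the screenshot discussed above
assumed_threshold_gb = 50.0  # assumption: size-based rollover condition

factor = growth_factor_to_rollover(current_shard_gb, assumed_threshold_gb)
print(f"shard must grow about {factor:.2f}x before a size-based rollover")
```

If the real threshold differs, swapping in that value gives the corresponding multiple.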

[Image: Screenshot 2024-02-21 at 17 02 25]

@rahulbot

My proposal for setting up backups is:

  1. after the first ILM rollover (early-March?) we take an image of that newly-frozen index and back it up off-site (S3?)
  2. in March we work on automating that process so the next rollover automatically takes a binary snapshot of the index and backs it up off-site
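As a rough sketch of the two steps above, the Elasticsearch snapshot API could be driven along these lines. The repository name, bucket name, snapshot name, and base URL are all hypothetical, and an S3 repository assumes the cluster has S3 repository support and credentials configured; none of that has been decided in this thread.

```python
import json
import urllib.request


def _json_request(method, url, body):
    """Build (but do not send) a JSON request against the cluster."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def register_s3_repo_request(base_url, repo, bucket):
    """PUT _snapshot/<repo>: register an S3-backed snapshot repository."""
    body = {"type": "s3", "settings": {"bucket": bucket}}
    return _json_request("PUT", f"{base_url}/_snapshot/{repo}", body)


def snapshot_index_request(base_url, repo, snapshot, index):
    """PUT _snapshot/<repo>/<snapshot>: snapshot one rolled-over index."""
    body = {"indices": index, "include_global_state": False}
    url = f"{base_url}/_snapshot/{repo}/{snapshot}?wait_for_completion=true"
    return _json_request("PUT", url, body)


def restore_index_request(base_url, repo, snapshot, index):
    """POST _snapshot/<repo>/<snapshot>/_restore: restore that index."""
    body = {"indices": index, "include_global_state": False}
    url = f"{base_url}/_snapshot/{repo}/{snapshot}/_restore"
    return _json_request("POST", url, body)


if __name__ == "__main__":
    base = "http://localhost:9200"  # placeholder
    for req in (
        register_s3_repo_request(base, "offsite_repo", "example-backup-bucket"),
        snapshot_index_request(base, "offsite_repo", "snap-mc_search-0001", "mc_search-0001"),
        restore_index_request(base, "offsite_repo", "snap-mc_search-0001", "mc_search-0001"),
    ):
        print(req.get_method(), req.full_url)
        # To actually run a step: urllib.request.urlopen(req)
```

Including the restore request here is deliberate: the meeting notes ask us to validate how hard restoring an archived index is, so the backup and restore paths are worth testing together.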

Closing this and I'll set up different issues accordingly to capture these two tasks to be done at different times.

This supports an overall two-pronged strategy for recovering from catastrophic index failure:

  • restore prior frozen indexes from off-site storage
  • restore the "hot" index from WARC files (I guess it is OK if a WARC file overlaps a little with the latest "cold" index, because we'll have URL-based deduplication working)
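To illustrate why a small overlap between the restored "cold" index and the re-ingested WARC files should be harmless, here is a toy sketch of URL-based deduplication. It assumes the document ID is derived from a hash of the URL, which is one common way to do it; the function names and the hashing choice are hypothetical and may not match how our deduplication actually works.

```python
import hashlib


def doc_id_for_url(url):
    """Derive a stable document ID from the story URL (assumed scheme)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def merge_restored_and_reindexed(restored_docs, warc_docs):
    """Combine both sources, keeping the first document seen per URL-derived ID."""
    merged = {}
    for doc in list(restored_docs) + list(warc_docs):
        merged.setdefault(doc_id_for_url(doc["url"]), doc)
    return merged


if __name__ == "__main__":
    restored = [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]
    from_warc = [{"url": "https://example.com/b"}, {"url": "https://example.com/c"}]
    merged = merge_restored_and_reindexed(restored, from_warc)
    print(len(merged), "unique documents after merging")
```

Because the ID depends only on the URL, re-ingesting a story that already exists in a restored index maps to the same ID rather than creating a duplicate.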
