Verification check fails due to index summary being rebuilt after the backup was taken #802

Open
serban21 opened this issue Sep 11, 2024 · 0 comments


I'm using Cassandra 4.0.6 with Medusa 0.22.0. In production, with differential backups, medusa verify fails because the size and MD5 recorded in manifest.json for some Summary.db files do not match the actual size and MD5 of the S3 blob.
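To make the failure concrete, here is a minimal sketch of the kind of comparison that trips. It is not Medusa's actual code: `ObjectInfo` and `find_mismatches` are hypothetical names, and it assumes each manifest entry and each stored blob can be reduced to a path, a size and an MD5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectInfo:
    # Hypothetical, simplified view of one file in a manifest or in the bucket.
    path: str
    size: int
    md5: str


def find_mismatches(manifest_objects, stored_objects):
    """Return the manifest paths that are missing or differ in size/MD5."""
    stored_by_path = {obj.path: obj for obj in stored_objects}
    mismatches = []
    for expected in manifest_objects:
        actual = stored_by_path.get(expected.path)
        if actual is None or (actual.size, actual.md5) != (expected.size, expected.md5):
            mismatches.append(expected.path)
    return mismatches
```

A Summary.db that Cassandra rewrote and Medusa re-uploaded after the manifest was written ends up in the returned list, even though every other file of the backup is intact.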

I investigated and found that the index summary was modified long after the SSTable was created:
Screenshot 2024-09-11 at 12 07 12

and the new version was uploaded to S3:
Screenshot 2024-09-11 at 15 55 41

This is normal Cassandra behavior, controlled by index_summary_resize_interval.

There was a Cassandra log entry about the index summary at almost the same timestamp:

INFO  [IndexSummaryManager:1] 2024-09-11 07:19:34,078 IndexSummaryRedistribution.java:83 - Redistributing index summaries

The latest differential backup has the correct size and MD5 fingerprint. I guess a restore will work regardless, since the new summary is just a better version of the old one, but I didn't test it. Still, it's not OK for verify to fail on a good backup.

Some ideas for fixing this:

  1. Do not check MD5 and size for the summary files, only their presence. This would be better than the current behavior and should be fairly easy to implement, but any modification made to the archived summary file would go undetected.
  2. Update the manifest.json of all old differential backups: detect when a summary file is overwritten and go through all manifests. I don't like this, since any error could affect those backups.
  3. When verify finds such an error, and only for summary files, Medusa could go to the latest manifest and check whether the file is present there with a matching MD5 and size; if so, it should not report an error. If the file is not present there, it could go backwards through the manifests until it reaches one that has the file. Problems:
    • This will fail if the backups containing the new MD5 were deleted, but I think that's unlikely.
    • It could be argued that Medusa would then not restore exactly what was backed up, which could be a problem for certifications and audits. I don't see it as a real problem, since the summary is just a sampling of an index.
  4. When a summary file is overwritten, save the previous variant. The MD5 could be added to the file name to make it easier to identify, or the file could be put in a separate folder in data/. This would require changes in verify, restore and delete. The main advantages are that we would not need to look into other manifests and that the backup would keep an exact copy of the files as they were at the moment of the backup.
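A rough sketch of how idea (3) could work, under simplifying assumptions: each manifest is modelled as a dict mapping a path to a (size, md5) tuple, manifests are ordered oldest first, and `stored` is what is currently in the bucket. The function names are hypothetical and none of this is Medusa's real API.

```python
def is_summary(path):
    # Assumption: index summary components end with "-Summary.db".
    return path.endswith("-Summary.db")


def matches_newer_manifest(newer_manifests, path, actual):
    """Walk newer manifests from latest to oldest; the first one that
    lists the file decides whether the stored blob is acceptable."""
    for manifest in reversed(newer_manifests):
        if path in manifest:
            return manifest[path] == actual
    return False


def verify_backups(manifests, stored):
    """Return (manifest_index, path) pairs that are genuine errors."""
    errors = []
    for i, manifest in enumerate(manifests):
        for path, expected in manifest.items():
            actual = stored.get(path)
            if actual == expected:
                continue
            # Only summary files get the fallback; a missing blob is always an error.
            if (actual is not None and is_summary(path)
                    and matches_newer_manifest(manifests[i + 1:], path, actual)):
                continue
            errors.append((i, path))
    return errors
```

With this shape, a rewritten summary is accepted for an old backup only while some newer manifest still records the new size and MD5, which is exactly the first problem listed under (3).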

My preference is (3) or (4). (4) is the ideal solution, while (3) could be good enough. I can try to implement one of the ideas. (1) is just a short-term solution.
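For (4), the "MD5 in the file name" variant could look roughly like the sketch below. The helper name and the key layout are assumptions for illustration only; the real storage layout and the matching changes to verify, restore and delete would still have to be worked out.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_key(key, content):
    """Build a key for the previous variant of a file by appending
    the MD5 of its content to the file name (hypothetical scheme)."""
    digest = hashlib.md5(content).hexdigest()
    path = PurePosixPath(key)
    return str(path.with_name(f"{path.name}.{digest}"))
```

Before overwriting a summary, the old blob would be copied to `versioned_key(key, old_content)`, so an older manifest can still be matched against a file with exactly the size and MD5 it recorded.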
