I'm using Cassandra 4.0.6 with Medusa 0.22.0. In production, `medusa verify` fails on differential backups due to a mismatch on some Summary.db files between the size and MD5 recorded in `manifest.json` and the actual size and MD5 of the S3 blob.
I investigated and found that the index summary was modified long after the SSTable was created, and the new version was uploaded to S3. This is normal Cassandra behavior, controlled by `index_summary_resize_interval`.
There was a Cassandra log entry about the index summary at almost the same timestamp:
INFO [IndexSummaryManager:1] 2024-09-11 07:19:34,078 IndexSummaryRedistribution.java:83 - Redistributing index summaries
The last differential backup has the correct size and MD5 fingerprint. I guess restore will work regardless, since the new summary is just a better version of the old one, but I didn't test it. Still, it's not OK for verify to fail on a good backup.
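For reference, the size/MD5 comparison that trips verify can be reproduced by hand against a local SSTable component. This is a minimal standalone sketch using a plain hex MD5; note that Medusa/S3 may encode digests differently (e.g. base64 or multipart ETags), so treat the encoding as an assumption:

```python
import hashlib


def file_md5_and_size(path):
    """Compute the MD5 hex digest and byte size of a file, streaming in 1 MiB chunks."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

Comparing the result for an on-disk `*-Summary.db` against the values stored for it in `manifest.json` shows exactly the mismatch that verify reports after a redistribution.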
Some ideas for fixing this:
1. Do not check MD5 and size for summary files, only their presence. This would be better than the current behavior, but any modification to the archived summary file would go undetected. It should be fairly easy to implement.
2. Update the `manifest.json` of all old differential backups: detect when a summary file is overwritten, and go through all manifests. I don't like this; any error could affect those backups.
3. On such verify errors, and only for summary files, Medusa could check the latest manifest and see whether the file is present there with matching MD5 and size; if so, it should not report an error. If the file is not present, it could walk backwards through manifests until it reaches one that has the file. Problems:
   - This will fail if the backups containing the new MD5 were deleted, but I think that's unlikely.
   - It could be argued that Medusa would then not restore exactly what was backed up, which could be a problem for certifications and audits. It's not a real problem in practice, since the summary is just a sampling of an index.
4. When a summary file is overwritten, save the previous variant. The MD5 could be added to the file name to identify it more easily, or the old variant could go into a separate folder under `data/`. This would require changes in verify, restore and delete. The main advantage is that we would not need to look into other manifests, and the backup would keep an exact copy of the files at the moment of the backup.
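The check in idea (3) could be sketched roughly like this. The manifest shape here — a dict mapping file path to `(md5, size)`, one dict per backup, newest first — is an assumption for illustration, not Medusa's actual data model:

```python
def summary_matches_a_manifest(path, blob_md5, blob_size, manifests):
    """Walk manifests from newest to oldest and check the first one that
    contains `path`: if its recorded MD5/size match the actual blob, the
    mismatch in an older manifest is just an index summary redistribution,
    not corruption. `manifests` is a list of {path: (md5, size)} dicts,
    newest backup first (hypothetical shape for this sketch).
    """
    for manifest in manifests:
        entry = manifest.get(path)
        if entry is not None:
            return entry == (blob_md5, blob_size)
    # No manifest mentions the file at all: the caller should still report it.
    return False
```

If this returns False, verify would report the error exactly as it does today; only mismatches explained by a newer backup would be suppressed.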
My preference would be (3) or (4): (4) is the ideal solution, while (3) could be good enough. I can try to implement one of these ideas. (1) is just a short-term fix.
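For (4), the bookkeeping might look like the sketch below: before the new summary variant overwrites the archived blob, the old blob is copied aside under a key that embeds its MD5. The key scheme and the `storage.copy` call are assumptions for illustration, not Medusa's actual storage API:

```python
def archived_key(original_key, old_md5):
    """Key under which a superseded summary variant would be kept, embedding
    its MD5 for easy identification (hypothetical naming scheme), e.g.
    .../nb-1-big-Summary.db -> .../nb-1-big-Summary.db.<md5>
    """
    return f"{original_key}.{old_md5}"


def preserve_old_summary(storage, key, old_md5):
    """Copy the current blob aside before a new summary variant overwrites it.
    `storage` is any object with a copy(src_key, dst_key) method (assumed
    interface for this sketch).
    """
    storage.copy(key, archived_key(key, old_md5))
```

Verify, restore and delete would then resolve a summary file by the MD5 in its manifest entry, falling back to the suffixed key when the plain one has been overwritten.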