I'm using Cassandra 4.0.6 with Medusa 0.22.0. In production, `medusa verify` fails on differential backups due to a mismatch on some Summary.db files between the size and MD5 recorded in `manifest.json` and the actual size and MD5 of the S3 blob.
I investigated and found that the index summary was modified long after the SSTable was created, and the new version was uploaded to S3. This is normal Cassandra behavior, controlled by `index_summary_resize_interval`.
There was a Cassandra log entry about the index summary at almost the same timestamp:
INFO [IndexSummaryManager:1] 2024-09-11 07:19:34,078 IndexSummaryRedistribution.java:83 - Redistributing index summaries
The last differential backup has the correct size and MD5 fingerprint. I guess restore will work regardless, since the new summary is just a better version of the old one, but I didn't test it. Still, it's not OK for verify to fail on a good backup.
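For reference, the size/MD5 comparison that trips verify can be reproduced by hand against a local SSTable component. This is a minimal standalone sketch using a plain hex MD5; note that Medusa/S3 may encode digests differently (e.g. base64 or multipart ETags), so treat the encoding as an assumption:

```python
import hashlib


def file_md5_and_size(path):
    """Compute the MD5 hex digest and byte size of a file, streaming in 1 MiB chunks."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

Comparing the result for an on-disk `*-Summary.db` against the values stored for it in `manifest.json` shows exactly the mismatch that verify reports after a redistribution.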
Some ideas for fixing this:
1. Do not check MD5 and size for summary files, only their presence. This would be better than the current behavior, but any modification to the archived summary file would go undetected. It should be fairly easy to implement.
2. Update the `manifest.json` of all old differential backups: detect when a summary file is overwritten, and go through all manifests. I don't like this; any error could affect those backups.
3. On such verify errors, and only for summary files, Medusa could check the latest manifest and see whether the file is present there with matching MD5 and size; if so, it should not report an error. If the file is not present, it could walk backwards through manifests until it reaches one that has the file. Problems:
   - This will fail if the backups containing the new MD5 were deleted, but I think that's unlikely.
   - It could be argued that Medusa would then not restore exactly what was backed up, which could be a problem for certifications and audits. It's not a real problem in practice, since the summary is just a sampling of an index.
4. When a summary file is overwritten, save the previous variant. The MD5 could be added to the file name to identify it more easily, or the old variant could go into a separate folder under `data/`. This would require changes in verify, restore and delete. The main advantage is that we would not need to look into other manifests, and the backup would keep an exact copy of the files at the moment of the backup.
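The check in idea (3) could be sketched roughly like this. The manifest shape here — a dict mapping file path to `(md5, size)`, one dict per backup, newest first — is an assumption for illustration, not Medusa's actual data model:

```python
def summary_matches_a_manifest(path, blob_md5, blob_size, manifests):
    """Walk manifests from newest to oldest and check the first one that
    contains `path`: if its recorded MD5/size match the actual blob, the
    mismatch in an older manifest is just an index summary redistribution,
    not corruption. `manifests` is a list of {path: (md5, size)} dicts,
    newest backup first (hypothetical shape for this sketch).
    """
    for manifest in manifests:
        entry = manifest.get(path)
        if entry is not None:
            return entry == (blob_md5, blob_size)
    # No manifest mentions the file at all: the caller should still report it.
    return False
```

If this returns False, verify would report the error exactly as it does today; only mismatches explained by a newer backup would be suppressed.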
My preference would be (3) or (4): (4) is the ideal solution, while (3) could be good enough. I can try to implement one of these ideas. (1) is just a short-term fix.
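For (4), the bookkeeping might look like the sketch below: before the new summary variant overwrites the archived blob, the old blob is copied aside under a key that embeds its MD5. The key scheme and the `storage.copy` call are assumptions for illustration, not Medusa's actual storage API:

```python
def archived_key(original_key, old_md5):
    """Key under which a superseded summary variant would be kept, embedding
    its MD5 for easy identification (hypothetical naming scheme), e.g.
    .../nb-1-big-Summary.db -> .../nb-1-big-Summary.db.<md5>
    """
    return f"{original_key}.{old_md5}"


def preserve_old_summary(storage, key, old_md5):
    """Copy the current blob aside before a new summary variant overwrites it.
    `storage` is any object with a copy(src_key, dst_key) method (assumed
    interface for this sketch).
    """
    storage.copy(key, archived_key(key, old_md5))
```

Verify, restore and delete would then resolve a summary file by the MD5 in its manifest entry, falling back to the suffixed key when the plain one has been overwritten.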