Feature request: Parity files so that backups can be healed #4

Joshfindit · 2022-02-16T20:33:12Z

I got here from https://unix.stackexchange.com/questions/136947/protecting-data-against-bit-rot/533728#533728 and think you could successfully add the functionality to chkbit-py, making it more powerful.

The text was updated successfully, but these errors were encountered:

laktak · 2022-02-16T21:37:55Z

I know of Pararchive files - but they are not small or lightweight (which is what this tool tries to be).

Couldn't you run that in parallel to chkbit? What would be gained by integration? AFAIK you would have to create and store one (several?) PAR2 files for each file in your original folder. I don't think the current .chkbit file would be suitable for that.

Joshfindit · 2022-02-17T14:58:42Z

Parity generation is heavy, as is the verification. Healing is light enough, however.

What would be gained by integration? Great question.

When the checksum functions or the way they are stored/referenced changes, it would not break the ability to heal the backup like it would for using an external script
If there is bad data in a backup and you’re using an external parity tool, then you have to read the data twice: once to find out which file(s) fail in chkbit-py, and then again for the parity tool to read the existing file(s) in order to gather the data for recovery

Tbh, I don’t think PAR2 files are actually the right choice for integration since they have issues with subfolders and small files. However, I think that using the same process to create parity data in python is the winner. .chkbit could be a folder instead, and store the parity data within it.

Defaults could be:

When generating the initial check: don’t create parity files unless requested
When validating the data: check for parity files and offer healing if available

laktak · 2022-02-20T18:23:43Z

I'm not yet sure if it would be practical from a resource point of view but it does sound interesting. Would you be willing to create a PR for this feature?

It's been a while since I used par2 files - do you know how much space they usually need (for example for a 3mb jpeg or a 50mb mp4)?

Joshfindit · 2022-02-20T19:00:46Z

I don’t currently program in Python, but I could take a shot at it using an existing reed-solomon library if there are any that align with the ideals of this repo.

but I do not have the skills required to determine which packages are viable, but a quick Google pointed me to reedsolo and unireedsolomon as possibilities

Joshfindit · 2022-02-20T19:09:17Z

As to the question of size: It’s based on a percentage of the size of the data to be recovered. PAR/PAR2 files have some overhead, but that overhead is not required.

Backblaze, for example, stores raw parity data on 3 drives out of 20 and end up with “five 9’s” of data durability. ZFS uses about 15%. Personally, I like to have about 30% of the original data in parity data, but most people are not as “wasteful”. Even 10% gives a lot of benefits.

dia3olik · 2022-03-13T18:31:23Z

This feature would be great and it could be invoked only optionally so everyone would be happy ;-)

I also think using a subfolder to store the parity data would be a great choice.

Using a folder named .chkbit which would remain hidden as mentioned would be perfect imho.

laktak · 2022-10-09T19:01:57Z

If there is interest in this feature I will accept a PR.

Please discuss your implementation with me or accept changes to integrate it and understand that it should

be completely optional
be "easy" to maintain in the future

varmint708 · 2025-02-04T18:19:05Z

Not sure if this is related to this thread or #17, but since that was closed, adding details here.

FYI, Just starting with golang so this is just overview i thought i would share in case any one wants to take a deeper look into this.

i was using kopia which is also in golang, it supports adding ECC reed solmon from 1% to 10% and here is the patch which added it: https://github.com/kopia/kopia/pull/2270/files . its been long time since it was added so hopefully tested and stable. it uses https://github.com/klauspost/reedsolomon as shown here: https://github.com/kopia/kopia/blob/master/repo/ecc/ecc_rs_crc.go but it shards things. so i researched a bit and found this library: https://github.com/maruel/rs which does something which might be more suitable to chkbit. below is the example of what it does and what the idea might look like if implemented in chkbit. the only problem is rs is not updated since quite some time and since being a starter i dont really know how to use reedsolomon instead of rs and if they both expose same interface. reedsolomon is actively maintained it seems so that is prefered but for example here i will show the idea using rs.

if you see the example here: https://github.com/maruel/rs/blob/main/example_test.go#L21-L23 it clearly shows how 2 byte ecc can be generated from data. so in atom or split form, there can a separate db called .chkbit-ecc or something similar which could hold ecc from say 1% to 10% with the same json format which is currently there, just replacing a=reedsolmon", h=ecc data, or something similar. the only problem i forsee is, how to calculate the ecc data simultaneously while the blake hash is being calculated so that you dont have to read the file twice, since that would be really long for slow media, anyways if anyone has insight would be good to know.

laktak added the pr-welcome label Oct 9, 2022

laktak added the enhancement label Jan 3, 2024

laktak mentioned this issue Aug 3, 2024

ECC algorithm feature request #17

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: Parity files so that backups can be healed #4

Feature request: Parity files so that backups can be healed #4

Joshfindit commented Feb 16, 2022

laktak commented Feb 16, 2022

Joshfindit commented Feb 17, 2022

laktak commented Feb 20, 2022

Joshfindit commented Feb 20, 2022

Joshfindit commented Feb 20, 2022

dia3olik commented Mar 13, 2022

laktak commented Oct 9, 2022

varmint708 commented Feb 4, 2025

Feature request: Parity files so that backups can be healed #4

Feature request: Parity files so that backups can be healed #4

Comments

Joshfindit commented Feb 16, 2022

laktak commented Feb 16, 2022

Joshfindit commented Feb 17, 2022

laktak commented Feb 20, 2022

Joshfindit commented Feb 20, 2022

Joshfindit commented Feb 20, 2022

dia3olik commented Mar 13, 2022

laktak commented Oct 9, 2022

varmint708 commented Feb 4, 2025