Idea for deduplication #115

Open
madadam opened this issue Apr 28, 2023 · 1 comment

Comments

Collaborator

madadam commented Apr 28, 2023

We currently don't support deduplication because a block id is computed as the hash of the block's ciphertext, and the ciphertext is computed using a random nonce. If the same plaintext is written multiple times, it produces a different ciphertext each time and thus a different block id. It is done this way so that even blind replicas can validate blocks (calculate the hash of the ciphertext and compare it against the block id). We thought we could have only one or the other, but it turns out there may be a way to have both:

Instead of using a randomly generated nonce, we could use a nonce that is deterministically derived from the plaintext, say:

nonce = BLAKE3(write_key || plaintext)

Then the same plaintext would always produce the same ciphertext and thus the same block id (which would still be computed as the hash of the ciphertext), achieving both deduplication and blind block validation.
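To make the idea concrete, here is a minimal sketch of the difference between random and derived nonces. It rests on several assumptions that are not part of the proposal: keyed BLAKE2b from the Python standard library stands in for BLAKE3, `toy_encrypt` is a deliberately simplistic keystream cipher used only for illustration (not a secure cipher and not the one the project uses), and the names `derive_nonce`, `block_id`, `write_block` and the separate `enc_key` are hypothetical.

```python
import hashlib
import os

NONCE_LEN = 32  # assumed nonce size for this sketch


def derive_nonce(write_key: bytes, plaintext: bytes) -> bytes:
    # Stand-in for BLAKE3(write_key || plaintext): keyed BLAKE2b (stdlib).
    return hashlib.blake2b(plaintext, key=write_key, digest_size=NONCE_LEN).digest()


def toy_encrypt(enc_key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    # Toy keystream cipher for illustration only. NOT secure.
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(plaintext):
        keystream.extend(
            hashlib.blake2b(
                nonce + counter.to_bytes(8, "little"), key=enc_key, digest_size=64
            ).digest()
        )
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


def block_id(ciphertext: bytes) -> bytes:
    # Block id is the hash of the ciphertext, so a blind replica can check it.
    return hashlib.blake2b(ciphertext, digest_size=32).digest()


def write_block(write_key: bytes, enc_key: bytes, plaintext: bytes, deterministic: bool):
    # Choose between the current scheme (random nonce) and the proposed one.
    if deterministic:
        nonce = derive_nonce(write_key, plaintext)
    else:
        nonce = os.urandom(NONCE_LEN)
    ciphertext = toy_encrypt(enc_key, nonce, plaintext)
    return block_id(ciphertext), nonce, ciphertext
```

With `deterministic=True`, writing the same plaintext twice yields the same block id, while `deterministic=False` yields a fresh id on every write. In both cases the id can be recomputed from the ciphertext alone.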

Security concerns:

  • Why use a keyed hash with the write key when computing the nonce? This is to prevent a known-plaintext attack: say an adversary who has blind access to the repository suspects that the repo contains some known content. If we used only a simple hash (BLAKE3(plaintext)), the adversary could compute the hash of the known plaintext, compare it against the blocks, and prove that the repo does contain it.

  • This should be safe against nonce reuse because we only ever use the same nonce for the same plaintext. Nonce reuse is only a problem when the same nonce is used for different plaintexts (citation needed).

  • This would reveal a little information to adversaries with blind access: they would be able to tell that the repository contains duplicate blocks. It is unclear how much of a concern this is.

  • There might be other security considerations.
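The first concern above can be sketched as follows. This is an illustration under stated assumptions, not a description of the actual storage format: it assumes the nonce is visible to a blind replica alongside each block, uses BLAKE2b from the standard library as a stand-in for BLAKE3, and the names `unkeyed_nonce`, `keyed_nonce` and `adversary_confirms` are hypothetical.

```python
import hashlib


def unkeyed_nonce(plaintext: bytes) -> bytes:
    # Simple hash of the plaintext: anyone who knows the plaintext can compute it.
    return hashlib.blake2b(plaintext, digest_size=32).digest()


def keyed_nonce(write_key: bytes, plaintext: bytes) -> bytes:
    # Keyed hash: computing it requires the write key.
    return hashlib.blake2b(plaintext, key=write_key, digest_size=32).digest()


def adversary_confirms(visible_nonces: set, guess: bytes) -> bool:
    # The adversary has no keys and can only use the unkeyed hash of a guess.
    return unkeyed_nonce(guess) in visible_nonces


write_key = b"k" * 32  # hypothetical secret the adversary does not have
secret_doc = b"some well-known document"

# Repository whose nonces come from the simple hash: the guess is confirmed.
leaky_repo = {unkeyed_nonce(secret_doc)}
# Repository whose nonces come from the keyed hash: the guess cannot be checked.
keyed_repo = {keyed_nonce(write_key, secret_doc)}
```

Here `adversary_confirms(leaky_repo, secret_doc)` succeeds, while the same check against `keyed_repo` fails, because the adversary cannot reproduce the keyed hash without the write key.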

Collaborator Author

madadam commented Nov 1, 2023

We discussed this and decided not to do it for now, out of an abundance of caution (it's unclear how serious the security implications are).

For now we implemented a much weaker version of deduplication: two blocks get the same block_id only if they have the same plaintext and are at the same position in the same file/directory, but in different branches (even when they were created independently rather than by a merge). This does not reveal any more information about the repository, but it improves merge performance by guaranteeing that two branches with identical content have the same hash.
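One way such position-scoped behaviour could be modeled is to bind the nonce to the block's position as well as its plaintext, while leaving the branch out of the derivation. The sketch below is purely hypothetical and is not taken from the project's code: `scoped_nonce`, `file_id` and `block_index` are assumed names, and keyed BLAKE2b again stands in for whatever hash is actually used.

```python
import hashlib


def scoped_nonce(key: bytes, file_id: bytes, block_index: int, plaintext: bytes) -> bytes:
    # Hypothetical derivation: the nonce depends on the key, the position
    # (file id + block index) and the plaintext, but not on the branch.
    h = hashlib.blake2b(key=key, digest_size=32)
    h.update(len(file_id).to_bytes(8, "little"))  # length prefix avoids ambiguity
    h.update(file_id)
    h.update(block_index.to_bytes(8, "little"))
    h.update(plaintext)
    return h.digest()
```

Because the branch is not an input, two branches that independently write the same plaintext at the same position derive the same nonce. The same plaintext at a different position gets a different nonce, so duplicates elsewhere in the repository are not revealed.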

Leaving this open so we can revisit it in the future when we gain more understanding of the security implications.
