Idea for deduplication #115

madadam · 2023-04-28T14:34:50Z

We currently don't support deduplication because block ids are computed as hashes of the block ciphertext, which is itself computed using a random nonce. So if the same plaintext is written multiple times, it produces a different ciphertext each time and thus a different block id. It is done this way so that even blind replicas can validate blocks (calculate the hash of the ciphertext and compare it against the block id). We thought that we could have one or the other, but it turns out there may be a way to have both:

Instead of using a randomly generated nonce, we could use a nonce that is deterministically derived from the plaintext, say:

nonce = BLAKE3(write_key || plaintext)

Then the same plaintext would produce the same ciphertext and thus the same block id (which would still be computed as the hash of the ciphertext). This would achieve both deduplication and blind block validation.
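A minimal sketch of the idea, to make the data flow concrete. Everything here is an assumption made for illustration: Python's standard library has no BLAKE3, so keyed BLAKE2b stands in for BLAKE3(write_key || plaintext); `toy_encrypt` is a throwaway XOR keystream and not the real block cipher; the nonce size, the function names, and the use of a single key for both nonce derivation and encryption are all hypothetical.

```python
import hashlib

NONCE_SIZE = 24  # hypothetical nonce length in bytes


def derive_nonce(write_key: bytes, plaintext: bytes) -> bytes:
    # Stand-in for BLAKE3(write_key || plaintext): a keyed BLAKE2b hash.
    return hashlib.blake2b(plaintext, key=write_key, digest_size=NONCE_SIZE).digest()


def toy_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    # Toy stream cipher for illustration only: XOR the plaintext with a
    # SHAKE-256 keystream derived from the key and the nonce.
    keystream = hashlib.shake_256(key + nonce).digest(len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


def block_id(ciphertext: bytes) -> bytes:
    # The block id is still just a hash of the ciphertext, so a blind
    # replica can recompute it without holding any key.
    return hashlib.blake2b(ciphertext, digest_size=32).digest()


def write_block(write_key: bytes, plaintext: bytes):
    nonce = derive_nonce(write_key, plaintext)
    ciphertext = toy_encrypt(write_key, nonce, plaintext)
    return block_id(ciphertext), nonce, ciphertext
```

With this scheme, writing the same plaintext twice under the same key yields the same nonce, the same ciphertext, and therefore the same block id, while a blind replica can still check `block_id(ciphertext)` against the id it was given.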

Security concerns:

Why use a keyed hash with the write key when computing the nonce? This is to prevent a known-plaintext attack: say an adversary who has blind access to the repository suspects the repo might contain some known content. If we used only a simple hash (BLAKE3(plaintext)), the adversary could compute the hash of the known plaintext, compare it against the blocks, and prove that the repo does contain it.
This should be safe against nonce reuse because we only ever use the same nonce for the same plaintext. Nonce reuse is only a problem when the same nonce is used for different plaintexts (citation needed).
This would reveal a little bit of information to adversaries who have blind access, in that they would be able to tell that the repository contains duplicate blocks. It is unclear how much of a concern this is.
There might be other security considerations.
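The first and third points above can be illustrated with a small sketch. It rests on assumptions not stated in this issue: that the nonce is stored in the clear next to each block (so a blind replica can see it), that keyed BLAKE2b is an acceptable stand-in for BLAKE3, and that the helper names and nonce size are hypothetical.

```python
import hashlib
from collections import Counter

NONCE_SIZE = 24  # hypothetical nonce length in bytes


def unkeyed_nonce(plaintext: bytes) -> bytes:
    # The unsafe variant: anyone can recompute this from a guessed plaintext.
    return hashlib.blake2b(plaintext, digest_size=NONCE_SIZE).digest()


def keyed_nonce(write_key: bytes, plaintext: bytes) -> bytes:
    # The proposed variant: recomputing it requires the write key.
    return hashlib.blake2b(plaintext, key=write_key, digest_size=NONCE_SIZE).digest()


def adversary_confirms(visible_nonces: set, guess: bytes) -> bool:
    # A blind adversary has no key, so all they can do is hash their guess
    # without one and look for a match among the nonces they can see.
    return unkeyed_nonce(guess) in visible_nonces


def count_duplicates(block_ids: list) -> int:
    # What remains visible even with the keyed nonce: repeated block ids.
    return sum(n - 1 for n in Counter(block_ids).values())


write_key = bytes(range(32))
known = b"some widely distributed document"

# Unkeyed nonces let the adversary confirm the guess ...
print(adversary_confirms({unkeyed_nonce(known)}, known))
# ... keyed nonces do not, because the adversary cannot reproduce them.
print(adversary_confirms({keyed_nonce(write_key, known)}, known))
```

The keyed variant closes the confirmation attack, but `count_duplicates` shows the leak that is left: a blind replica can still count how many blocks repeat, without learning what they contain.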

madadam · 2023-11-01T15:23:32Z

We discussed this and decided not to do it for now, out of an abundance of caution (it's unclear how serious the security implications are).

For now we have implemented a much weaker version of deduplication: two blocks get the same block_id only if they have the same plaintext and sit at the same position in the same file/directory, but in different branches (even when they were created independently rather than by a merge). This does not reveal any more information about the repository, but it improves merge performance by guaranteeing that two branches with identical content have the same hash.
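One way such position-bound behaviour could be obtained is to mix the block's location into the nonce derivation. This is only a guess at a possible mechanism, not a description of the actual implementation: `blob_id`, `block_index`, the length-prefixed encoding, and the use of keyed BLAKE2b in place of BLAKE3 are all hypothetical.

```python
import hashlib

NONCE_SIZE = 24  # hypothetical nonce length in bytes


def derive_nonce(write_key: bytes, blob_id: bytes, block_index: int,
                 plaintext: bytes) -> bytes:
    # Bind the nonce to the key, the file/directory (blob_id), the position
    # within it (block_index), and the plaintext itself.
    h = hashlib.blake2b(key=write_key, digest_size=NONCE_SIZE)
    # Length-prefix the variable-length field so that different
    # (blob_id, block_index, plaintext) triples cannot encode identically.
    h.update(len(blob_id).to_bytes(8, "big"))
    h.update(blob_id)
    h.update(block_index.to_bytes(8, "big"))
    h.update(plaintext)
    return h.digest()
```

Under this sketch, two branches that independently write the same plaintext at the same position of the same file derive the same nonce, and hence the same ciphertext and block_id. The same plaintext at any other position or in any other file gets an unrelated nonce, so no repository-wide duplicates become visible.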

Leaving this open so we can revisit it in the future when we gain more understanding of the security implications.
