Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#342 Amazon S3's parallel SHA2-256 with fixed 5/8MB blocks #343

Merged
merged 6 commits into from
Feb 23, 2024

Conversation

drernie
Copy link
Contributor

@drernie drernie commented Feb 15, 2024

Registering prefixes for Amazon S3's parallel SHA2-256 with fixed 5MB or 8MB blocks
(the ones they actually implement in their SDK, and for which we are building a compatible implementation.

@rvagg
Copy link
Member

rvagg commented Feb 19, 2024

Discussion active in #342

table.csv Outdated
@@ -510,7 +510,7 @@ poseidon-bls12_381-a2-fc1-sc, multihash, 0xb402, draft, Pose
rdfc-1, ipld, 0xb403, draft, The result of canonicalizing an input according to RDFC-1.0 and then expressing its hash value as a multihash value.
ssz, serialization, 0xb501, draft, SimpleSerialize (SSZ) serialization
ssz-sha2-256-bmt, multihash, 0xb502, draft, SSZ Merkle tree root using SHA2-256 as the hashing function and SSZ serialization for the block binary
sha2-256-chunked, multihash, 0xb510, draft, Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks; n = source_size / (10000 * 8MiB)
sha2-256-chunked, multihash, 0xb510, draft, Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks; n = log(source_size/(10^4 * 8MiB))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I suggest (for being extra precise):

ceil(log2(source_size / (10^4 * 8MiB)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it is floor, so yes it would be good to be more precise :0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I get my math wrong? (It can well be, it's hard to believe how much time I've spent trying to get this right ;) So let me do the math here again:

Let's say my input is 100GiB, hence 100*1024^3 = 107374182400 bytes. This means:

log2((100*1024^3) / (10^4 * 8 * 1024^2))
= 0.356143810225
  • floor would mean 0 => 8*2^0 = 8 MiB chunk size. 8MiB * 10^4 = 8*1024^2 * 10^4 = 83886080000

  • ceil would mean 1 => 8*2^1 = 16 MiB chunk size 16MiB * 10^4 = 16*1024^2 * 10^4 = 167772160000

    83886080000 < 107374182400 < 167772160000

So I would say it's ceil.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not going to object but I'm not really a fan of this description, too complex and the fact that you're having trouble figuring it out here means that the casual observer is going to have an almost impossible time (think of the case of someone who has a new hash function they want to register that sounds the same but they need to work out if it is).

Also it doesn't work for sizes under threshold (10^4 * 8 MiB) because it can go negative unless you're doing this in a strongly typed language. It really needs a branch for that case.

> n = (sz) => Math.ceil(Math.log2(sz/((10**4)*(8*1024*1024))))
n(10000)
-23

Yeah, a max(0, ...) would fix it but yay more complexity!

Perhaps it could be fixed by speaking about the common (80000 MiB source) case and then talking how to scale beyond with the equation.

Don't want to drag this out though!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just saw your comment after I merged it (as I saw your approval). In the squashed commit message i linked to the description document. I hope that's good enough.

I wonder if we should introduce yet another column with links to descriptions of the algorithm. But that would be a separate discussion/PR.

@drernie
Copy link
Contributor Author

drernie commented Feb 22, 2024

Yeah, my bad. Engineering just told me the same thing :'-(

Pushed a new description. Will update the documentation.

@drernie
Copy link
Contributor Author

drernie commented Feb 23, 2024

Thank you so much @rvagg !
Anyone else we need to ping to merge this?

@vmx vmx merged commit 11bd20b into multiformats:master Feb 23, 2024
1 check passed
@drernie
Copy link
Contributor Author

drernie commented Feb 23, 2024

🙏

@vmx vmx mentioned this pull request Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants