-
Notifications
You must be signed in to change notification settings - Fork 204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#342 Amazon S3's parallel SHA2-256 with fixed 5/8MB blocks #343
Conversation
Discussion active in #342 |
table.csv
Outdated
@@ -510,7 +510,7 @@ poseidon-bls12_381-a2-fc1-sc, multihash, 0xb402, draft, Pose | |||
rdfc-1, ipld, 0xb403, draft, The result of canonicalizing an input according to RDFC-1.0 and then expressing its hash value as a multihash value. | |||
ssz, serialization, 0xb501, draft, SimpleSerialize (SSZ) serialization | |||
ssz-sha2-256-bmt, multihash, 0xb502, draft, SSZ Merkle tree root using SHA2-256 as the hashing function and SSZ serialization for the block binary | |||
sha2-256-chunked, multihash, 0xb510, draft, Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks; n = source_size / (10000 * 8MiB) | |||
sha2-256-chunked, multihash, 0xb510, draft, Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks; n = log(source_size/(10^4 * 8MiB)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I suggest (for being extra precise):
ceil(log2(source_size / (10^4 * 8MiB)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it is floor, so yes it would be good to be more precise :0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did I get my math wrong? (It can well be, it's hard to believe how much time I've spent trying to get this right ;) So let me do the math here again:
Let's say my input is 100GiB, hence 100*1024^3 = 107374182400
bytes. This means:
log2((100*1024^3) / (10^4 * 8 * 1024^2))
= 0.356143810225
-
floor
would mean 0 =>8*2^0 = 8
MiB chunk size.8MiB * 10^4 = 8*1024^2 * 10^4 = 83886080000
-
ceil
would mean 1 =>8*2^1 = 16
MiB chunk size16MiB * 10^4 = 16*1024^2 * 10^4 = 167772160000
83886080000 < 107374182400 < 167772160000
So I would say it's ceil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not going to object but I'm not really a fan of this description, too complex and the fact that you're having trouble figuring it out here means that the casual observer is going to have an almost impossible time (think of the case of someone who has a new hash function they want to register that sounds the same but they need to work out if it is).
Also it doesn't work for sizes under threshold (10^4 * 8 MiB
) because it can go negative unless you're doing this in a strongly typed language. It really needs a branch for that case.
> n = (sz) => Math.ceil(Math.log2(sz/((10**4)*(8*1024*1024))))
n(10000)
-23
Yeah, a max(0, ...)
would fix it but yay more complexity!
Perhaps it could be fixed by speaking about the common (80000 MiB source) case and then talking how to scale beyond with the equation.
Don't want to drag this out though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just saw your comment after I merged it (as I saw your approval). In the squashed commit message i linked to the description document. I hope that's good enough.
I wonder if we should introduce yet another column with links to descriptions of the algorithm. But that would be a separate discussion/PR.
Yeah, my bad. Engineering just told me the same thing :'-( Pushed a new description. Will update the documentation. |
Thank you so much @rvagg ! |
🙏 |
Registering prefixes for Amazon S3's parallel SHA2-256 with fixed 5MB or 8MB blocks
(the ones they actually implement in their SDK, and for which we are building a compatible implementation.