-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chacha20
: Process 4 blocks at a time in AVX2 backend
#267
Conversation
It is now agnostic of the number of blocks processed, which is now a constant.
unsafe fn shuffle_epi32<const MASK: i32>(&mut self) { | ||
self.avx = [ | ||
_mm256_shuffle_epi32(self.avx[0], MASK), | ||
_mm256_shuffle_epi32(self.avx[1], MASK), | ||
]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bleh, it turns out _mm256_shuffle_epi32
is a pseudo-function generated such that the second argument is required to be const
, but there's no way to pass a const
as a function argument in our own code. This approach therefore requires const generics, which requires a MSRV of 1.51.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I ended up using const generics like that here:
https://github.com/RustCrypto/universal-hashes/blob/master/polyval/src/backend/pmull.rs#L99-L106
Personally I'd be fine with bumping MSRV to 1.51. It would also let us do a zeroize
1.4 bump.
unsafe fn rol<const BY: i32, const REST: i32>(&mut self) { | ||
self.avx = [ | ||
_mm256_xor_si256( | ||
_mm256_slli_epi32(self.avx[0], BY), | ||
_mm256_srli_epi32(self.avx[0], REST), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here for _mm256_slli_epi32
and _mm256_srli_epi32
.
I originally tried interleaving the AVX2 operations for pairs of blocks with a macro, but the macro was getting rather complex, so I switched to the approach in this PR (adding methods to |
I'd say it's fine to bump MSRV. 1.51 is now 6 months old, which is plenty of time IMO. |
3906f6d
to
8a47574
Compare
For a 4-block buffer, we need to be able to represent the past-the-end buffer position of 256, which is too large for a `u8`.
8a47574
to
85d81c0
Compare
This PR eliminates the performance difference between The remaining issue is that for the |
Awesome! 🎉
Wow, that's really annoying. I was curious what Relevant PR: rust-random/rand#931 cc @dhardy Edit: never mind, it still uses |
It uses a |
When the non-soft backend is being used, its 4-block buffer size results in a `BlockRngCore::Results` type of `[u32; 64]` which doesn't implement `Default`. We replace it with a wrapper type on which we implement the necessary traits.
Tests are now passing. Ran benchmarks on my machine (i7-8700K overclocked to 4.8 GHz):
Current master (0.7.3):
This PR:
|
Looks like you've answered your question, but you'd better ask @kazcw. |
We switch to a 4-block buffer for the combined SSE2 / AVX2 backend, which allows the AVX2 backend to process them together, while the SSE2 backend continues to process one block at a time.
The AVX2 backend is refactored to enable interleaving the instructions per pair of blocks, for better ILP.
Closes #262.