NT-long OpenCL: Support lengths up to 125 #5246
Conversation
I guess you mean with mask mode only. Transfers from host probably do become slower, but then the non-mask speeds are such that the attack is better run with the CPU format. However, there's also the case of a small mask (e.g., one or two mask characters added to host-provided base words), where using OpenCL does help, yet transfers from host still affect the overall speed.
Force-pushed from 65b658c to edcd16e.
Right. We do have "compressed" buffer transfer though, so transfers only become slower when there are (many) actually long words in there. BTW, I see this format doesn't utilize early partial transfer of the key buffer; I should try adding it as well.
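For context, "early partial transfer" generally means overlapping host-to-device copies with key generation instead of sending the whole key buffer at once in crypt_all(). A minimal host-side sketch of that idea, with purely illustrative names (key_buf, cl_key_buf, CHUNK) rather than the format's actual code:

#include <CL/cl.h>

#define CHUNK (1 << 20)	/* bytes per partial transfer, illustrative */

/* Call whenever another chunk of the key buffer has been filled by set_key();
   enqueues a non-blocking write so the copy overlaps further host work.
   Completion must be ensured (clFinish() or an event wait) before the kernel runs. */
static void maybe_transfer_chunk(cl_command_queue queue, cl_mem cl_key_buf,
                                 const char *key_buf, size_t filled,
                                 size_t *transferred)
{
	while (filled - *transferred >= CHUNK) {
		clEnqueueWriteBuffer(queue, cl_key_buf, CL_FALSE, *transferred,
		                     CHUNK, key_buf + *transferred, 0, NULL, NULL);
		*transferred += CHUNK;
	}
}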
Force-pushed from 64fd4b0 to e2126c7.
At 1377755 we're back to a single format, and single-block speeds seem to have (best of ~10 runs each on a 2080ti) less than 1% penalty. I will investigate the "binary size 8" idea further, but I don't really expect it to fly, so we might want to merge at this point (maybe we want to anyway). Oh, and I'll also experiment with partial key buffer transfer, but that too is somewhat bug prone, so it would be separate anyway and may come later.
The patch is clean, small, and well organized. Merging it as is (to test and mature new ideas) is a very good option.
Assuming this passes tests, it looks good to me and can be merged.
I think we have unnecessarily too much code divergence here:

#if PLAINTEXT_LENGTH > 27
	if (md4_size > 27)
		nt_crypt_long(hash, nt_buffer, md4_size);
	else
#endif
		nt_crypt(hash, nt_buffer, md4_size);

I arrive at the 3x+ figure as follows: the nt_crypt_long path has to contain the full single-block processing plus the extra-block handling (2x+), and on top of that we separately carry nt_crypt (1x), so in total we have 3x+ of a single block's worth of code where a shared implementation would need only a little over 2x. In other words, the high-level separation into nt_crypt and nt_crypt_long is what causes the duplication (and, when candidate lengths are mixed within a work-group, divergent execution). If addressing this, please keep the commits so far as-is, but add a commit addressing this concern. I didn't look into how exactly it can be addressed; I just suspect that it can be by implementing the length check inside the shared crypt code. OTOH, with usage of on-device mask such divergence should be very rare, and when on-device mask is not used the bottleneck is in the communication with host and in the host code. So the overall difference between these approaches is probably tiny (much smaller than 3 vs. 2).
Unfortunately, it seems to get a significant performance hit on gfx900 [Radeon RX Vega] (super's AMD-APP 2766.4). It went from 20467M to only 13930M. Limiting the length from 125 to 59 didn't help much (14163M). I can't really see why it happens, but it's a blocker, right? Perhaps we'll have to go back to a separate nt-long format. It would be nice to know performance with current AMD drivers though. Maybe we could detect devices before the host code decides on max. length, but that would be very confusing for users.
That sounds plausible. I'm not quite sure how to address it though. Need to think.
OK I think I see how to do it. Let's always use the original code for the first block. Right before we stop and skip steps, we put the branch. So we either continue with one or more extra blocks (using normal MD4 macros) or we do the single-block special magic in the end. We'll lose a tiny bit of performance for single-block crypts as we can no longer assume last word is zero. But adding branches for that is likely worse. |
Yes, it's a pretty bad performance regression. In case it's caused by code size increase, my suggestion to share more code between the paths could help there. Having less code total could also help the compiler reduce register pressure.
What about using the original code for the last block instead?
That will instead switch from using INIT_A..INIT_D in the first steps (and even optimized at that, but I suspect the compiler would do it for us anyway) to reading them from the buffer in the first steps.
Perhaps we can do a stranger thing: start with the first half of the original MD4, then a branch for possibly more blocks "in the middle of it". But it would be hard to follow and perhaps bug prone.
Is there any way to hint the compiler to keep code specific to longer passwords separately?
I never heard of that. I suspect some compilers might have it, but others will fail because it's not in the standard AFAIK.
Can we just try __builtin_expect and see whether/where it works, and then possibly use it conditionally?
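For illustration, "use it conditionally" could look like the usual guarded macro (a sketch of the general pattern, not the patch's actual code; the real kernel would need its own feature test):

#include <stdio.h>

#ifdef __GNUC__
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)	/* graceful fallback where the builtin is unavailable */
#endif

int needs_long_path(int md4_size)
{
	/* hint that long candidates (more than one MD4 block) are rare */
	if (unlikely(md4_size > 27))
		return 1;
	return 0;
}

int main(void)
{
	printf("%d %d\n", needs_long_path(10), needs_long_path(60));	/* prints 0 1 */
	return 0;
}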
I implemented it like you said, with original code calculating the last or only block. Right now I need three branches in there.
Interestingly enough, it works on Mac (all devices), and on Linux both with AMD and nvidia. I'm not seeing a lot of difference though, but a little: AMD now does 14786M with full length support of 125. Edit: POCL also supports it. Given that virtually all drivers use llvm under the hood AFAIK, there are probably not many platforms that don't!
Force-pushed from e45ad09 to 9bc21a4.
Strangely enough, that very change ruined performance on Apple with AMD Radeon Pro Vega 20, from 6G c/s to 2G c/s. But that platform is so flaky we shouldn't care much about it - Linux/AMD is more important.
Force-pushed from 7fa91bc to df2d50a.
It's a pity we lose optimizations in the first few steps, such as the 0x77777777 constant. I think we can accept a little longer divergence, keeping the first few steps specialized (based on whether a/b/c/d are the constants or not) in the if/else above. We should, as our first priority, optimize for the <=27 case, and reduce divergence as a second priority.
Yes, I already have versions that do exactly that. Still too much of a hit on AMD. I'm currently trying to reverse a tad more of the MD4 in order to compensate, but I fear I'll need to go the 64-bit-binary route (which, once it's all finished and bug free, is a very good thing anyway of course) to get it really good.
No, that's not possible.
This would make it possible though.
The bitmap size(s) selection logic will handle it just the same (maxes out at a 512M bitmap, i.e. 64 MB). I'm thinking we'll actually disable bitmaps completely above whatever threshold where we get too many false positives from them - from that point they only hurt performance (and suck memory). Theoretically I thought this would already happen at e.g. 32M loaded hashes: at that point, with the max. of 512M bitmap, we have an "effective mask" of 5 bits, unless I calculate it wrongly? But it seems that even just a couple (literally) of "bits" does help - I need to investigate more.
Now just a single branch at the end of the function.
This is well tested code in other formats. About 10% boost on 2080ti, against 5300 hashes and pure wordlist, no mask.
Found when implementing a 64-bit version of it. The impact would be suboptimal bitmap parameters when adjusting bitmap as the number of remaining hashes decreases.
Force-pushed from 9267abb to 3e57fb0.
OK, 32M hashes doesn't even get the max, it gets "only" a 256M bitmap - but it's probably too small. According to my debug code's calculations it's effectively 3 bits (0x7). The empirical results confirm that this figure is in the correct ballpark, with an actual false-positive rate of 1/9 in a run (the debug kernel actually counts them).
I'll sort out the exact figures and limits later (doesn't really matter for this PR), but anyway I'm pretty sure that at some point we're much better off not using the bitmap at all - and definitely so at 320M loaded hashes... EDIT: sorry, the above was from the experimental code with new algorithms; I re-ran with bleeding-jumbo, just with the debug code added.
The selections were the same though, and the speed - but bleeding uses more GPU memory for full binaries.
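As a back-of-the-envelope check of the "3 effective bits" figure (my arithmetic, not from the PR): with roughly one bit set per hash, the chance that a random candidate hits a set bit is about hashes/bits.

#include <stdio.h>
#include <math.h>

int main(void)
{
	double bitmap_bits = 256.0 * 1024 * 1024;	/* "256M bitmap" = 256 Mbit */
	double hashes = 32.0 * 1024 * 1024;		/* 32M loaded hashes */

	/* ratio 8 -> about 3 "effective" bits and ~1/8 expected false positives,
	   in the same ballpark as the observed ~1/9 */
	printf("effective bits ~ %.1f, expected false positives ~ 1/%.0f\n",
	       log2(bitmap_bits / hashes), bitmap_bits / hashes);
	return 0;
}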
Force-pushed from 3e57fb0 to 45d6a5c.
Actually the code (even the current in bleeding) has it wrong - if the bitmap size really goes to the hardcoded limit of 512M, the kernel will fail. The actual number of hashes you and/or Sayantan tested at some point appears to be 320294464; it's even in there.
...but if that worked, the limit must have been different at the time: we can only go to 256M. I think I will leave this as-is for now - like I said, I'm planning more changes soon (in much smaller portions than this PR) and they will get it right - plus possibly disable bitmaps at some point. EDIT: the hardcoded limit is a 512 MB bitmap in bytes, which means 4 Gbits (hence overflowing int32). But what I wrote above is basically correct - 320294464 hashes would end up oversized and the current kernel would bug out.
So the correct kernel limit is a 256 MB bitmap, which is 2 Gbits and a mask of 0x7fffffff. Besides, I doubt a bitmap that large will be more effective than just looking up the two integers, but I will test that. After all, such a lookup includes modulus operations that might be expensive.
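The overflow being described works out as follows (my illustration of the arithmetic above, not code from the PR):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t bits_512MB = (512ull << 20) * 8;	/* 4294967296 = 2^32 bits */
	uint64_t bits_256MB = (256ull << 20) * 8;	/* 2147483648 = 2^31 bits */

	/* 512 MB worth of bits has indices beyond 0x7fffffff, so it overflows a
	   32-bit signed int and the 0x7fffffff mask; 256 MB is the largest size
	   whose highest bit index still fits. */
	printf("512 MB = %llu bits\n", (unsigned long long)bits_512MB);
	printf("256 MB = %llu bits, max index 0x%llx\n",
	       (unsigned long long)bits_256MB,
	       (unsigned long long)(bits_256MB - 1));
	return 0;
}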
Maybe add a comment on this.
A way to speed them up would be to precompute (at runtime, but out of loop - perhaps on host) a multiplicative reciprocal, then compute the remainder via division-optimized-into-multiplication, multiplication, and subtraction. Here's an example by @ch3root:

unsigned mod3(unsigned x) { return x - (x * 2863311531ull >> 33) * 3; }
Yeah, they don't need to be fast at all.
That is interesting, provided I can wrap my head around it. On the other hand, my tests with massive numbers of loaded hashes seem to indicate we actually do pretty fine with the bitmaps, contrary to what I thought. Especially with some tweaks I have in mind. More on that later. I seem to be limited by host RAM more than anything else right now (and we already mentioned ways to tackle that).
Multiplication by the large constant turns the integer number into fixed-point and at the same time multiplies it by the desired fraction. The constant is the fraction of 1.0 in that fixed-point notation. The right shift converts the fixed-point number back to integer (drops the fractional part).
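A worked instance of that explanation (my numbers, not from the comment): 2863311531 is ceil(2^33 / 3), so the multiply-and-shift yields floor(x/3).

#include <stdio.h>

/* x = 10:  10 * 2863311531 = 28633115310
            28633115310 >> 33 = 3           (= floor(10/3))
            10 - 3 * 3 = 1                  (= 10 % 3) */
unsigned mod3(unsigned x) { return x - (x * 2863311531ull >> 33) * 3; }

int main(void)
{
	printf("%u\n", mod3(10));	/* prints 1 */
	return 0;
}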
This can also be done without subtraction:

unsigned mod3(unsigned x) { return (((x * 2863311531ull) & ((1ull << 33) - 1)) * 3) >> 33; }

Instead of taking the integer part of the fixed-point quotient, this takes its fractional part, brings it to the target range for the remainder, and then takes the integer part of that. It's probably almost the same performance - still two multiplies, one shift, and one other cheap operation (subtraction or bitwise-AND).

If we replace 33 with 32 and change the constant accordingly, then maybe the mask can be optimized out (just use the low-half register as input to the second multiplication) and the right shift sometimes too (just use the high-half register). This also works in my testing, but only for part of the 32-bit input range:

unsigned mod3(unsigned x) { return (((x * 0x55555556u) & ((1ull << 32) - 1)) * 3) >> 32; }

Edit: here's an updated version that works for the full 32-bit input range:

unsigned mod3(unsigned x) { return ((((x + 1) * 0x55555555u) & ((1ull << 32) - 1)) * 3) >> 32; }

or the same written in a way exposing MAD:

unsigned mod3(unsigned x) { return (((x * 0x55555555u + 0x55555555u) & ((1ull << 32) - 1)) * 3) >> 32; }

On an arch with 32-bit registers, like we effectively have on GPUs, this is probably just two 32x32 to 64 widening multiplies in a row, and no other operations - just using the right registers.

Edit 2: unfortunately, generalizing this approach still fails for me for high inputs:

#include <stdio.h>
#include <stdint.h>

/* Exhaustive check of the reciprocal-based remainder for all 32-bit i and j;
   breaks at the first mismatch for each j. */
int main(void)
{
	uint32_t j = 1;
	do {
		printf("Testing %u\n", j);
		uint32_t rec = 0xffffffffU / j;
		//uint32_t rec = (1ull << 32) / j;
		//uint32_t rec = ((1ull << 32) + j - 1) / j;
		uint32_t i = 0;
		do {
			uint32_t res = ((i + 1) * rec * (uint64_t)j) >> 32;
			//uint32_t res = (i * rec * (uint64_t)j) >> 32;
			if (res != i % j) {
				printf("%u %% %u = %u got %u\n", i, j, i % j, res);
				break;
			}
		} while (++i);
	} while (++j);
	return 0;
}

The failures with powers of 2 are not a big deal - it's easy to detect powers of 2 and use bitmask - but others are a problem.
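For completeness, the power-of-two detection mentioned above is the standard bit trick (my illustration):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t i = 123456789, j = 64;

	/* j is a power of two iff j != 0 and (j & (j - 1)) == 0;
	   then i % j can be computed as i & (j - 1) */
	if (j && !(j & (j - 1)))
		printf("%u %% %u = %u\n", i, j, i & (j - 1));	/* prints 21 */
	return 0;
}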
See also code and discussion at P-H-C/phc-winner-argon2#306
@magnumripper Just to be clear, I think you should merge this PR as-is now. Then create issue(s) to keep track of other ideas we discussed. Thanks!
Yeah, I'll probably merge tomorrow. I just need to analyze a bunch of tests I made (including with hundreds of millions of hashes). No real problems seen, but perhaps one or two bitmap size/level decisions need a little tweaking: the "4-step bitmaps" end up with different requirements and a smaller effective mask now, compared to the 128-bit version, as we only have 64 bits to pick from. It's faster than before anyway, but it can be even better. That, and I'll refactor one variable name in the new file.
I have since found out that 320M hashes will not hit that 31-bit limit. Exactly 691225600 hashes would, if someone were to try to fit them. I figured out very trivial changes to bump that limit and allow all 32 bits and 4G (1 GB) bitmaps (basically passing ...).
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
Force-pushed from 45d6a5c to 0c253e7.
Thank you, @claudioandre-br! I'd say a bigger problem here is that all of those speeds are so very low. There's something very wrong going on, both before and after this PR's changes. Per the device info, I'd expect speeds of around 1/4 of Tahiti, so around 2G for the new code, but you were and still are only getting around 300M.
I removed my previous comments. Correct values are:
From: (...)
To: (...)
I realized that since we pass the used-to-be-runtime-variable ...
Support largest bitmaps (0xffffffff) without failing kernel build. Fix a bug where some workgroup sizes would fail to properly copy bitmap to local memory. While at it, for DES/LM OpenCL formats fix a similar copy to local to avoid modulo arithmetic. Related to #5246
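The modulo-free copy to local memory mentioned in this commit message is, in its general form, a strided loop; a sketch in OpenCL C (my illustration, not the actual kernel code):

__kernel void bitmap_to_local_example(__global const uint *bitmap,
                                      __local uint *lbitmap, uint words)
{
	/* strided copy: works for any work-group size, no modulo needed */
	for (uint i = get_local_id(0); i < words; i += get_local_size(0))
		lbitmap[i] = bitmap[i];
	barrier(CLK_LOCAL_MEM_FENCE);
	/* ... lookups against lbitmap would follow here ... */
}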
1 block of MD4 is up to 27 characters.
2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max).
As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters.
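For reference, these limits follow from MD4 padding over UTF-16 input; a quick check of the arithmetic (my illustration, not part of the commit message):

#include <stdio.h>

int main(void)
{
	/* NT hashes UTF-16LE input (2 bytes/char); the final MD4 block needs
	   1 byte of 0x80 padding plus an 8-byte length field, so n 64-byte
	   blocks fit floor((64*n - 9) / 2) characters. */
	for (int blocks = 1; blocks <= 5; blocks++)
		printf("%d block(s): %d chars\n", blocks, (64 * blocks - 9) / 2);
	/* prints 27, 59, 91, 123, 155 - the last capped to 125 by the core's
	   maximum plaintext length */
	return 0;
}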
Closes #5245