NT-long OpenCL: Support lengths up to 125 #5246

magnumripper · 2023-03-09T22:39:52Z

1 block of MD4 is up to 27 characters.
2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max).

As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters.

solardiz · 2023-03-09T23:07:21Z

As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters.

I guess you mean with mask mode only. Transfers from host probably do become slower, but then those non-mask speeds are such that the attack better be run using the CPU format. However, then there's the case of small mask (e.g., one or two mask characters added to host-provided base words), where usage of OpenCL does help, yet transfers from host also affect the overall speed.

magnumripper · 2023-03-09T23:17:02Z

As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters.

I guess you mean with mask mode only. Transfers from host probably do become slower, but then those non-mask speeds are such that the attack better be run using the CPU format. However, then there's the case of small mask (e.g., one or two mask characters added to host-provided base words), where usage of OpenCL does help, yet transfers from host also affect the overall speed.

Right. We do have "compressed" buffer transfer though, so transfers only become slower when there are (many) actually long words in there. BTW I see this format doesn't utilize early partial transfer of the key buffer, I should try adding it as well.

magnumripper · 2023-03-10T17:19:51Z

At 1377755 we're back to a single format, and single-block speed seem to have (best of ~10 runs each on 2080ti) less than 1% penalty. I will investigate the "binary size 8" idea further but I don't really expect it to really fly so we might want to merge at this point (maybe we want to anyway).

Oh, and I'll also experient with partial key buffer transfer, but that too is somewhat bug prone so would be separate anyway and may come later.

claudioandre-br · 2023-03-10T18:36:02Z

we might want to merge at this point (maybe we want to anyway).

The patch is clean, small, and well organized. Merge it as is (to test and mature new ideas) is a very good option.

solardiz

Assuming this passes tests, it looks good to me and can be merged.

solardiz · 2023-03-12T14:49:33Z

I think we have unnecessarily too much code divergence here:

#if PLAINTEXT_LENGTH > 27
		if (md4_size > 27)
			nt_crypt_long(hash, nt_buffer, md4_size);
		else
#endif
			nt_crypt(hash, nt_buffer, md4_size);

If md4_size > 27 varies within a 64 candidate password block, the code above will probably result in a 3x+ slowdown (compared to all passwords being <= 27), whereas it should likely be possible to have 2x+ (below 3x) by having the threads execute the same code at the same time for one of the limbs of the longer passwords (assuming they fit in 2 limbs).

I arrive at the 3x+ figure as follows: the md4_size > 27 condition will result in an execution mask being set (instead of a conditional branch, which is only possible when all threads follow the same path), and so the two "calls" into nt_crypt_long and nt_crypt are made sequentially with the mask inverted. It's 2 limbs + 1 limb = processing cost equal to that of 3 limbs.

In other words, the high-level separation into nt_crypt_long vs. nt_crypt, while very efficient when all threads follow the same code path, unnecessarily duplicates and masks some code that could have been shared by threads even when they don't follow fully the same path (but do for one limb).

If addressing this, please keep the commits so far as-is, but add a commit addressing this concern. I didn't look into how exactly it can be addressed, I just suspect that it can be by implementing the check inside nt_crypt.

OTOH, with usage of on-device mask such divergence should be very rare, and when on-device mask is not used the bottleneck is in the communication with host and in the host code. So the overall difference between these approaches is probably tiny (much smaller than 3 vs. 2).

magnumripper · 2023-03-12T14:50:34Z

Unfortunately it seems to get a significant performance hit on gfx900 [Radeon RX Vega] (super's AMD-APP 2766.4). It went from 20467M to only 13930M. Limiting from 125 to 59 didn't help much (14163M).

I can't really see why it happens but it's a blocker, right? Perhaps we'll have to go back to a separate nt-long format. It would be nice to know performance with current AMD drivers though.

Maybe we could detect devices before host code decides on max. length, but that would be very confusing for users.

magnumripper · 2023-03-12T14:53:02Z

I think we have unnecessarily too much code divergence here

That sounds plausible. I'm not quite sure how to address it though. Need to think.

magnumripper · 2023-03-12T14:59:01Z

OK I think I see how to do it. Let's always use the original code for the first block. Right before we stop and skip steps, we put the branch. So we either continue with one or more extra blocks (using normal MD4 macros) or we do the single-block special magic in the end.

We'll lose a tiny bit of performance for single-block crypts as we can no longer assume last word is zero. But adding branches for that is likely worse.

solardiz · 2023-03-12T15:02:09Z

I can't really see why it happens but it's a blocker, right?

Yes, it's a pretty bad performance regression. In case it's caused by code size increase, my suggestion to share more code between the paths could help there. Having less code total could also help the compiler reduce register pressure.

Let's always use the original code for the first block. [...] we can no longer assume last block is zero.

What about using the original code for the last block instead?

magnumripper · 2023-03-12T15:04:24Z

What about using the original code for the last block instead?

That will instead switch from using INIT_A..INIT_D in the first steps (and even optimized at that but I suspect the compiler would do it for us anyway) to reading them from the buffer in the first steps.

magnumripper · 2023-03-12T15:06:25Z

What about using the original code for the last block instead?

That will instead switch from using INIT_A..INIT_D in the first steps (and even optimized at that but I suspect the compiler would do it for us anyway) to reading them from the buffer in the first steps.

Perhaps we can do a stranger thing: Start with the first half of the original MD4, then a branch for possibly more blocks "in the middle of it". But it would be hard to follow and perhaps bug prone.

solardiz · 2023-03-12T15:08:25Z

Is there any way to hint the compiler to keep code specific to longer passwords separately, like if (unlikely(md4_size > 27))? Does OpenCL have anything like that?

magnumripper · 2023-03-12T15:14:58Z

Is there any way to hint the compiler to keep code specific to longer passwords separately, like if (unlikely(md4_size > 27))? Does OpenCL have anything like that?

I never heard of that. I suspect some compilers might have it but others will fail because it's not in the standard AFAIK.

solardiz · 2023-03-12T15:31:53Z

Can we just try __builtin_expect and see whether/where it works and then possibly use it conditionally?

magnumripper · 2023-03-12T17:04:58Z

I implemented it like you said, with original code calculating the last or only block. Right now I need three branches in there.

Can we just try __builtin_expect and see whether/where it works and then possibly use it conditionally?

Interestingly enough it works on Mac (all devices), and on Linux both with AMD and nvidia. I'm not seeing a lot of difference though, but a little: AMD now does 14786M with full length support of 125.

Edit: POCL also supports it. Given that virtually all drivers use llvm under the hood AFAIK, there's probably not many platforms that don't!

magnumripper · 2023-03-12T17:25:15Z

I implemented it like you said, with original code calculating the last or only block. Right now I need three branches in there.

Strangely enough, that very change ruined performance on Apple with AMD Radeon Pro Vega 20, from 6G c/s to 2G c/s. But that platform is so flakey we should care much about it - Linux/AMD is more important.

solardiz · 2023-03-12T20:55:10Z

I implemented it like you said, with original code calculating the last or only block. Right now I need three branches in there.

It's a pity we lose optimizations in the first few steps, such as the 0x77777777 constant. I think we can accept a little longer divergence, keeping the first few steps specialized (based on whether a/b/c/d are the constants or not) in the if/else above. We should as our first priority optimize for the <=27 case, and reduce divergence as a second priority.

magnumripper · 2023-03-12T20:58:12Z

Yes I already have versions that do exactly that. Still too much hit on AMD. I'm currently trying to reverse a tad more of the MD4 in order to compensate but I fear I'll need to go the 64-bit-binary route (which, once it's all finished and bug free is a very good thing anyway of course) to get it really good.

magnumripper · 2023-03-12T21:00:36Z

I'm currently trying to reverse a tad more of the MD4

No, that's not possible.

I fear I'll need to go the 64-bit-binary route

This would make it possible though.

magnumripper · 2023-03-28T18:09:32Z

Regarding testing and benchmarking, I suggest that you try with very high hash counts as well. IIRC, we previously tested 320M (or was it 306M?), which took something like 11G of GPU RAM - perhaps repeat something similar for old vs. new code. What's the maximum that fits in your 2080Ti?

The bitmap size(s) selection logic will handle it just the same (maxes out at 512M bitmap, i.e. 64 MB). I'm thinking we'll actually disable bitmaps completely above whatever threshold where we get too many false positives from it - it only hurts performance (and suck memory) from that point.

Theoretically I thought this would actually happen already at eg. 32M loaded hashes: At that point, with the max. of 512M bitmap, we have an "effective mask" of 5 bits unless I calculate it wrongly? But it seems that even just a couple (literally) of "bits" does help - I need to investigate more.

Now just a single branch in end of function.

This is well tested code in other formats. About 10% boost on 2080ti, against 5300 hashes and pure wordlist, no mask.

Found when implementing a 64-bit version of it. The impact would be suboptimal bitmap parameters when adjusting bitmap as the number of remaining hashes decreases.

magnumripper · 2023-03-28T19:26:48Z

Theoretically I thought this would actually happen already at eg. 32M loaded hashes: At that point, with the max. of 512M bitmap, we have an "effective mask" of 5 bits unless I calculate it wrongly? But it seems that even just a couple (literally) of "bits" does help - I need to investigate more.

OK, 32M hashes doesn't even get the max, it gets "only" a 256M bitmap - but it's probably too small. According to my debug code's calculations it's effectively 3 bits (0x7). The empirical results confirm that this figure is in the correct ballpark, with an actual 1/9 false positives in a run (the debug kernel is actually counting them):

Loaded 33554432 password hashes with no different salts (NT-opencl [MD4 OpenCL])
offset table size 47437676, hash table size 268730008, loaded hashes size 268435456
33554432 hashes: bitmap size 1x268435456 bits, 33554432 bytes, mask 0xfffffff, effectively 3 bits (0x7)
LWS=256 GWS=69632 (272 blocks) x9025
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
0g 0:00:01:31 DONE (2023-03-28 20:54) 0g/s 2210Mp/s 2210Mc/s 74182TC/s Dev#2:70°C util:100% laac>|..laac>|
Session completed.
201183043750 crypts, fp from bitmap: 1/9 (11.75%), 23641282113 hash table lookups

I'll sort out the exact figures and limits later (doesn't really matter for this PR) but anyway I'm pretty sure that at some point we're much better off not using the bitmap at all - and definitely so at 320M loaded hashes...

EDIT sorry, the above was the experimental code with new algorithms. Here's bleeding-jumbo - just with the debug code added:

Loaded 33554432 password hashes with no different salts (NT-opencl [MD4 OpenCL])
offset table size 47437644, hash table size 268730008, loaded hashes size 536870912
33554432 hashes: bitmap size 1x268435456 bits, 33554432 bytes, mask 0xfffffff, effectively 3 bits (0x7)
LWS=256 GWS=69632 (272 blocks) x9025 
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
0g 0:00:01:31 DONE (2023-03-28 21:31) 0g/s 2210Mp/s 2210Mc/s 74182TC/s Dev#2:69°C util:100% laac>|..laac>|
Session completed. 
201183043750 crypts, fp from bitmap: 1/9 (11.75%), 23640549466 hash table lookups

The selections were the same though, and the speed - but bleeding uses more GPU memory for full binaries.

magnumripper · 2023-03-29T00:06:31Z

The bitmap size(s) selection logic will handle it just the same (maxes out at 512M bitmap, i.e. 64 MB)

Actually the code (even the current in bleeding) has it wrong - if bitmap size really goes to the hardcoded limit of 512M, the kernel will fail with #error BITMAP_MASK too large in a macro assertion, because we're hitting 2**32 and current code use 32-bit integers.

The actual number of hashes you and/or Sayantan tested at some point appears to be 320294464, it's even in there:

$ git grep -FB1 'This many number of hashes'
bt.c-   if (num_ld_hashes > 320294464)
bt.c:           fprintf(stderr, "This many number of hashes have never been tested before and might not succeed!!\n");

...but if that worked, the limit must have been different at the time: We can only go to 256M. I think I will leave this as-is for now - like I said I'm planning more changes soon (in much smaller portions than this PR) and they will get it right - plus possibly disable bitmaps at some point.

EDIT the hardcoded limit is 512MB bitmap in bytes which means 4 G bits (hence overflowing int32). But what I wrote above is basically correct - 320294464 would end up oversized and current kernel would bug out.

Build log: <program source>:42:2: error: BITMAP_SIZE_BITS_LESS_ONE too large
#error BITMAP_SIZE_BITS_LESS_ONE too large
 ^

So the correct kernel limit is 256 MB bitmap, which is 2 Gbits and a mask of 0x7fffffff. Besides, I doubt a bitmap that large will be more effective than just looking up the two integers, but I will test that. After all, such lookup includes modulus operations that might be expensive.

solardiz · 2023-03-29T09:01:03Z

Maybe add a comment on log2_128 and pow128 that "faster algorithms exist, but these are optimized for size and simplicity". BTW, we do implement a faster algorithm in powi in charset.c.

solardiz · 2023-03-29T14:59:27Z

lookup includes modulus operations that might be expensive.

A way to speed them up would be to precompute (at runtime, but out of loop - perhaps on host) a multiplicative reciprocal, then compute the remainder via division-optimized-into-multiplication, multiplication, and subtraction. Here's an example by @ch3root:

unsigned mod3(unsigned x) { return x - (x * 2863311531ull >> 33) * 3; }

magnumripper · 2023-03-29T20:19:31Z

Maybe add a comment on log2_128 and pow128 that "faster algorithms exist, but these are optimized for size and simplicity". BTW, we do implement a faster algorithm in powi in charset.c.

Yeah they don't need to be fast at all. That powi is simple enough though - I'd just use uint128_t instead of double. Perhaps I'll throw it in while I amend something else.

lookup includes modulus operations that might be expensive.

A way to speed them up would be to precompute (at runtime, but out of loop - perhaps on host) a multiplicative reciprocal, then compute the remainder via division-optimized-into-multiplication, multiplication, and subtraction. Here's an example by @ch3root:
unsigned mod3(unsigned x) { return x - (x * 2863311531ull >> 33) * 3; }

That is interesting, provided I can wrap my head around it. On the other hand, my tests with massive numbers of loaded hashes seem to indicate we actually do pretty fine with the bitmaps, contrary to what I thought. Especially with some tweaks I have in mind. More on that later. I seem to be limited by host RAM more than anything else right now (and we already mentioned ways to tackle that).

solardiz · 2023-03-29T20:58:07Z

provided I can wrap my head around it.

Multiplication by the large constant turns the integer number into fixed-point and at the same time multiplies it by the desired fraction. The constant is the fraction of 1.0 in that fixed-point notation. The right shift converts the fixed-point number back to integer (drops the fractional part).

solardiz · 2023-03-30T00:57:26Z

unsigned mod3(unsigned x) { return x - (x * 2863311531ull >> 33) * 3; }

This can also be done without subtraction:

unsigned mod3(unsigned x) { return (((x * 2863311531ull) & ((1ull << 33) - 1)) * 3) >> 33; }

Instead of taking the integer part of the fixed-point quotient, this takes its fractional part, brings it to the target range for the remainder, and then takes the integer part of that. It's probably almost the same performance - still two multiplies, one shift, and one other cheap operation (subtraction or bitwise-AND).

If we replace 33 with 32 and change the constant accordingly, then maybe the mask can be optimized out (just use the low-half register as input to the second multiplication) and the right shift sometimes too (just use the high-half register). This also works in my testing, but only for x up to 2**31-1:

unsigned mod3(unsigned x) { return (((x * 0x55555556u) & ((1ull << 32) - 1)) * 3) >> 32; }

Edit: here's an updated version that works for the full 32-bit input range:

unsigned mod3(unsigned x) { return ((((x + 1) * 0x55555555u) & ((1ull << 32) - 1)) * 3) >> 32; }

or the same written in a way exposing MAD:

unsigned mod3(unsigned x) { return (((x * 0x55555555u + 0x55555555u) & ((1ull << 32) - 1)) * 3) >> 32; }

On an arch with 32-bit registers, like we effectively have on GPUs, this is probably just two 32x32 to 64 widening multiplies in a row, and no other operations - just using the right registers.

Edit 2: unfortunately, generalizing this approach still fails for me for high inputs:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t j = 1;
	do {
		printf("Testing %u\n", j);
		uint32_t rec = 0xffffffffU / j;
		//uint32_t rec = (1ull << 32) / j;
		//uint32_t rec = ((1ull << 32) + j - 1) / j;
		uint32_t i = 0;
		do {
			uint32_t res = ((i + 1) * rec * (uint64_t)j) >> 32;
			//uint32_t res = (i * rec * (uint64_t)j) >> 32;
			if (res != i % j) {
				printf("%u %% %u = %u got %u\n", i, j, i % j, res);
				break;
			}
		} while (++i);
	} while (++j);

	return 0;
}

Testing 1
Testing 2
2147483648 % 2 = 0 got 1
Testing 3
Testing 4
1073741824 % 4 = 0 got 3
Testing 5
Testing 6
1073741824 % 6 = 4 got 3
Testing 7
1073741824 % 7 = 1 got 0
Testing 8
536870912 % 8 = 0 got 7
Testing 9
1073741824 % 9 = 1 got 0
Testing 10
715827882 % 10 = 2 got 1
Testing 11
1073741824 % 11 = 1 got 0
Testing 12
1073741824 % 12 = 4 got 3
Testing 13
477218588 % 13 = 2 got 1
Testing 14
1073741824 % 14 = 8 got 7
Testing 15
Testing 16
268435456 % 16 = 0 got 15
Testing 17
Testing 18
1073741824 % 18 = 10 got 9
Testing 19
715827882 % 19 = 13 got 12
Testing 20
268435456 % 20 = 16 got 15

The failures with powers of 2 are not a big deal - it's easy to detect powers of 2 and use bitmask - but others are a problem.

solardiz · 2023-03-30T01:06:03Z

A way to speed them up would be to precompute (at runtime, but out of loop - perhaps on host) a multiplicative reciprocal

See also code and discussion at P-H-C/phc-winner-argon2#306

solardiz · 2023-03-30T15:53:30Z

@magnumripper Just to be clear, I think you should merge this PR as-is now. Then create issue(s) to keep track of other ideas we discussed. Thanks!

magnumripper · 2023-03-30T18:16:56Z

@magnumripper Just to be clear, I think you should merge this PR as-is now. Then create issue(s) to keep track of other ideas we discussed. Thanks!

Yeah I'll probably merge tomorrow. I just need to analyze a bunch of tests I made (including with hundreds of millions of hashes). No real problems seen but perhaps one or two bitmap size/level decisions need a little tweaking: The "4-step bitmaps" end up with different requirements and a smaller effective mask now, compared to the 128-bits version, as we only have 64 bits to pick from. It's faster than before anyway but it can be even better.

That, and I'll refactor one variable name in the new file opencl_hash_check_64.c. It ended up different from opencl_hash_check_128.c due to code copying from LM/DES files (bitmap_size vs. bitmap_size_bits), and that makes it unnecessary hard to compare them side by side. Better fix that before the new file enters the repo.

magnumripper · 2023-03-30T18:27:32Z

the hardcoded limit is 512MB bitmap in bytes which means 4 G bits (hence overflowing int32). But what I wrote above is basically correct - 320294464 would end up oversized and current kernel would bug out.
(...)
So the correct kernel limit is 256 MB bitmap, which is 2 Gbits and a mask of 0x7fffffff. Besides, I doubt a bitmap that large will be more effective than just looking up the two integers, but I will test that. After all, such lookup includes modulus operations that might be expensive.

I have since found out that 320M hashes will not hit that 31-bit limit. Exactly 691225600 hashes would, if someone would try to fit them. I figured out very trivial changes to bump that limit and allow all 32 bits and 4G (1 GB) bitmaps (basicly passing 0xffffffffU to the kernel instead of 0x100000000 which would need UL or overflow). I will include those fixes for NT-opencl in this PR, but the fixes to other 128-bit formats (which I have ready) will go in a separate PR.

Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245

solardiz · 2023-03-31T19:16:24Z

Thank you, @claudioandre-br! I'd say a bigger problem here is all of those speeds are so very low. There's something very wrong going on, both before and after this PR's changes. Per the device info, I'd expect speeds of around 1/4 of Tahiti, so around 2G for the new code, but you were and still are only getting around 300M. Was GWS=512 autodetected? Can you try larger? Can you show -v=5 of the autodetection? Optimal might be GWS=32768.

claudioandre-br · 2023-04-01T17:33:37Z

I removed my previous comments. Correct values are:

From:

C:\Temp\rolling> run\john --test=5 --format=nt-opencl --gws=32768
Device 1: gfx902 [AMD Radeon(TM) Vega 8 Graphics]
Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... LWS=64 GWS=32768 (512 blocks) x2470 DONE
Raw:    1433M c/s real, 34609M c/s virtual

To:

C:\Temp\debug> run\john --test=5 --format=nt-opencl --verb=6
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
Device 1: gfx902 [AMD Radeon(TM) Vega 8 Graphics]
Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... Loaded 44 hashes with 1 different salts to test db from test vectors
TestDB LWS=8 GWS=48 (6 blocks) PASS,
Test mask: ?a?a?l?u?d?d?s
Expecting "no" false positives
44 hashes: bitmap 4x32768 bits, mask 0x7fff, effectively 38 bits (0x3fffffffff), 16 KiB (local)
Offset tbl 112 B, Hash tbl 360 B, Results 532 B, Dupe bmp 8 B, TOTAL on GPU: 17396 B
Internal mask, multiplier: 2470 (target: 2048)
Calculating best GWS for LWS=64; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
Tuning for password length 7
key xfer: 3.680 us, idx xfer: 1.919 us, crypt: 4.018 ms, res xfer: 3.360 us
gws:      1024  627949K c/s   627949471 rounds/s    4.027 ms per crypt_all()!
key xfer: 4.640 us, idx xfer: 2.880 us, crypt: 4.018 ms, res xfer: 3.520 us
gws:      2048    1255M c/s  1255400254 rounds/s    4.029 ms per crypt_all()+
key xfer: 5.440 us, idx xfer: 3.200 us, crypt: 12.256 ms, res xfer: 3.360 us
gws:      4096  824611K c/s   824611050 rounds/s   12.268 ms per crypt_all()
key xfer: 7.200 us, idx xfer: 4.800 us, crypt: 22.505 ms, res xfer: 3.040 us
gws:      8192  898468K c/s   898468970 rounds/s   22.520 ms per crypt_all()
key xfer: 12 us, idx xfer: 6.560 us, crypt: 38.292 ms, res xfer: 3.360 us
gws:     16384    1056M c/s  1056234392 rounds/s   38.313 ms per crypt_all()
key xfer: 20.480 us, idx xfer: 10.720 us, crypt: 65.417 ms, res xfer: 9.440 us
gws:     32768    1236M c/s  1236470119 rounds/s   65.458 ms per crypt_all()
key xfer: 35.040 us, idx xfer: 19.200 us, crypt: 111.924 ms, res xfer: 3.200 us
gws:     65536    1445M c/s  1445536207 rounds/s  111.981 ms per crypt_all()+
key xfer: 67.840 us, idx xfer: 33.920 us, crypt: 252.095 ms (exceeds 200 ms)
key xfer: 21.120 us, idx xfer: 10.880 us, crypt: 61.302 ms, res xfer: 3.040 us
gws:     32768    1319M c/s  1319529112 rounds/s   61.337 ms per crypt_all()-
Calculating best LWS for GWS=65536
Note: Profiling timers seem buggy
Testing LWS=64 GWS=65536 ... 1266874552.150 s
Testing LWS=128 GWS=65536 ... 1266874552.118 s
Testing LWS=256 GWS=65536 ... 1266874552.101 s
Calculating best GWS for LWS=64; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
key xfer: 3.520 us, idx xfer: 1.919 us, crypt: 3.967 ms, res xfer: 3.200 us
gws:       512  318055K c/s   318055611 rounds/s    3.976 ms per crypt_all()!
key xfer: 4 us, idx xfer: 1.760 us, crypt: 4.019 ms, res xfer: 3.040 us
gws:      1024  627949K c/s   627949471 rounds/s    4.027 ms per crypt_all()+
key xfer: 4.320 us, idx xfer: 2.720 us, crypt: 4.019 ms, res xfer: 3.200 us
gws:      2048    1255M c/s  1255450105 rounds/s    4.029 ms per crypt_all()+
key xfer: 5.760 us, idx xfer: 3.360 us, crypt: 10.686 ms, res xfer: 4.160 us
gws:      4096  945525K c/s   945525233 rounds/s   10.700 ms per crypt_all()
key xfer: 22.560 us, idx xfer: 14.080 us, crypt: 23.233 ms, res xfer: 3.360 us
gws:      8192  869395K c/s   869395443 rounds/s   23.273 ms per crypt_all()
key xfer: 12.480 us, idx xfer: 7.200 us, crypt: 36.503 ms, res xfer: 4.320 us
gws:     16384    1107M c/s  1107904719 rounds/s   36.527 ms per crypt_all()
key xfer: 60.480 us, idx xfer: 33.120 us, crypt: 70.694 ms, res xfer: 3.040 us
gws:     32768    1143M c/s  1143324676 rounds/s   70.790 ms per crypt_all()
key xfer: 35.200 us, idx xfer: 18.560 us, crypt: 123.492 ms, res xfer: 3.200 us
gws:     65536    1310M c/s  1310197194 rounds/s  123.549 ms per crypt_all()+
key xfer: 208.160 us, idx xfer: 106.560 us, crypt: 252.464 ms (exceeds 200 ms)
key xfer: 20.160 us, idx xfer: 10.720 us, crypt: 61.423 ms, res xfer: 3.200 us
gws:     32768    1316M c/s  1316952651 rounds/s   61.457 ms per crypt_all()!!
key xfer: 35.840 us, idx xfer: 20.800 us, crypt: 34.542 ms, res xfer: 3.360 us
gws:     16384    1169M c/s  1169511901 rounds/s   34.602 ms per crypt_all()-
TestDB LWS=64 GWS=32768 (512 blocks) x2470 DONE
Raw:    1479M c/s real, 50635M c/s virtual

solardiz · 2023-04-02T13:25:14Z

precompute (at runtime, but out of loop - perhaps on host) a multiplicative reciprocal

I realized that since we pass the used-to-be-runtime-variable HASH_TABLE_SIZE from host to OpenCL kernel as its build-time constant, the OpenCL compiler has opportunity to perform the same optimization on its own - and maybe it already does!

Support largest bitmaps (0xffffffff) without failing kernel build. Fix a bug where some workgroup sizes would fail to properly copy bitmap to local memory. While at it, for DES/LM OpenCL formats fix a similar copy to local to avoid modulo arithmetic. Related to openwall#5246

Support largest bitmaps (0xffffffff) without failing kernel build. Fix a bug where some workgroup sizes would fail to properly copy bitmap to local memory. While at it, for DES/LM OpenCL formats fix a similar copy to local to avoid modulo arithmetic. Related to #5246

magnumripper marked this pull request as draft March 9, 2023 22:40

magnumripper force-pushed the nt-opencl-long branch from 65b658c to edcd16e Compare March 9, 2023 23:12

magnumripper force-pushed the nt-opencl-long branch 2 times, most recently from 64fd4b0 to e2126c7 Compare March 10, 2023 15:44

magnumripper marked this pull request as ready for review March 10, 2023 17:19

solardiz approved these changes Mar 11, 2023

View reviewed changes

magnumripper force-pushed the nt-opencl-long branch from e45ad09 to 9bc21a4 Compare March 12, 2023 17:17

magnumripper force-pushed the nt-opencl-long branch 2 times, most recently from 7fa91bc to df2d50a Compare March 12, 2023 17:57

magnumripper marked this pull request as draft March 12, 2023 18:02

magnumripper mentioned this pull request Mar 12, 2023

Alternative solution nt kernel for long plaintexts #5250

Closed

magnumripper added 3 commits March 28, 2023 20:33

NT-opencl: Optimize code and avoid diverging threads

1c637f4

Now just a single branch in end of function.

NT-OpenCL: Early partial transfer of keybuffer

8ef84a4

This is well tested code in other formats. About 10% boost on 2080ti, against 5300 hashes and pure wordlist, no mask.

ocl_hc_128_select_bitmap(): Bug fixes

e17ce07

Found when implementing a 64-bit version of it. The impact would be suboptimal bitmap parameters when adjusting bitmap as the number of remaining hashes decreases.

magnumripper force-pushed the nt-opencl-long branch from 9267abb to 3e57fb0 Compare March 28, 2023 18:34

magnumripper force-pushed the nt-opencl-long branch from 3e57fb0 to 45d6a5c Compare March 28, 2023 23:43

NT-opencl: 64-bit binary size

0c253e7

Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245

magnumripper force-pushed the nt-opencl-long branch from 45d6a5c to 0c253e7 Compare March 31, 2023 11:36

magnumripper merged commit 6ed33a7 into openwall:bleeding-jumbo Mar 31, 2023

magnumripper deleted the nt-opencl-long branch March 31, 2023 11:55

solardiz mentioned this pull request Dec 10, 2023

Optimization: Removed unnecessary divides P-H-C/phc-winner-argon2#306

Open

NT-long OpenCL: Support lengths up to 125 #5246

NT-long OpenCL: Support lengths up to 125 #5246

Conversation

magnumripper commented Mar 9, 2023

solardiz commented Mar 9, 2023

magnumripper commented Mar 9, 2023

magnumripper commented Mar 10, 2023 • edited Loading

claudioandre-br commented Mar 10, 2023

solardiz left a comment

Choose a reason for hiding this comment

solardiz commented Mar 12, 2023

magnumripper commented Mar 12, 2023

magnumripper commented Mar 12, 2023

magnumripper commented Mar 12, 2023 • edited Loading

solardiz commented Mar 12, 2023

magnumripper commented Mar 12, 2023

magnumripper commented Mar 12, 2023

solardiz commented Mar 12, 2023

magnumripper commented Mar 12, 2023

solardiz commented Mar 12, 2023 • edited Loading

magnumripper commented Mar 12, 2023 • edited Loading

magnumripper commented Mar 12, 2023

solardiz commented Mar 12, 2023

magnumripper commented Mar 12, 2023

magnumripper commented Mar 12, 2023

magnumripper commented Mar 28, 2023

magnumripper commented Mar 28, 2023 • edited Loading

magnumripper commented Mar 29, 2023 • edited Loading

solardiz commented Mar 29, 2023

solardiz commented Mar 29, 2023

magnumripper commented Mar 29, 2023

solardiz commented Mar 29, 2023

solardiz commented Mar 30, 2023 • edited Loading

solardiz commented Mar 30, 2023

solardiz commented Mar 30, 2023

magnumripper commented Mar 30, 2023

magnumripper commented Mar 30, 2023 • edited Loading

solardiz commented Mar 31, 2023

claudioandre-br commented Apr 1, 2023

solardiz commented Apr 2, 2023

magnumripper commented Mar 10, 2023 •

edited

Loading

magnumripper commented Mar 12, 2023 •

edited

Loading

solardiz commented Mar 12, 2023 •

edited

Loading

magnumripper commented Mar 12, 2023 •

edited

Loading

magnumripper commented Mar 28, 2023 •

edited

Loading

magnumripper commented Mar 29, 2023 •

edited

Loading

solardiz commented Mar 30, 2023 •

edited

Loading

magnumripper commented Mar 30, 2023 •

edited

Loading