
NT-long OpenCL: Support lengths up to 125 #5246

Merged

merged 5 commits into openwall:bleeding-jumbo from nt-opencl-long on Mar 31, 2023

Conversation

magnumripper
Member

One block of MD4 fits up to 27 characters; two blocks fit 59, three fit 91, four fit 123, and five fit 125 (due to the core maximum).
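For reference, those limits follow from the UTF-16LE encoding plus MD4's minimum padding (one 0x80 byte and an 8-byte length field per message). A back-of-the-envelope sketch of that arithmetic (illustration only, not code from the tree):

#include <stdio.h>

int main(void)
{
	const int core_max = 125;	/* John's core maximum plaintext length */

	for (int blocks = 1; blocks <= 5; blocks++) {
		int avail_bytes = blocks * 64 - 8 - 1;	/* bytes left after the 0x80 byte and the length field */
		int max_chars = avail_bytes / 2;	/* UTF-16LE: two bytes per character */

		if (max_chars > core_max)
			max_chars = core_max;
		printf("%d block(s): up to %d characters\n", blocks, max_chars);
	}
	return 0;
}

which prints 27, 59, 91, 123 and 125.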

As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters.

Closes #5245

@magnumripper magnumripper marked this pull request as draft March 9, 2023 22:40
@solardiz
Member

solardiz commented Mar 9, 2023

As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters.

I guess you mean with mask mode only. Transfers from host probably do become slower, but then those non-mask speeds are such that the attack better be run using the CPU format. However, then there's the case of small mask (e.g., one or two mask characters added to host-provided base words), where usage of OpenCL does help, yet transfers from host also affect the overall speed.

@magnumripper
Member Author

As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters.

I guess you mean with mask mode only. Transfers from host probably do become slower, but then those non-mask speeds are such that the attack better be run using the CPU format. However, then there's the case of small mask (e.g., one or two mask characters added to host-provided base words), where usage of OpenCL does help, yet transfers from host also affect the overall speed.

Right. We do have "compressed" buffer transfer though, so transfers only become slower when there are (many) actually long words in there. BTW I see this format doesn't utilize early partial transfer of the key buffer, I should try adding it as well.

@magnumripper magnumripper force-pushed the nt-opencl-long branch 2 times, most recently from 64fd4b0 to e2126c7 on March 10, 2023 15:44
@magnumripper
Member Author

magnumripper commented Mar 10, 2023

At 1377755 we're back to a single format, and single-block speed seems to have less than a 1% penalty (best of ~10 runs each on a 2080ti). I will investigate the "binary size 8" idea further, but I don't really expect it to fly, so we might want to merge at this point (maybe we want to anyway).

Oh, and I'll also experiment with partial key buffer transfer, but that too is somewhat bug-prone, so it would be a separate change anyway and may come later.

@magnumripper magnumripper marked this pull request as ready for review March 10, 2023 17:19
@claudioandre-br
Member

we might want to merge at this point (maybe we want to anyway).

The patch is clean, small, and well organized. Merging it as is (to test and mature new ideas) is a very good option.

@solardiz solardiz left a comment (Member)

Assuming this passes tests, it looks good to me and can be merged.

@solardiz
Member

I think we have unnecessarily too much code divergence here:

#if PLAINTEXT_LENGTH > 27
		if (md4_size > 27)
			nt_crypt_long(hash, nt_buffer, md4_size);
		else
#endif
			nt_crypt(hash, nt_buffer, md4_size);

If md4_size > 27 varies within a 64-candidate password block, the code above will probably result in a 3x+ slowdown (compared to all passwords being <= 27), whereas it should likely be possible to get 2x+ (below 3x) by having the threads execute the same code at the same time for one of the limbs of the longer passwords (assuming they fit in 2 limbs).

I arrive at the 3x+ figure as follows: the md4_size > 27 condition will result in an execution mask being set (instead of a conditional branch, which is only possible when all threads follow the same path), and so the two "calls" into nt_crypt_long and nt_crypt are made sequentially with the mask inverted. It's 2 limbs + 1 limb = processing cost equal to that of 3 limbs.

In other words, the high-level separation into nt_crypt_long vs. nt_crypt, while very efficient when all threads follow the same code path, unnecessarily duplicates and masks some code that could have been shared by threads even when they don't follow fully the same path (but do for one limb).

If addressing this, please keep the commits so far as-is, but add a commit addressing this concern. I didn't look into how exactly it can be addressed, I just suspect that it can be by implementing the check inside nt_crypt.

OTOH, with usage of on-device mask such divergence should be very rare, and when on-device mask is not used the bottleneck is in the communication with host and in the host code. So the overall difference between these approaches is probably tiny (much smaller than 3 vs. 2).

@magnumripper
Member Author

Unfortunately it seems to get a significant performance hit on gfx900 [Radeon RX Vega] (super's AMD-APP 2766.4). It went from 20467M to only 13930M. Limiting from 125 to 59 didn't help much (14163M).

I can't really see why it happens but it's a blocker, right? Perhaps we'll have to go back to a separate nt-long format. It would be nice to know performance with current AMD drivers though.

Maybe we could detect devices before host code decides on max. length, but that would be very confusing for users.

@magnumripper
Member Author

I think we have unnecessarily too much code divergence here

That sounds plausible. I'm not quite sure how to address it though. Need to think.

@magnumripper
Member Author

magnumripper commented Mar 12, 2023

OK, I think I see how to do it. Let's always use the original code for the first block. Right before we would stop and skip steps, we put the branch: either we continue with one or more extra blocks (using the normal MD4 macros) or we do the single-block special magic at the end.

We'll lose a tiny bit of performance for single-block crypts, as we can no longer assume the last word is zero. But adding branches for that is likely worse.
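Roughly this shape (just a sketch with made-up helper names, not the actual kernel code):

inline void nt_crypt(uint *hash, uint *nt_buffer, uint md4_size)
{
	/* ... rounds for the first 64-byte block, shared by all threads ... */

#if PLAINTEXT_LENGTH > 27
	if (md4_size > 27) {
		/* continue with one or more extra blocks using the normal MD4 macros,
		   e.g. md4_extra_block(hash, nt_buffer + 16); ... */
	} else
#endif
	{
		/* single-block case: finish with the existing end-of-hash special
		   magic (reversed steps / early reject) */
	}
}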

@solardiz
Member

I can't really see why it happens but it's a blocker, right?

Yes, it's a pretty bad performance regression. In case it's caused by code size increase, my suggestion to share more code between the paths could help there. Having less code total could also help the compiler reduce register pressure.

Let's always use the original code for the first block. [...] we can no longer assume the last word is zero.

What about using the original code for the last block instead?

@magnumripper
Member Author

What about using the original code for the last block instead?

That would instead switch the first steps from using INIT_A..INIT_D (with optimizations for those known constants, though I suspect the compiler would do that for us anyway) to reading them from the buffer.

@magnumripper
Member Author

What about using the original code for the last block instead?

That would instead switch the first steps from using INIT_A..INIT_D (with optimizations for those known constants, though I suspect the compiler would do that for us anyway) to reading them from the buffer.

Perhaps we can do a stranger thing: start with the first half of the original MD4, then a branch for possibly more blocks "in the middle of it". But it would be hard to follow and perhaps bug-prone.

@solardiz
Member

Is there any way to hint the compiler to keep code specific to longer passwords separately, like if (unlikely(md4_size > 27))? Does OpenCL have anything like that?

@magnumripper
Member Author

Is there any way to hint the compiler to keep code specific to longer passwords separately, like if (unlikely(md4_size > 27))? Does OpenCL have anything like that?

I never heard of that. I suspect some compilers might have it but others will fail because it's not in the standard AFAIK.

@solardiz
Member

solardiz commented Mar 12, 2023

Can we just try __builtin_expect and see whether/where it works and then possibly use it conditionally?
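Something like this might do it (a sketch; HAVE_BUILTIN_EXPECT is a made-up guard that the host code would have to probe for and define per driver):

#ifdef HAVE_BUILTIN_EXPECT
#define unlikely(x)	__builtin_expect(!!(x), 0)
#else
#define unlikely(x)	(x)
#endif

	/* ... later, in the kernel ... */
	if (unlikely(md4_size > 27)) {
		/* extra-limb processing for the rare long candidates */
	}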

@magnumripper
Member Author

magnumripper commented Mar 12, 2023

I implemented it like you said, with original code calculating the last or only block. Right now I need three branches in there.

Can we just try __builtin_expect and see whether/where it works and then possibly use it conditionally?

Interestingly enough it works on Mac (all devices), and on Linux both with AMD and nvidia. I'm not seeing a lot of difference though, but a little: AMD now does 14786M with full length support of 125.

Edit: POCL also supports it. Given that virtually all drivers use llvm under the hood AFAIK, there probably aren't many platforms that don't!

@magnumripper
Member Author

I implemented it like you said, with original code calculating the last or only block. Right now I need three branches in there.

Strangely enough, that very change ruined performance on Apple with AMD Radeon Pro Vega 20, from 6G c/s to 2G c/s. But that platform is so flaky we shouldn't care much about it - Linux/AMD is more important.

@magnumripper magnumripper force-pushed the nt-opencl-long branch 2 times, most recently from 7fa91bc to df2d50a on March 12, 2023 17:57
@magnumripper magnumripper marked this pull request as draft March 12, 2023 18:02
@solardiz
Member

I implemented it like you said, with original code calculating the last or only block. Right now I need three branches in there.

It's a pity we lose optimizations in the first few steps, such as the 0x77777777 constant. I think we can accept a little longer divergence, keeping the first few steps specialized (based on whether a/b/c/d are the constants or not) in the if/else above. We should, as our first priority, optimize for the <=27 case, and reduce divergence as a second priority.

@magnumripper
Member Author

Yes, I already have versions that do exactly that. Still too much of a hit on AMD. I'm currently trying to reverse a tad more of the MD4 in order to compensate, but I fear I'll need to go the 64-bit-binary route (which, once it's all finished and bug-free, is a very good thing anyway) to get it really good.

@magnumripper
Member Author

I'm currently trying to reverse a tad more of the MD4

No, that's not possible.

I fear I'll need to go the 64-bit-binary route

This would make it possible though.

@magnumripper
Member Author

Regarding testing and benchmarking, I suggest that you try with very high hash counts as well. IIRC, we previously tested 320M (or was it 306M?), which took something like 11G of GPU RAM - perhaps repeat something similar for old vs. new code. What's the maximum that fits in your 2080Ti?

The bitmap size(s) selection logic will handle it just the same (maxes out at a 512M-bit bitmap, i.e. 64 MB). I'm thinking we'll actually disable bitmaps completely above whatever threshold gives too many false positives - beyond that point it only hurts performance (and sucks up memory).

Theoretically I thought this would already happen at e.g. 32M loaded hashes: at that point, with the max. 512M bitmap, we have an "effective mask" of 5 bits, unless I calculate it wrongly? But it seems that even just a couple (literally) of "bits" does help - I need to investigate more.

Now just a single branch in end of function.
This is well tested code in other formats.
About 10% boost on 2080ti, against 5300 hashes and pure wordlist, no mask.
Found when implementing a 64-bit version of it.  The impact would be
suboptimal bitmap parameters when adjusting bitmap as the number of
remaining hashes decreases.
@magnumripper
Member Author

magnumripper commented Mar 28, 2023

Theoretically I thought this would actually happen already at eg. 32M loaded hashes: At that point, with the max. of 512M bitmap, we have an "effective mask" of 5 bits unless I calculate it wrongly? But it seems that even just a couple (literally) of "bits" does help - I need to investigate more.

OK, 32M hashes doesn't even get the max, it gets "only" a 256M bitmap - but that's probably too small. According to my debug code's calculations it's effectively 3 bits (0x7). The empirical results confirm that this figure is in the correct ballpark, with an actual false-positive rate of 1/9 in a run (the debug kernel is actually counting them):

Loaded 33554432 password hashes with no different salts (NT-opencl [MD4 OpenCL])
offset table size 47437676, hash table size 268730008, loaded hashes size 268435456
33554432 hashes: bitmap size 1x268435456 bits, 33554432 bytes, mask 0xfffffff, effectively 3 bits (0x7)
LWS=256 GWS=69632 (272 blocks) x9025
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
0g 0:00:01:31 DONE (2023-03-28 20:54) 0g/s 2210Mp/s 2210Mc/s 74182TC/s Dev#2:70°C util:100% laac>|..laac>|
Session completed.
201183043750 crypts, fp from bitmap: 1/9 (11.75%), 23641282113 hash table lookups

I'll sort out the exact figures and limits later (doesn't really matter for this PR) but anyway I'm pretty sure that at some point we're much better off not using the bitmap at all - and definitely so at 320M loaded hashes...

EDIT sorry, the above was the experimental code with new algorithms. Here's bleeding-jumbo - just with the debug code added:

Loaded 33554432 password hashes with no different salts (NT-opencl [MD4 OpenCL])
offset table size 47437644, hash table size 268730008, loaded hashes size 536870912
33554432 hashes: bitmap size 1x268435456 bits, 33554432 bytes, mask 0xfffffff, effectively 3 bits (0x7)
LWS=256 GWS=69632 (272 blocks) x9025 
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
0g 0:00:01:31 DONE (2023-03-28 21:31) 0g/s 2210Mp/s 2210Mc/s 74182TC/s Dev#2:69°C util:100% laac>|..laac>|
Session completed. 
201183043750 crypts, fp from bitmap: 1/9 (11.75%), 23640549466 hash table lookups

The selections were the same though, and so was the speed - but bleeding uses more GPU memory for full binaries.
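For what it's worth, the measured 11.75% is exactly what a single-hash bitmap of that size should give. A quick back-of-the-envelope check (mine, not code from the tree):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double bits = 268435456.0;	/* the 2^28-bit bitmap from the logs above */
	double hashes = 33554432.0;	/* 2^25 loaded hashes */

	/* probability that an absent candidate still hits a set bit */
	printf("expected fp rate: %.2f%%\n", 100.0 * (1.0 - exp(-hashes / bits)));
	return 0;
}

prints 11.75%, matching the 1/9 counted by the debug kernel.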

@magnumripper
Member Author

magnumripper commented Mar 29, 2023

The bitmap size(s) selection logic will handle it just the same (maxes out at 512M bitmap, i.e. 64 MB)

Actually the code (even the current code in bleeding) has it wrong - if the bitmap size really goes to the hardcoded limit of 512M, the kernel will fail with #error BITMAP_MASK too large in a macro assertion, because we're hitting 2**32 and the current code uses 32-bit integers.

The actual number of hashes you and/or Sayantan tested at some point appears to be 320294464, it's even in there:

$ git grep -FB1 'This many number of hashes'
bt.c-   if (num_ld_hashes > 320294464)
bt.c:           fprintf(stderr, "This many number of hashes have never been tested before and might not succeed!!\n");

...but if that worked, the limit must have been different at the time: We can only go to 256M. I think I will leave this as-is for now - like I said I'm planning more changes soon (in much smaller portions than this PR) and they will get it right - plus possibly disable bitmaps at some point.

EDIT: the hardcoded limit is a 512 MB bitmap in bytes, which means 4 Gbits (hence overflowing a 32-bit integer). But what I wrote above is basically correct - 320294464 would end up oversized and the current kernel would bug out.

Build log: <program source>:42:2: error: BITMAP_SIZE_BITS_LESS_ONE too large
#error BITMAP_SIZE_BITS_LESS_ONE too large
 ^

So the correct kernel limit is a 256 MB bitmap, which is 2 Gbits and a mask of 0x7fffffff. Besides, I doubt a bitmap that large will be more effective than just looking up the two integers, but I will test that. After all, such a lookup includes modulus operations that might be expensive.

@solardiz
Member

Maybe add a comment on log2_128 and pow128 that "faster algorithms exist, but these are optimized for size and simplicity". BTW, we do implement a faster algorithm in powi in charset.c.

@solardiz
Member

lookup includes modulus operations that might be expensive.

A way to speed them up would be to precompute (at runtime, but out of loop - perhaps on host) a multiplicative reciprocal, then compute the remainder via division-optimized-into-multiplication, multiplication, and subtraction. Here's an example by @ch3root:

unsigned mod3(unsigned x) { return x - (x * 2863311531ull >> 33) * 3; }
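In generic form the idea could look something like this (my sketch, assuming a 32-bit x and a divisor d >= 2 known out of loop; uses <stdint.h> types):

	/* precomputed once, out of loop - e.g. on the host */
	uint32_t recip = (uint32_t)(0x100000000ull / d);	/* floor(2^32 / d) */

	/* per-candidate reduction: multiply, shift, multiply, subtract */
	uint32_t q = (uint32_t)(((uint64_t)x * recip) >> 32);	/* floor(x / d), possibly one too low */
	uint32_t r = x - q * d;
	if (r >= d)
		r -= d;		/* a single correction step suffices for 32-bit x */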

@magnumripper
Member Author

Maybe add a comment on log2_128 and pow128 that "faster algorithms exist, but these are optimized for size and simplicity". BTW, we do implement a faster algorithm in powi in charset.c.

Yeah they don't need to be fast at all. That powi is simple enough though - I'd just use uint128_t instead of double. Perhaps I'll throw it in while I amend something else.

lookup includes modulus operations that might be expensive.

A way to speed them up would be to precompute (at runtime, but out of loop - perhaps on host) a multiplicative reciprocal, then compute the remainder via division-optimized-into-multiplication, multiplication, and subtraction. Here's an example by @ch3root:

unsigned mod3(unsigned x) { return x - (x * 2863311531ull >> 33) * 3; }

That is interesting, provided I can wrap my head around it. On the other hand, my tests with massive numbers of loaded hashes seem to indicate we actually do pretty fine with the bitmaps, contrary to what I thought. Especially with some tweaks I have in mind. More on that later. I seem to be limited by host RAM more than anything else right now (and we already mentioned ways to tackle that).

@solardiz
Member

provided I can wrap my head around it.

Multiplication by the large constant turns the integer into a fixed-point number and at the same time multiplies it by the desired fraction. The constant is that fraction (here 1/3) expressed in the fixed-point notation, where 2^33 represents 1.0. The right shift converts the fixed-point number back to an integer (drops the fractional part).
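For example, with the constant from above: 2863311531 = ceil(2^33 / 3), i.e. roughly 1/3 in 33-bit fixed point. With x = 100: 100 * 2863311531 = 286331153100; shifted right by 33 that is 33 (= 100 / 3); and 100 - 33 * 3 = 1 (= 100 % 3).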

@solardiz
Member

solardiz commented Mar 30, 2023

unsigned mod3(unsigned x) { return x - (x * 2863311531ull >> 33) * 3; }

This can also be done without subtraction:

unsigned mod3(unsigned x) { return (((x * 2863311531ull) & ((1ull << 33) - 1)) * 3) >> 33; }

Instead of taking the integer part of the fixed-point quotient, this takes its fractional part, brings it to the target range for the remainder, and then takes the integer part of that. It's probably almost the same performance - still two multiplies, one shift, and one other cheap operation (subtraction or bitwise-AND).

If we replace 33 with 32 and change the constant accordingly, then maybe the mask can be optimized out (just use the low-half register as input to the second multiplication) and the right shift sometimes too (just use the high-half register). This also works in my testing, but only for x up to 2**31-1:

unsigned mod3(unsigned x) { return (((x * 0x55555556u) & ((1ull << 32) - 1)) * 3) >> 32; }

Edit: here's an updated version that works for the full 32-bit input range:

unsigned mod3(unsigned x) { return ((((x + 1) * 0x55555555u) & ((1ull << 32) - 1)) * 3) >> 32; }

or the same written in a way exposing MAD:

unsigned mod3(unsigned x) { return (((x * 0x55555555u + 0x55555555u) & ((1ull << 32) - 1)) * 3) >> 32; }

On an arch with 32-bit registers, like we effectively have on GPUs, this is probably just two 32x32 to 64 widening multiplies in a row, and no other operations - just using the right registers.

Edit 2: unfortunately, generalizing this approach still fails for me for high inputs:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t j = 1;
	do {
		printf("Testing %u\n", j);
		uint32_t rec = 0xffffffffU / j;
		//uint32_t rec = (1ull << 32) / j;
		//uint32_t rec = ((1ull << 32) + j - 1) / j;
		uint32_t i = 0;
		do {
			uint32_t res = ((i + 1) * rec * (uint64_t)j) >> 32;
			//uint32_t res = (i * rec * (uint64_t)j) >> 32;
			if (res != i % j) {
				printf("%u %% %u = %u got %u\n", i, j, i % j, res);
				break;
			}
		} while (++i);
	} while (++j);

	return 0;
}
Testing 1
Testing 2
2147483648 % 2 = 0 got 1
Testing 3
Testing 4
1073741824 % 4 = 0 got 3
Testing 5
Testing 6
1073741824 % 6 = 4 got 3
Testing 7
1073741824 % 7 = 1 got 0
Testing 8
536870912 % 8 = 0 got 7
Testing 9
1073741824 % 9 = 1 got 0
Testing 10
715827882 % 10 = 2 got 1
Testing 11
1073741824 % 11 = 1 got 0
Testing 12
1073741824 % 12 = 4 got 3
Testing 13
477218588 % 13 = 2 got 1
Testing 14
1073741824 % 14 = 8 got 7
Testing 15
Testing 16
268435456 % 16 = 0 got 15
Testing 17
Testing 18
1073741824 % 18 = 10 got 9
Testing 19
715827882 % 19 = 13 got 12
Testing 20
268435456 % 20 = 16 got 15

The failures with powers of 2 are not a big deal - it's easy to detect powers of 2 and use a bitmask - but the others are a problem.

@solardiz
Member

A way to speed them up would be to precompute (at runtime, but out of loop - perhaps on host) a multiplicative reciprocal

See also code and discussion at P-H-C/phc-winner-argon2#306

@solardiz
Member

@magnumripper Just to be clear, I think you should merge this PR as-is now. Then create issue(s) to keep track of other ideas we discussed. Thanks!

@magnumripper
Member Author

@magnumripper Just to be clear, I think you should merge this PR as-is now. Then create issue(s) to keep track of other ideas we discussed. Thanks!

Yeah, I'll probably merge tomorrow. I just need to analyze a bunch of tests I made (including with hundreds of millions of hashes). No real problems seen, but perhaps one or two bitmap size/level decisions need a little tweaking: the "4-step bitmaps" end up with different requirements and a smaller effective mask now, compared to the 128-bit version, as we only have 64 bits to pick from. It's faster than before anyway, but it can be even better.

That, and I'll rename one variable in the new file opencl_hash_check_64.c. It ended up different from opencl_hash_check_128.c due to code copied from the LM/DES files (bitmap_size vs. bitmap_size_bits), and that makes it unnecessarily hard to compare them side by side. Better to fix that before the new file enters the repo.

@magnumripper
Member Author

magnumripper commented Mar 30, 2023

the hardcoded limit is 512MB bitmap in bytes which means 4 G bits (hence overflowing int32). But what I wrote above is basically correct - 320294464 would end up oversized and current kernel would bug out.
(...)
So the correct kernel limit is 256 MB bitmap, which is 2 Gbits and a mask of 0x7fffffff. Besides, I doubt a bitmap that large will be more effective than just looking up the two integers, but I will test that. After all, such lookup includes modulus operations that might be expensive.

I have since found out that 320M hashes will not hit that 31-bit limit; exactly 691225600 hashes would, if someone tried to fit them. I figured out very trivial changes to bump that limit and allow all 32 bits and 4G (1 GB) bitmaps (basically passing 0xffffffffU to the kernel instead of 0x100000000, which would need UL or overflow). I will include those fixes for NT-opencl in this PR, but the fixes to the other 128-bit formats (which I have ready) will go in a separate PR.
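To put numbers on the overflow (just the arithmetic, not the actual host code):

	uint64_t bitmap_bits = 0x100000000ull;	/* a full 4-Gbit bitmap - doesn't fit in 32 bits */
	uint32_t bitmap_mask = (uint32_t)(bitmap_bits - 1);	/* 0xffffffffU - still fits, so pass the mask instead of the size */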

Not only do we save memory, we can reverse much more as well, and reject
early.  We check the remaining bits in cold host code, for good measure.

Closes openwall#5245
@magnumripper magnumripper merged commit 6ed33a7 into openwall:bleeding-jumbo Mar 31, 2023
@magnumripper magnumripper deleted the nt-opencl-long branch March 31, 2023 11:55
@solardiz
Member

Thank you, @claudioandre-br! I'd say a bigger problem here is all of those speeds are so very low. There's something very wrong going on, both before and after this PR's changes. Per the device info, I'd expect speeds of around 1/4 of Tahiti, so around 2G for the new code, but you were and still are only getting around 300M. Was GWS=512 autodetected? Can you try larger? Can you show -v=5 of the autodetection? Optimal might be GWS=32768.

@claudioandre-br
Member

I removed my previous comments. Correct values are:

From:

C:\Temp\rolling> run\john --test=5 --format=nt-opencl --gws=32768
Device 1: gfx902 [AMD Radeon(TM) Vega 8 Graphics]
Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... LWS=64 GWS=32768 (512 blocks) x2470 DONE
Raw:    1433M c/s real, 34609M c/s virtual

To:

C:\Temp\debug> run\john --test=5 --format=nt-opencl --verb=6
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
Device 1: gfx902 [AMD Radeon(TM) Vega 8 Graphics]
Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... Loaded 44 hashes with 1 different salts to test db from test vectors
TestDB LWS=8 GWS=48 (6 blocks) PASS,
Test mask: ?a?a?l?u?d?d?s
Expecting "no" false positives
44 hashes: bitmap 4x32768 bits, mask 0x7fff, effectively 38 bits (0x3fffffffff), 16 KiB (local)
Offset tbl 112 B, Hash tbl 360 B, Results 532 B, Dupe bmp 8 B, TOTAL on GPU: 17396 B
Internal mask, multiplier: 2470 (target: 2048)
Calculating best GWS for LWS=64; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
Tuning for password length 7
key xfer: 3.680 us, idx xfer: 1.919 us, crypt: 4.018 ms, res xfer: 3.360 us
gws:      1024  627949K c/s   627949471 rounds/s    4.027 ms per crypt_all()!
key xfer: 4.640 us, idx xfer: 2.880 us, crypt: 4.018 ms, res xfer: 3.520 us
gws:      2048    1255M c/s  1255400254 rounds/s    4.029 ms per crypt_all()+
key xfer: 5.440 us, idx xfer: 3.200 us, crypt: 12.256 ms, res xfer: 3.360 us
gws:      4096  824611K c/s   824611050 rounds/s   12.268 ms per crypt_all()
key xfer: 7.200 us, idx xfer: 4.800 us, crypt: 22.505 ms, res xfer: 3.040 us
gws:      8192  898468K c/s   898468970 rounds/s   22.520 ms per crypt_all()
key xfer: 12 us, idx xfer: 6.560 us, crypt: 38.292 ms, res xfer: 3.360 us
gws:     16384    1056M c/s  1056234392 rounds/s   38.313 ms per crypt_all()
key xfer: 20.480 us, idx xfer: 10.720 us, crypt: 65.417 ms, res xfer: 9.440 us
gws:     32768    1236M c/s  1236470119 rounds/s   65.458 ms per crypt_all()
key xfer: 35.040 us, idx xfer: 19.200 us, crypt: 111.924 ms, res xfer: 3.200 us
gws:     65536    1445M c/s  1445536207 rounds/s  111.981 ms per crypt_all()+
key xfer: 67.840 us, idx xfer: 33.920 us, crypt: 252.095 ms (exceeds 200 ms)
key xfer: 21.120 us, idx xfer: 10.880 us, crypt: 61.302 ms, res xfer: 3.040 us
gws:     32768    1319M c/s  1319529112 rounds/s   61.337 ms per crypt_all()-
Calculating best LWS for GWS=65536
Note: Profiling timers seem buggy
Testing LWS=64 GWS=65536 ... 1266874552.150 s
Testing LWS=128 GWS=65536 ... 1266874552.118 s
Testing LWS=256 GWS=65536 ... 1266874552.101 s
Calculating best GWS for LWS=64; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
key xfer: 3.520 us, idx xfer: 1.919 us, crypt: 3.967 ms, res xfer: 3.200 us
gws:       512  318055K c/s   318055611 rounds/s    3.976 ms per crypt_all()!
key xfer: 4 us, idx xfer: 1.760 us, crypt: 4.019 ms, res xfer: 3.040 us
gws:      1024  627949K c/s   627949471 rounds/s    4.027 ms per crypt_all()+
key xfer: 4.320 us, idx xfer: 2.720 us, crypt: 4.019 ms, res xfer: 3.200 us
gws:      2048    1255M c/s  1255450105 rounds/s    4.029 ms per crypt_all()+
key xfer: 5.760 us, idx xfer: 3.360 us, crypt: 10.686 ms, res xfer: 4.160 us
gws:      4096  945525K c/s   945525233 rounds/s   10.700 ms per crypt_all()
key xfer: 22.560 us, idx xfer: 14.080 us, crypt: 23.233 ms, res xfer: 3.360 us
gws:      8192  869395K c/s   869395443 rounds/s   23.273 ms per crypt_all()
key xfer: 12.480 us, idx xfer: 7.200 us, crypt: 36.503 ms, res xfer: 4.320 us
gws:     16384    1107M c/s  1107904719 rounds/s   36.527 ms per crypt_all()
key xfer: 60.480 us, idx xfer: 33.120 us, crypt: 70.694 ms, res xfer: 3.040 us
gws:     32768    1143M c/s  1143324676 rounds/s   70.790 ms per crypt_all()
key xfer: 35.200 us, idx xfer: 18.560 us, crypt: 123.492 ms, res xfer: 3.200 us
gws:     65536    1310M c/s  1310197194 rounds/s  123.549 ms per crypt_all()+
key xfer: 208.160 us, idx xfer: 106.560 us, crypt: 252.464 ms (exceeds 200 ms)
key xfer: 20.160 us, idx xfer: 10.720 us, crypt: 61.423 ms, res xfer: 3.200 us
gws:     32768    1316M c/s  1316952651 rounds/s   61.457 ms per crypt_all()!!
key xfer: 35.840 us, idx xfer: 20.800 us, crypt: 34.542 ms, res xfer: 3.360 us
gws:     16384    1169M c/s  1169511901 rounds/s   34.602 ms per crypt_all()-
TestDB LWS=64 GWS=32768 (512 blocks) x2470 DONE
Raw:    1479M c/s real, 50635M c/s virtual

@solardiz
Member

solardiz commented Apr 2, 2023

precompute (at runtime, but out of loop - perhaps on host) a multiplicative reciprocal

I realized that since we pass the used-to-be-runtime-variable HASH_TABLE_SIZE from host to the OpenCL kernel as a build-time constant, the OpenCL compiler has the opportunity to perform the same optimization on its own - and maybe it already does!
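For example (illustrative only - the variable names are made up, and the exact plumbing in the tree may differ):

	/* host side: bake the runtime value in as a kernel build-time constant */
	snprintf(opts, sizeof(opts), "-D HASH_TABLE_SIZE=%u", hash_table_size);
	clBuildProgram(program, 1, &device, opts, NULL, NULL);

	/* kernel side: the divisor is now a compile-time constant, so the
	   compiler can strength-reduce the modulo into multiply/shift itself */
	uint idx = hash % HASH_TABLE_SIZE;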

magnumripper added a commit to magnumripper/john that referenced this pull request Apr 12, 2023
Support largest bitmaps (0xffffffff) without failing kernel build.
Fix a bug where some workgroup sizes would fail to properly copy bitmap
to local memory.
While at it, for DES/LM OpenCL formats fix a similar copy to local to
avoid modulo arithmetic.

Related to openwall#5246
magnumripper added a commit that referenced this pull request Apr 12, 2023
Support largest bitmaps (0xffffffff) without failing kernel build.
Fix a bug where some workgroup sizes would fail to properly copy bitmap
to local memory.
While at it, for DES/LM OpenCL formats fix a similar copy to local to
avoid modulo arithmetic.

Related to #5246