rename kernels #332
base: master

Conversation
It works on AMD GPUs, but the CL version is known to have problems and does not work well: it skips addresses. Here is a log from a single GPU, and by the way, someone reached that number of MKey/s on the same GPU :)

xxxxxxxxx address(es) loaded (234MB)

Modded here. This fork is 80% improved in coding style, speed and cross-platform compatibility, and adds almost 50% of @Uzlopak's additions from the past week.
Wow. First of all... I think this is not mergeable as-is, because I removed the CUDA files: I could not guarantee that the CUDA part would still work after my modifications. So probably if we fork again and overwrite the files with the changed ones, it will be an acceptable merge. Secondly: how did you get it to 8014.80 MKey/s? I get 360 MKey/s at best from my Vega 56, and my Vega 56 should be about 250% faster than your Radeon 560. Or is my card maybe stronger than I know, and my overall system too slow (too old a CPU and DDR3 RAM)?
I already sent you a message on Gmail when you started the pull. I have kept -r and most of its features, and maybe CUDA needs some tweaks too. I have some mods here with your CL files plus more fixes and tweaks; I will push them now without rebasing, as small pushes. Take a look: https://github.com/MarocOS/CleanedBitcrack. Okay, there are still a lot of things that must be fixed and/or merged manually, like error reporting and some other stuff, to port all of your changes. But you don't have to remove CUDA, as it is the one that works perfectly; the CL version always skips a lot of keys.
The optimal parameters that I found are these: blocks = 64, threads = blocks (or double blocks), and pointsPerThread sized so the buffers use 3/4 of the card RAM, or even all of it. That gives the best performance: 8014.80 MKey/s. Keep in mind to treat 4GB as 4000MB, not 4096MB. I was trying an automatic solution based on hardware capabilities, but it does not seem to work well. A rough sketch of the rule is below.
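For illustration, here is a minimal host-side restatement of that tuning rule in C. It is a sketch only: the struct, function name and the 64-bytes-per-point figure are assumptions of this example, not BitCrack internals.

```c
#include <stdio.h>

/* Hypothetical restatement of the rule above: blocks = 64, threads =
 * blocks or 2*blocks, and pointsPerThread sized so the point buffers
 * fill ~3/4 of card RAM (treating 4GB as 4000MB, i.e. MB = 10^6 bytes). */
typedef struct {
    int blocks;
    int threads;
    int points_per_thread;
} KernelParams;

static KernelParams pick_params(long card_ram_mb, long bytes_per_point)
{
    KernelParams p;
    p.blocks  = 64;
    p.threads = 2 * p.blocks;                        /* "blocks or double blocks" */

    long budget = (card_ram_mb * 3 / 4) * 1000000L;  /* 3/4 of card RAM, in bytes */
    long total_threads = (long)p.blocks * p.threads;
    p.points_per_thread = (int)(budget / (total_threads * bytes_per_point));
    return p;
}

int main(void)
{
    /* e.g. a 4GB card, assuming ~64 bytes per stored point (x and y, 8 words each) */
    KernelParams p = pick_params(4000, 64);
    printf("blocks=%d threads=%d pointsPerThread=%d\n",
           p.blocks, p.threads, p.points_per_thread);
    return 0;
}
```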
My system is a …
Hi @Maroc-OS, I did not get any e-mail. I checked my inbox, but nothing :/. Maybe you can send it again? [email protected] .

Btw, I deleted _stepKernelWithDouble in my branch. In hindsight that was maybe wrong, so don't remove it from your branch.

I personally think I have maxed out the possibilities from my side. Maybe some dynamic parallelism in the ripemd160 hash, because you can do the rounds in parallel; that is why I prepared it as two separate functions. I had hired a dev on Fiverr, because I had no time to figure out how to do the dynamic parallelism implementation, but he never delivered.

Other than that, we will not get further performance gains without more math. E.g. invModP could be improved by using the extended Euclidean algorithm to get the inverse in a variable-time (= faster) manner; the Fermat's-little-theorem solution takes a fixed number of steps and is used by the secp256k1 libraries to ensure there is no side-channel attack. invModP also uses something like 256 multiplications, so it is the biggest bottleneck in the whole algorithm, as it is called n times. But how to implement it with the extended Euclidean algorithm? I don't know (a sketch follows below).

Maybe multiplication is also slow. Beyond that, I suppose there is still some speed to gain by turning the global variables into private ones. If I understand __constant correctly, it is an alias for global memory. So potentially, by using the constants directly per #define and creating e.g. a sub256kP method that uses P_7, P_6, ... literally instead of reading memory, we could cut the global-memory lookups and speed up the whole calculation significantly.

I also suspect that I "improved" greaterOrEqualToP wrongly. Since P = 0xffffffff...fffffffe fffffc2f, the two low words have to be compared together; it should of course be:

#define greaterOrEqualToP(a) \
( \
    (a[0] == 0xffffffff) && \
    (a[1] == 0xffffffff) && \
    (a[2] == 0xffffffff) && \
    (a[3] == 0xffffffff) && \
    (a[4] == 0xffffffff) && \
    (a[5] == 0xffffffff) && \
    ((a[6] > 0xfffffffe) || ((a[6] == 0xfffffffe) && (a[7] >= 0xfffffc2f))) \
)
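On the extended Euclidean question above, a minimal variable-time sketch could look like the following. It uses 64-bit scalars for clarity; the real kernel would need the same logic over the 8×32-bit word arrays, and the name inv_mod is mine, not from the .cl files.

```c
#include <stdint.h>

/* Variable-time modular inverse via the extended Euclidean algorithm.
 * Returns x with (a * x) % m == 1, assuming 0 < a and gcd(a, m) == 1.
 * Unlike the Fermat route (~256 multiplications), the step count here
 * depends on the input: that is exactly why it is faster on average
 * but not side-channel safe. */
int64_t inv_mod(int64_t a, int64_t m)
{
    int64_t old_r = a % m, r = m;   /* remainder sequence */
    int64_t old_s = 1, s = 0;       /* Bezout coefficient of a */

    while (r != 0) {
        int64_t q = old_r / r;
        int64_t t;
        t = old_r - q * r; old_r = r; r = t;
        t = old_s - q * s; old_s = s; s = t;
    }
    /* old_s may be negative; normalize into [0, m) */
    return ((old_s % m) + m) % m;
}
```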
I also implemented a generatePublicKey function; you can see it in my "different" branch. To test it, run with -b 1 -p 1 or it will not work. It is super slow, as it uses the slow inverseMod and mulModP implementations. I renamed some functions, so it probably will not work by transplanting it directly into bitcoin.cl or secp256k1.cl. Maybe it is useful anyway, as it could make BitCrack more modular and a starting point for future modifications and new products.

My actual assumption is that we currently use multiple GB of global memory on the GPU, so the biggest bottleneck is the constant reading and writing of global memory. To gain speed it would be necessary to reduce the memory lookups. Calculating the first public key is very expensive, but after that we could do the point additions, which we already do anyway when calling those batch functions, on the fly. Tbh, I did not figure out how the original BitCrack keeps the inverse in those batch methods (a sketch of that idea follows below). If we could keep that smart inverse trick, we could avoid the costly inversions: we would calculate the public key once at the beginning of the stepKernel and then compute all following pubkeys on the fly. Instead of doing just 4000 points per thread, we could then crank it up to e.g. 65536 keys per thread.

I would be glad if someone found a way to improve the generatePublicKey function, e.g. by using Jacobian points, so you only have to do one invModP at the end and not all the time.
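On how the original keeps the inverse in the batch methods: one well-known candidate is Montgomery's batch-inversion trick, which turns n inversions into one inversion plus 3(n−1) multiplications. This is a guess at the technique, not a reading of BitCrack's source; the sketch below is toy 64-bit arithmetic reusing the inv_mod from the previous sketch, and assumes m is small enough that products fit in an int64_t.

```c
#include <stdint.h>

int64_t inv_mod(int64_t a, int64_t m);  /* extended-Euclidean sketch above */

/* Montgomery's batch-inversion trick: invert n nonzero field elements
 * in place with a single modular inversion plus 3*(n-1) multiplications. */
void batch_inverse(int64_t *x, int64_t *scratch, int n, int64_t m)
{
    /* forward pass: scratch[i] = x[0] * x[1] * ... * x[i] (mod m) */
    scratch[0] = x[0] % m;
    for (int i = 1; i < n; i++)
        scratch[i] = (scratch[i - 1] * x[i]) % m;

    /* the only expensive step: invert the total product once */
    int64_t inv = inv_mod(scratch[n - 1], m);

    /* backward pass: peel off one factor per element */
    for (int i = n - 1; i > 0; i--) {
        int64_t xi = x[i] % m;
        x[i] = (inv * scratch[i - 1]) % m;  /* = 1 / x[i] */
        inv  = (inv * xi) % m;              /* = 1 / (x[0] * ... * x[i-1]) */
    }
    x[0] = inv;
}
```

Kept as a running value across a whole batch of point additions, this is what would make e.g. 65536 keys per thread affordable: each key then costs a few multiplications instead of a full inversion.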
Hello again,

Yeah, I used the one on your GitHub account, [email protected] :)

Nope, _stepKernelWithDouble is still there; I did not remove it.

The whole set of CL files must get a review; you started that task, and we can make it better together.

We should also take a look at the NVIDIA, AMD and Intel hardware capabilities and implementations. We could get faster CL code by targeting each vendor separately and trying to exploit the hardcoded capabilities of those compute engines, if we can say that term.

As for cranking up the points per thread: it depends on the GPU memory. As you saw in my last example, using 4000 points per thread got the calculations to 6512/4096MB, i.e. more memory in use than the GPU actually has. I don't know how that was possible, but when adding more you get CL_INVALID_VALUE or maybe an allocation failure (CL_MEM_OBJECT_ALLOCATION_FAILURE). Also, I can already say that reads and writes are restricted via the CL_MEM_READ/WRITE flags etc., but your idea can be done if we optimize the code more. I was trying to get AMD CodeXL to help optimize the code, but there is no macOS version of it. A device-limit query sketch follows below.

We could try the Jacobian-points idea if you want. PS: some resources that can help
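On the over-allocation and vendor-targeting points: the device's real limits can be queried up front with clGetDeviceInfo. A small sketch using the standard OpenCL host API (the function name here is hypothetical):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Query the limits that matter for sizing pointsPerThread: requesting a
 * buffer beyond CL_DEVICE_MAX_MEM_ALLOC_SIZE is what triggers errors such
 * as CL_INVALID_BUFFER_SIZE or CL_MEM_OBJECT_ALLOCATION_FAILURE.
 * CL_DEVICE_VENDOR is also the natural hook for per-vendor CL kernels. */
static void print_mem_limits(cl_device_id dev)
{
    cl_ulong global_mem = 0, max_alloc = 0;
    char vendor[256] = {0};

    clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_VENDOR, sizeof(vendor), vendor, NULL);

    printf("%s: %llu MB global memory, %llu MB max single allocation\n",
           vendor,
           (unsigned long long)(global_mem >> 20),
           (unsigned long long)(max_alloc >> 20));
}
```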
Hello,
Thank you so much. I'll try it and report my results to you tomorrow. :D