rename kernels #332
base: master

Conversation
It works on AMD GPUs, but the CL version is known to have problems and does not work well: it skips addresses. Here is a log from a single GPU, and by the way, someone reached that number of MKey/s on the same GPU :)

xxxxxxxxx address(es) loaded (234MB)

Modded here. This fork is 80% improved in coding style, speed and cross-platform compatibility, and adds almost 50% of @Uzlopak's additions from the past week.
Wow. First of all... I think this is not mergeable as-is, because I removed the CUDA files: I could not guarantee that the CUDA part would still work after my modifications. So probably if we fork again and overwrite the files with the changed ones, it will be an acceptable merge. Secondly: how did you get it to 8014.80 MKey/s? I get 360 MKey/s at best from my Vega 56, and my Vega 56 should be about 250% faster than your Radeon 560. Or is my card maybe stronger than I know, and my overall system too slow (too old a CPU and DDR3 RAM)?
I already sent you a message on Gmail when you started the pull. I have kept -r and most of its features, and maybe CUDA needs some tweaks too. I have some mods here with your CL files plus more fixes and tweaks; I will push them now without rebasing, as small pushes. Take a look: https://github.com/MarocOS/CleanedBitcrack. Okay, there are still a lot of things that must be fixed and/or merged manually, like error reporting and some other stuff, to port all of your changes. But you don't have to remove CUDA, as it is the one that works perfectly; the CL version always skips a lot of keys.
The optimal parameters that I found are these: blocks = 64, threads = blocks (or double blocks), and pointsPerThread sized so the buffers use 3/4 of the card RAM, or even all of it. That gives the best performance: 8014.80 MKey/s. Keep in mind to treat 4GB as 4000MB, not 4096MB. I was trying an automatic solution based on hardware capabilities, but it does not seem to work well. A rough sketch of the rule is below.
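For illustration, here is a minimal host-side restatement of that tuning rule in C. It is a sketch only: the struct, function name and the 64-bytes-per-point figure are assumptions of this example, not BitCrack internals.

```c
#include <stdio.h>

/* Hypothetical restatement of the rule above: blocks = 64, threads =
 * blocks or 2*blocks, and pointsPerThread sized so the point buffers
 * fill ~3/4 of card RAM (treating 4GB as 4000MB, i.e. MB = 10^6 bytes). */
typedef struct {
    int blocks;
    int threads;
    int points_per_thread;
} KernelParams;

static KernelParams pick_params(long card_ram_mb, long bytes_per_point)
{
    KernelParams p;
    p.blocks  = 64;
    p.threads = 2 * p.blocks;                        /* "blocks or double blocks" */

    long budget = (card_ram_mb * 3 / 4) * 1000000L;  /* 3/4 of card RAM, in bytes */
    long total_threads = (long)p.blocks * p.threads;
    p.points_per_thread = (int)(budget / (total_threads * bytes_per_point));
    return p;
}

int main(void)
{
    /* e.g. a 4GB card, assuming ~64 bytes per stored point (x and y, 8 words each) */
    KernelParams p = pick_params(4000, 64);
    printf("blocks=%d threads=%d pointsPerThread=%d\n",
           p.blocks, p.threads, p.points_per_thread);
    return 0;
}
```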
My system is a …
Hi @Maroc-OS, I did not get any e-mail. I checked my inbox, but nothing :/. Maybe you can send it again? [email protected] .

Btw, I deleted _stepKernelWithDouble in my branch. In hindsight that was maybe wrong, so don't remove it from your branch.

I personally think I have maxed out the possibilities from my side. Maybe some dynamic parallelism in the ripemd160 hash, because you can do the rounds in parallel; that is why I prepared it as two separate functions. I had hired a dev on Fiverr, because I had no time to figure out how to do the dynamic parallelism implementation, but he never delivered.

Other than that, we will not get further performance gains without more math. E.g. invModP could be improved by using the extended Euclidean algorithm to get the inverse in a variable-time (= faster) manner; the Fermat's-little-theorem solution takes a fixed number of steps and is used by the secp256k1 libraries to ensure there is no side-channel attack. invModP also uses something like 256 multiplications, so it is the biggest bottleneck in the whole algorithm, as it is called n times. But how to implement it with the extended Euclidean algorithm? I don't know (a sketch follows below).

Maybe multiplication is also slow. Beyond that, I suppose there is still some speed to gain by turning the global variables into private ones. If I understand __constant correctly, it is an alias for global memory. So potentially, by using the constants directly per #define and creating e.g. a sub256kP method that uses P_7, P_6, ... literally instead of reading memory, we could cut the global-memory lookups and speed up the whole calculation significantly.

I also suspect that I "improved" greaterOrEqualToP wrongly. Since P = 0xffffffff...fffffffe fffffc2f, the two low words have to be compared together; it should of course be:

#define greaterOrEqualToP(a) \
( \
    (a[0] == 0xffffffff) && \
    (a[1] == 0xffffffff) && \
    (a[2] == 0xffffffff) && \
    (a[3] == 0xffffffff) && \
    (a[4] == 0xffffffff) && \
    (a[5] == 0xffffffff) && \
    ((a[6] > 0xfffffffe) || ((a[6] == 0xfffffffe) && (a[7] >= 0xfffffc2f))) \
)
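On the extended Euclidean question above, a minimal variable-time sketch could look like the following. It uses 64-bit scalars for clarity; the real kernel would need the same logic over the 8×32-bit word arrays, and the name inv_mod is mine, not from the .cl files.

```c
#include <stdint.h>

/* Variable-time modular inverse via the extended Euclidean algorithm.
 * Returns x with (a * x) % m == 1, assuming 0 < a and gcd(a, m) == 1.
 * Unlike the Fermat route (~256 multiplications), the step count here
 * depends on the input: that is exactly why it is faster on average
 * but not side-channel safe. */
int64_t inv_mod(int64_t a, int64_t m)
{
    int64_t old_r = a % m, r = m;   /* remainder sequence */
    int64_t old_s = 1, s = 0;       /* Bezout coefficient of a */

    while (r != 0) {
        int64_t q = old_r / r;
        int64_t t;
        t = old_r - q * r; old_r = r; r = t;
        t = old_s - q * s; old_s = s; s = t;
    }
    /* old_s may be negative; normalize into [0, m) */
    return ((old_s % m) + m) % m;
}
```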
I also implemented a generatePublicKey function; you can see it in my "different" branch. To test it, run with -b 1 -p 1 or it will not work. It is super slow, as it uses the slow inverseMod and mulModP implementations. I renamed some functions, so it probably will not work by transplanting it directly into bitcoin.cl or secp256k1.cl. Maybe it is useful anyway, as it could make BitCrack more modular and a starting point for future modifications and new products.

My actual assumption is that we currently use multiple GB of global memory on the GPU, so the biggest bottleneck is the constant reading and writing of global memory. To gain speed it would be necessary to reduce the memory lookups. Calculating the first public key is very expensive, but after that we could do the point additions, which we already do anyway when calling those batch functions, on the fly. Tbh, I did not figure out how the original BitCrack keeps the inverse in those batch methods (a sketch of that idea follows below). If we could keep that smart inverse trick, we could avoid the costly inversions: we would calculate the public key once at the beginning of the stepKernel and then compute all following pubkeys on the fly. Instead of doing just 4000 points per thread, we could then crank it up to e.g. 65536 keys per thread.

I would be glad if someone found a way to improve the generatePublicKey function, e.g. by using Jacobian points, so you only have to do one invModP at the end and not all the time.
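On how the original keeps the inverse in the batch methods: one well-known candidate is Montgomery's batch-inversion trick, which turns n inversions into one inversion plus 3(n−1) multiplications. This is a guess at the technique, not a reading of BitCrack's source; the sketch below is toy 64-bit arithmetic reusing the inv_mod from the previous sketch, and assumes m is small enough that products fit in an int64_t.

```c
#include <stdint.h>

int64_t inv_mod(int64_t a, int64_t m);  /* extended-Euclidean sketch above */

/* Montgomery's batch-inversion trick: invert n nonzero field elements
 * in place with a single modular inversion plus 3*(n-1) multiplications. */
void batch_inverse(int64_t *x, int64_t *scratch, int n, int64_t m)
{
    /* forward pass: scratch[i] = x[0] * x[1] * ... * x[i] (mod m) */
    scratch[0] = x[0] % m;
    for (int i = 1; i < n; i++)
        scratch[i] = (scratch[i - 1] * x[i]) % m;

    /* the only expensive step: invert the total product once */
    int64_t inv = inv_mod(scratch[n - 1], m);

    /* backward pass: peel off one factor per element */
    for (int i = n - 1; i > 0; i--) {
        int64_t xi = x[i] % m;
        x[i] = (inv * scratch[i - 1]) % m;  /* = 1 / x[i] */
        inv  = (inv * xi) % m;              /* = 1 / (x[0] * ... * x[i-1]) */
    }
    x[0] = inv;
}
```

Kept as a running value across a whole batch of point additions, this is what would make e.g. 65536 keys per thread affordable: each key then costs a few multiplications instead of a full inversion.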
Hello again,

Yeah, I used the one on your GitHub account, [email protected] :)

Nope, _stepKernelWithDouble is still there; I did not remove it.

The whole set of CL files must get a review; you started that task, and we can make it better together.

We should also take a look at the NVIDIA, AMD and Intel hardware capabilities and implementations. We could get faster CL code by targeting each vendor separately and trying to exploit the hardcoded capabilities of those compute engines, if we can say that term.

As for cranking up the points per thread: it depends on the GPU memory. As you saw in my last example, using 4000 points per thread got the calculations to 6512/4096MB, i.e. more memory in use than the GPU actually has. I don't know how that was possible, but when adding more you get CL_INVALID_VALUE or maybe an allocation failure (CL_MEM_OBJECT_ALLOCATION_FAILURE). Also, I can already say that reads and writes are restricted via the CL_MEM_READ/WRITE flags etc., but your idea can be done if we optimize the code more. I was trying to get AMD CodeXL to help optimize the code, but there is no macOS version of it. A device-limit query sketch follows below.

We could try the Jacobian-points idea if you want. PS: some resources that can help
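On the over-allocation and vendor-targeting points: the device's real limits can be queried up front with clGetDeviceInfo. A small sketch using the standard OpenCL host API (the function name here is hypothetical):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Query the limits that matter for sizing pointsPerThread: requesting a
 * buffer beyond CL_DEVICE_MAX_MEM_ALLOC_SIZE is what triggers errors such
 * as CL_INVALID_BUFFER_SIZE or CL_MEM_OBJECT_ALLOCATION_FAILURE.
 * CL_DEVICE_VENDOR is also the natural hook for per-vendor CL kernels. */
static void print_mem_limits(cl_device_id dev)
{
    cl_ulong global_mem = 0, max_alloc = 0;
    char vendor[256] = {0};

    clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_VENDOR, sizeof(vendor), vendor, NULL);

    printf("%s: %llu MB global memory, %llu MB max single allocation\n",
           vendor,
           (unsigned long long)(global_mem >> 20),
           (unsigned long long)(max_alloc >> 20));
}
```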
Hello,
Thank you so much. I'll try it and report my results to you tomorrow. :D