Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sleep before ReadBuffer to resolve Nvidia's busywait issue #60

Closed
wants to merge 3 commits into from

Conversation

Kubuxu
Copy link

@Kubuxu Kubuxu commented Nov 11, 2016

I did not observe any negative influence on performance with this patch.
The CPU core usage is reduced to about 8% from 100%.

I am getting consistent 38Sols/s on MSI Gaming X 1070 on one instance
while testing. Sols/s on AMD cards (RX480) are not affected.

The clFlush it required as AMD will only start working when there is
blocking operation waiting which is delayed in this case or the queue is flushed.

This is only suggestion and solution that works for me, it isn't perfect. With this silentarmy is the best miner for Nvidia that I know.

Don't feel obligated to accept this PR, you can use parts of it or it in
whole to engineer a solution that you think is the best.

I have tried to sched_yield trick, it seems to stopped working
unfortunately.

Resolves #54

@Kubuxu Kubuxu force-pushed the feat/flush-and-sleep branch 2 times, most recently from 159c3e2 to f29b4e0 Compare November 11, 2016 13:15
@Kubuxu Kubuxu changed the title Sleep before ReadBuffer to resolve Nvidia's busywait issueo Sleep before ReadBuffer to resolve Nvidia's busywait issue Nov 11, 2016
@Kubuxu Kubuxu force-pushed the feat/flush-and-sleep branch from f29b4e0 to 162cdef Compare November 11, 2016 13:24
@Singman33
Copy link

Tested and approved, compatible with eXternal Optimizations.

@Kubuxu Kubuxu force-pushed the feat/flush-and-sleep branch 2 times, most recently from 9bd836b to 6cdf385 Compare November 11, 2016 23:22
@mbevand
Copy link
Owner

mbevand commented Nov 12, 2016

Oh that's a good, working hack. Thanks! I might prefer your idea over krnlx's suggestion from #54

@lhl
Copy link

lhl commented Nov 12, 2016

Just merged my branch back up w/ upstream and also tested this patch, and it seems to work great - freed up a core from 100% to now using about 7% (on an i7-4790k). 970 is mining at about 45 sol/s (unchanged), XC AVX2 -t6 is up from 24sol/s to 26+ sol/s.

Tested also on my 470 dev system, performance is unchanged (~77 sol/s w/ a memstrap 1500 system), CPU < 2%.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 12, 2016

The timing here is not perfect, with two timers, one measuring kernel time and one measuring the time spent in the read it should be possible to reduce the load even further.

The increased CPU load is because the the sleep time will be always 2% less than time required for the kernel to complete.

@montvid
Copy link

montvid commented Nov 12, 2016

Awesome! My weak GeForce GT 740M is making a bit more sols than before. It sure helps for the weak gpu's.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 12, 2016

@mbevand something worth considering is changing the mining pipeline a bit.
Currently main.c does (simplified):

1. Enqueue equihash task
2. Block on ReadBuffer
3. Check the solutions
4. Repeat from one

In this patch I do:

1. Enqueue equihash task
2. Wait rolling average time - 2%, it takes to do one invocation
2. Block on ReadBuffer
3. Check the solutions
4. Repeat from 1.

Something worth considering:

1. Enqueue eqihash task
2. Enqueue non blocking ReadBuffer save the event E, save the estimated invocation finish time T, calculated by rolling average of previous runs
2. Check previous solutions if there are any
3. Sleep '(T - now())*0.75', 0.75 is quite arbitrary here and might have to be adjusted.
4. Check if event E is done; if not, jump to 3
5. Save the solution
6. Repeat from 1.

This is a bit more complex but has two benefits, 1. there is even less downtime (it might turn out that with this running with one instance will be better), 2. the timing adjustment is even better.

@mbevand
Copy link
Owner

mbevand commented Nov 12, 2016

Cool. Apparently zawawa was even able to include and compile your patch on Cygwin/Windows: https://bitcointalk.org/index.php?topic=1666489.msg16854647#msg16854647

I asked him to confirm that clock_gettime(CLOCK_MONOTONIC...) works on Cygwin/Windows.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 12, 2016

It doesn't he replaced it by something else, also Cygwin doesn't have nanosleep.

@mbevand
Copy link
Owner

mbevand commented Nov 12, 2016

I guess we don't have to worry about Windows right now.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 12, 2016

Give me 30min before merging it, I might try something else that will have lower load on CPU.

@mbevand
Copy link
Owner

mbevand commented Nov 12, 2016

Take your time. I will probably need 1-2 days before I merge it.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 12, 2016

Sure.

I did not observe any negative influence on performance with this patch.
The CPU core usage is reduced to about 8% from 100%.

I am getting consistent 38Sols/s on MSI Gaming X 1070 on one instance
while testing. Sols/s on AMD cards (RX480) are not affected.

The clFlush it required as AMD will only start working when there is
blocking operation waiting which is delayed in this case.

This is only suggestion and solution that works for me, it isn't perfect
but it works for me. With this silentarmy is the best miner for Nvidia
that I know.

Don't feel obligated to accept this PR, you can use parts of it or it in
whole to engineer a solution that you think is the best.

I have tried to sched_yield trick, it seems to stopped working
unfortunately.
@Kubuxu Kubuxu force-pushed the feat/flush-and-sleep branch 5 times, most recently from 6601f0b to acb7e15 Compare November 12, 2016 22:46
@Kubuxu Kubuxu force-pushed the feat/flush-and-sleep branch from acb7e15 to c00bab0 Compare November 12, 2016 22:53
@Kubuxu
Copy link
Author

Kubuxu commented Nov 12, 2016

I reworked it. Getting the same hash rates as without the patch, the CPU usage is down to 4% from 100%. Parameters are tweakable.

@mbevand
Copy link
Owner

mbevand commented Nov 12, 2016

Awesome! Have you made sure it doesn't negatively impact AMD GPUs? Should we leave the code as-is or turn it on only when running against an Nvidia GPU?

zawawa replied me saying clock_gettime() does seem to work on Cygwin/Windows.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 12, 2016

I have run tests on AMD too, I couldn't see any difference, the whole thing can be disabled by setting SLEEP_SKIP_RATIO to 1.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 12, 2016

MSI RX480 X 8GB at test run 1000 nonces:
master: Total 1825 solutions in 25147.7 ms (72.6 Sol/s)
with patch: Total 1825 solutions in 25333.1 ms (72.0 Sol/s)
with patch and SLEEP_SKIP_RATIO=1: Total 1825 solutions in 25246.9 ms (72.3 Sol/s)

So the overhead is 0.6Sol/s or (0.3 with a tweak), it will be less on slower GPUs (as invocation time is longer).
This would be solved by my proposal here: #60 (comment) but I won't have time to implement it.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 12, 2016

It might also have no influence at all with two instances running.

@Kubuxu Kubuxu mentioned this pull request Nov 13, 2016
@Kubuxu Kubuxu force-pushed the feat/flush-and-sleep branch from 38648a3 to d41eca4 Compare November 13, 2016 15:32
@krnlx
Copy link

krnlx commented Nov 15, 2016

wait wait, I tested the last solution, it seems a little performance decrease. Celeron 1840, 6x1070, I think I will stay at Kubuxu v1 solution, it is fastest.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 15, 2016

@krnlx try playing with SLEEP_SKIP_RATIO it previously was at 0.02

@mbevand
Copy link
Owner

mbevand commented Nov 15, 2016

Thanks so much, I merged a cleaned-up version of your fix:
a6c3517

Please not that you had 2 clFlush() and only 1 is needed. Verify on Nvidia hardware I didn't screw things up :)

@mbevand mbevand closed this Nov 15, 2016
@Kubuxu
Copy link
Author

Kubuxu commented Nov 15, 2016

I had two flushes as I wanted the GPU to start work ASAP (applies to AMD) but it didn't seem to have any difference IIRC.

@Kubuxu Kubuxu deleted the feat/flush-and-sleep branch November 15, 2016 17:08
@mbevand
Copy link
Owner

mbevand commented Nov 15, 2016

That makes sense. You had a flush before enqueuing kernel_sols and the
non-blocking buffer read, but these 2 calls should take at most tens or
hundreds of usec, so it won't make a difference for a 10-20 msec equihash
run.

On Tue, Nov 15, 2016 at 11:08 AM, Jakub Sztandera [email protected]
wrote:

I had two flushes as I wanted the GPU to start work ASAP (applies to AMD)
but it didn't seem to have any difference IIRC.


You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
#60 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AC20HJxRjtYYomKXB5Q9dcBZ9QsfGV-mks5q-ecegaJpZM4KvvJ7
.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 15, 2016

Also confirming on Nvidia, CPU usage is ~3%.

@Kubuxu
Copy link
Author

Kubuxu commented Nov 15, 2016

Final comparison:
GTX1070: with 69.6Sol/s (4% CPU), without 66.7Sol/s (100% CPU)
RX480: with 72.0 Sol/s, without 72.0Sol/s

Tested with LD_PRELOAD=/usr/lib/libOpenCL.so.1 ./sa-solver --use X --nonces 1000.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants