mbevand · justvanbloom · Nov 7, 2016 · Nov 7, 2016 · Nov 9, 2016 · Nov 9, 2016
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,8 +1,38 @@
 # Current tip
 
-* Add nicehash compatibility (stratum servers fixing 17 bytes of the nonce)
-* Add nerdralph's optimization (OPTIM_FOR_FGLRX)
+* Implement mining.extranonce.subscribe (kenshirothefist)
+* Optimization: +10% speedup, increase collision items tracked per thread
+  (nerdralph). 'make test' finds 196 sols again.
+
+# Version 5 (11 Nov 2016)
+
+* Optimization: major 2x speedup (eXtremal) by storing 8 atomic counters in
+  1 uint, and by reducing branch divergence when iterating over and XORing Xi's.
+  Note that as a result of these optimizations, sa-solver compiled with
+  NR_ROWS_LOG=20 now only finds 182 out of 196 existing solutions ("make test"
+  verification data was adjusted accordingly)
+* Defaulting OPTIM_SIMPLIFY_ROUND to 1; GPU memory usage down to 0.8 GB per
+  instance
+* Optimization: significantly reduce CPU usage and PCIe bandwidth (before:
+  ~100 MB/s/GPU, after: 0.5 MB/s/GPU), accomplished by filtering invalid
+  solutions on-device
+* Optimization: reduce size of collisions[] array; +7% speed increase measured
+  on RX 480 and R9 Nano using AMDGPU-PRO 16.40
+* Implement stratum method client.reconnect
+* Avoid segfault when encountering an out-of-range input
+* For simplicity `-i <header>` now only accepts 140-byte headers
+* Update README.md with Nvidia performance numbers
+* Fix mining on Xeon Phi and CPUs (fix OpenCL warnings)
+* Fix compilation warnings and 32-bit platforms
+
+# Version 4 (08 Nov 2016)
+
+* Add Nvidia GPU support (fix more unaligned memory accesses)
+* Add nerdralph's optimization (OPTIM_SIMPLIFY_ROUND) for potential +30%
+  speedup, especially useful on Nvidia GPUs
+* Drop the Python 3.5 dependency; now requires only Python 3.3 or above (lhl)
 * Drop the libsodium dependency; instead use our own SHA256 implementation
+* Add nicehash compatibility (stratum servers fixing 17 bytes of the nonce)
 * Only apply set_target to *next* mining job
 * Do not abandon previous mining jobs if clean_jobs is false
 * Fix KeyError's when displaying stats
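
The Version 5 entry above mentions "storing 8 atomic counters in 1 uint". A minimal host-side sketch of that packing idea in C11 follows. It is an illustration only, not the kernel's code: the 4-bit counter width is an assumption (32 bits divided by 8 counters), the names `packed_counter_inc` and `packed_counter_get` are hypothetical, and C11 `atomic_fetch_add` stands in for whatever atomic the OpenCL kernel actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define COUNTERS_PER_WORD 8u  /* from the changelog entry */
#define BITS_PER_COUNTER  4u  /* assumption: 32 bits / 8 counters */
#define COUNTER_MASK      0xFu

/* Atomically increment counter `slot` inside `word` and return its previous
 * value. The caller must keep every counter below 16, otherwise the carry
 * spills into the neighbouring counter. */
static uint32_t packed_counter_inc(_Atomic uint32_t *word, unsigned slot)
{
    assert(slot < COUNTERS_PER_WORD);
    unsigned shift = slot * BITS_PER_COUNTER;
    uint32_t old = atomic_fetch_add(word, (uint32_t)1 << shift);
    return (old >> shift) & COUNTER_MASK;
}

/* Read counter `slot` without modifying the word. */
static uint32_t packed_counter_get(_Atomic uint32_t *word, unsigned slot)
{
    assert(slot < COUNTERS_PER_WORD);
    return (atomic_load(word) >> (slot * BITS_PER_COUNTER)) & COUNTER_MASK;
}
```

One atomic add on a single word replaces what would otherwise be 8 separate counter locations, which is presumably where the memory traffic saving comes from.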

diff --git a/Makefile b/Makefile
@@ -1,19 +1,41 @@
+# Detect OS
+UNAME := $(shell uname)
+ifeq ($(UNAME), Darwin)
+# Mac OS Frameworks
+OPENCL_HEADERS = "/System/Library/Frameworks/OpenCL.framework/Headers/"
+LIBOPENCL = "/System/Library/Frameworks/OpenCL.framework/Versions/Current/Libraries"
+LDLIBS = -framework OpenCL
+# Use a gcc installed with brew or MacPorts, because Xcode's gcc is only a clang wrapper
+CC = gcc-6
+else
 # Change this path if the SDK was installed in a non-standard location
 OPENCL_HEADERS = "/opt/AMDAPPSDK-3.0/include"
 # By default libOpenCL.so is searched for in the default system locations; this
 # path lets you add one more directory to the search path.
 LIBOPENCL = "/opt/amdgpu-pro/lib/x86_64-linux-gnu"
-
+LDLIBS = -lOpenCL
 CC = gcc
-CPPFLAGS = -std=gnu99 -pedantic -Wextra -Wall -ggdb \
+endif
+CPPFLAGS = -I${OPENCL_HEADERS}
+CFLAGS = -O2 -std=gnu99 -pedantic -Wextra -Wall \
     -Wno-deprecated-declarations \
-    -Wno-overlength-strings \
-    -I${OPENCL_HEADERS}
+    -Wno-overlength-strings
 LDFLAGS = -rdynamic -L${LIBOPENCL}
-LDLIBS = -lOpenCL
+
 OBJ = main.o blake.o sha256.o
 INCLUDES = blake.h param.h _kernel.h sha256.h
 
+
+CPPFLAGS = -I${OPENCL_HEADERS}
+CFLAGS = -O2 -std=gnu99 -pedantic -Wextra -Wall -ggdb \
+    -Wno-deprecated-declarations \
+    -Wno-overlength-strings
+LDFLAGS = -rdynamic -L${LIBOPENCL}
+
+OBJ = main.o blake.o sha256.o
+INCLUDES = blake.h param.h _kernel.h sha256.h
+
+
 all : sa-solver
 
 sa-solver : ${OBJ}
@@ -27,13 +49,17 @@ _kernel.h : input.cl param.h
 	echo ')_mrb_";' >>$@
 
 test : sa-solver
-	./sa-solver --nonces 100 -v -v 2>&1 | grep Soln: | \
-	    diff -u testing/sols-100 - | cut -c 1-75
+	@echo Testing...
+	@if res=`./sa-solver --nonces 100 -v -v 2>&1 | grep Soln: | \
+	    diff -u testing/sols-100 -`; then \
+	    echo "Test: success"; \
+	else \
+	    printf '%s\nTest: FAILED\n' "$$res" | cut -c 1-75 >&2; \
+	fi
+#	When compiling with NR_ROWS_LOG != 20, the solver finds a different set of
+#	solutions than those recorded in testing/sols-100, so this test fails.
 
 clean :
 	rm -f sa-solver _kernel.h *.o _temp_*
 
 re : clean all
-
-.cpp.o :
-	${CC} ${CPPFLAGS} -o $@ -c $<
diff --git a/README.md b/README.md
@@ -2,13 +2,9 @@
 
 Official site: https://github.com/mbevand/silentarmy
 
-SILENTARMY is a [Zcash](https://z.cash) miner for Linux written in OpenCL with
-multi-GPU support. The
-[Stratum](https://github.com/str4d/zips/blob/77-zip-stratum/drafts/str4d-stratum/draft1.rst) protocol is implemented for connecting to mining pools. It runs
-best on AMD GPUs but has also been reported to work on other OpenCL devices such
-as Xeon Phi, Intel GPUs, and through OpenCL CPU drivers. (Nvidia GPUs are not
-currently supported due to an
-[issue](https://github.com/mbevand/silentarmy/issues/6).)
+SILENTARMY is a free open source [Zcash](https://z.cash) miner for Linux
+with multi-GPU and [Stratum](https://github.com/str4d/zips/blob/77-zip-stratum/drafts/str4d-stratum/draft1.rst) support. It is written in OpenCL and has been tested
+on AMD/Nvidia/Intel GPUs, Xeon Phi, and more.
 
 After compiling SILENTARMY, list the available OpenCL devices:
 
@@ -80,57 +76,40 @@ quick test/benchmark is simply:
 `$ sa-solver --nonces 100`
 
 Note: due to BLAKE2b optimizations in my implementation, if the header is
-specified it must be 140 bytes and its last 12 bytes **must** be zero. For
-convenience, `-i` can also specify a 108-byte nonceless header to which
-`sa-solver` adds an implicit nonce of 32 zero bytes.
+specified it must be 140 bytes and its last 12 bytes **must** be zero.
 
 Use the verbose (`-v`) and very verbose (`-v -v`) options to show the solutions
 and statistics in progressively more detail.
 
 # Performance
 
-* 47.5 Sol/s with one R9 Nano
-* 45.0 Sol/s with one R9 290X
-* 41.0 Sol/s with one RX 480 8GB
+* 115.0 sol/s with one R9 Nano
+* 75.0 sol/s with one RX 480 8GB
+* (TODO: add Nvidia performance numbers)
 
 Note: the `silentarmy` **miner** automatically achieves this performance level,
 however the `sa-solver` **command-line solver** by design runs only 1 instance
-of the Equihash proof-of-work algorithm causing it to underperform. One must
-manually run 2 instances of `sa-solver` (eg. in 2 terminal consoles) to
-achieve the same performance level as the `silentarmy` **miner**.
-
-Troubleshooting performance issues:
-* By default SILENTARMY mines with only one device/GPU; make sure to specify
-  all the GPUs in the `--use` option, for example `silentarmy --use 0,1,2`
-  if the host has three devices with IDs 0, 1, and 2.
-* If some GPUs have less than ~2.4 GB of GPU memory, run
-  `silentarmy --instances 1` (2 instances use ~2.4 GB of GPU memory,
-  1 instance uses ~1.2 GB of GPU memory.)
-* If you are using an AMD GPU with the **Radeon Software Crimson Edition**
-  driver, as opposed to the **AMDGPU-PRO** driver, then edit param.h and set
-  `OPTIM_FOR_FGLRX` to 1. This will improve performance by +5% and reduce
-  GPU memory usage from 1.2 GB per instance to 805 MB per instance. But do
-  **not** set it if you are using the AMDGPU-PRO driver or else it will
-  degrade performance by -15% or more.
-* If 1 instance still requires too much memory, edit `param.h` and set
-  `NR_ROWS_LOG` to `19` (this reduces the per-instance memory usage to ~670 MB)
-  and run with `--instances 1`.
+of the Equihash proof-of-work algorithm causing it to slightly underperform by
+5-10%. One must manually run 2 instances of `sa-solver` (eg. in 2 terminal
+consoles) to achieve the same performance level as the `silentarmy` **miner**.
 
 # Dependencies
 
-SILENTARMY has primarily been tested with AMD GPUs on 64-bit Linux with
-the **AMDGPU-PRO** driver (amdgpu.ko, for newer GPUs) and the **Radeon Software
-Crimson Edition** driver (fglrx.ko, for older GPUs). Its only build
-dependency is an OpenCL implementation.
+SILENTARMY has only one build dependency: an OpenCL implementation. And it
+has only one runtime dependency: Python 3.3 or later (needed to support the
+use of the `yield from` syntax.)
 
-Installation of the drivers and SDK can be error-prone, so below are
-step-by-step instructions for the AMD OpenCL implementation (**AMD APP SDK**),
-for Ubuntu 16.04 as well as Ubuntu 14.04 (beware: the `silentarmy` miner makes
-use of Python's `ensure_future()` which requires Python 3.4.4, however Ubuntu
-14.04 ships 3.4.3, therefore only the `sa-solver` tool is usable on Ubuntu
-14.04.)
+When running on AMD GPUs, install the **AMD APP SDK** (OpenCL implementation)
+and either:
+* the **AMDGPU-PRO** driver (amdgpu.ko, for newer GPUs), or
+* the **Radeon Software Crimson Edition** driver (fglrx.ko, for older GPUs)
 
-## Ubuntu 16.04
+When running on Nvidia GPUs, install the Nvidia OpenCL development files
+and the Nvidia binary driver.
+
+Instructions are provided below for a few Linux versions.
+
+## Ubuntu 16.04 / amdgpu
 
 1. Download the [AMDGPU-PRO Driver](http://support.amd.com/en-us/kb-articles/Pages/AMDGPU-PRO-Install.aspx)
 (as of 30 Oct 2016, the latest version is 16.40)
@@ -155,17 +134,28 @@ use of Python's `ensure_future()` which requires Python 3.4.4, however Ubuntu
 8. Install system-wide by running as root (accept all the default options):
   `$ sudo ./AMD-APP-SDK-v3.0.130.136-GA-linux64.sh`
 
-9. Install compiler dependencies which you will need to compile SILENTARMY:
+9. Install compiler dependencies in order to compile SILENTARMY:
   `$ sudo apt-get install build-essential`
 
-## Ubuntu 14.04
+## Ubuntu 14.04 / fglrx
 
 1. Install the official Ubuntu package:
    `$ sudo apt-get install fglrx`
    (as of 30 Oct 2016, the latest version is 2:15.201-0ubuntu0.14.04.1)
 
 2. Follow steps 5-9 above.
 
+## Ubuntu 16.04 / Nvidia
+
+1. Install the OpenCL development files and the latest driver:
+   `$ sudo apt-get install nvidia-opencl-dev nvidia-361`
+
+2. Either reboot, or load the kernel driver:
+   `$ sudo modprobe nvidia_361`
+
+3. Install compiler dependencies in order to compile SILENTARMY:
+  `$ sudo apt-get install build-essential`
+
 ## Arch Linux
 
 1. Install the [silentarmy AUR package](https://aur.archlinux.org/packages/silentarmy/).
@@ -177,9 +167,9 @@ Compiling SILENTARMY is easy:
 `$ make`
 
 You may need to specify the paths to the locations of your OpenCL C headers
-and libOpenCL.so if the Makefile does not find them:
+and libOpenCL.so if the compiler does not find them, eg.:
 
-`$ make OPENCL_HEADERS=/path/here LIBOPENCL=/path/there`
+`$ make OPENCL_HEADERS=/usr/local/cuda-8.0/targets/x86_64-linux/include LIBOPENCL=/usr/local/cuda-8.0/targets/x86_64-linux/lib`
 
 Self-testing the command-line solver (solves 100 all-zero 140-byte blocks with
 their nonces varying from 0 to 99):
@@ -244,6 +234,8 @@ almost certainly bits 180-199), this is also discarded as a likely invalid
 solution because this is statistically guaranteed to be all inputs repeated
 at least once. This check is implemented in `kernel_sols()` (see
 `likely_invalids`.)
+* When input references are expanded on-GPU by `expand_refs()`, the code
+checks if the last (512th) input is repeated at least once.
 * Finally when the GPU returns potential solutions, the CPU also checks for
 invalid solutions with duplicate inputs. This check is implemented in
 `verify_sol()`.
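
The last bullet above describes a CPU-side check for duplicate inputs. One simple way such a check can be written is sketched below, assuming 512 input indices per solution (the text refers to the "last (512th) input"). The function name `has_duplicate_inputs` and the sort-then-scan approach are illustrative assumptions; the actual `verify_sol()` may work differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NR_INPUTS 512  /* the README refers to the "last (512th) input" */

/* qsort comparator for uint32_t values, written to avoid overflow. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Returns 1 if any input index appears more than once, 0 otherwise.
 * Works on a sorted copy so the caller's array is left untouched. */
static int has_duplicate_inputs(const uint32_t *inputs)
{
    uint32_t sorted[NR_INPUTS];

    assert(inputs != NULL);
    memcpy(sorted, inputs, sizeof(sorted));
    qsort(sorted, NR_INPUTS, sizeof(sorted[0]), cmp_u32);
    for (size_t i = 1; i < NR_INPUTS; i++)
        if (sorted[i] == sorted[i - 1])
            return 1;
    return 0;
}
```

Sorting a copy keeps the check at O(n log n) and leaves the original input order intact for any later verification step.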
@@ -261,7 +253,12 @@ Donations welcome: t1cVviFvgJinQ4w3C2m2CfRxgP5DnHYaoFC
 
 I would like to thank these persons for their contributions to SILENTARMY,
 in alphabetical order:
+* eXtremal
+* kenshirothefist
+* lhl
 * nerdralph
+* poiuty
+* solardiz
 
 # License
 

diff --git a/TROUBLESHOOTING.md b/TROUBLESHOOTING.md
@@ -0,0 +1,97 @@
+# Troubleshooting
+
+Follow this checklist to verify that your entire hardware and software
+stack works (drivers, OpenCL, SILENTARMY).
+
+## Driver / OpenCL installation
+
+Run `clinfo` to list all the OpenCL devices. If it does not find all your
+devices, something is wrong with your drivers and/or OpenCL stack. Uninstall
+and reinstall your drivers. Here are good instructions:
+https://hashcat.net/wiki/doku.php?id=frequently_asked_questions#i_may_have_the_wrong_driver_installed_what_should_i_do
+
+## Check silentarmy
+
+Does `./silentarmy --list` list your devices? If `clinfo` does, silentarmy
+should list them as well.
+
+## Basic operation
+
+Run the Equihash solver `sa-solver` to solve the all-zero block. It should
+report 2 solutions. Specify the device ID to test with `--use ID`:
+
+```
+$ ./sa-solver --use 0
+Solving default all-zero 140-byte header
+Building program
+Hash tables will use 805.3 MB
+Running...
+Nonce 0000000000000000000000000000000000000000000000000000000000000000: 2 sols
+Total 2 solutions in 205.3 ms (9.7 Sol/s)
+```
+
+Note that `sa-solver` only supports 1 device at a time. It does not accept a
+device list such as `--use 0,1,2`.
+
+## Correct results
+
+Verify that `make test` reports valid Equihash solutions for 100 different
+blocks:
+
+```
+$ make test
+Testing...
+Test: success
+```
+
+If it instead prints `Test: FAILED` preceded by a diff of `Soln:` lines,
+something is wrong with your hardware and/or drivers.
+
+## Sustained operation on one device
+
+Let the Equihash solver `sa-solver` run for multiple hours:
+
+```
+$ ./sa-solver --nonces 100000000
+Solving default all-zero 140-byte header
+Building program
+Hash tables will use 1208.0 MB
+Running...
+Nonce 0000000000000000000000000000000000000000000000000000000000000000: 2 sols
+Nonce 0100000000000000000000000000000000000000000000000000000000000000: 0 sols
+...
+```
+
+It should not crash or hang.
+
+## Mining
+
+Run the miner without options. By default it will use the first device,
+and connect to flypool with my donation address. These known-good parameters
+should let you know easily if your machine can mine properly:
+
+```
+$ ./silentarmy
+Connecting to us1-zcash.flypool.org:3333
+Stratum server sent us the first job
+Mining on 1 device
+Total 0.0 sol/s [dev0 0.0] 0 shares
+Total 48.9 sol/s [dev0 48.9] 1 share
+Total 44.9 sol/s [dev0 44.9] 1 share
+...
+```
+
+Verify that the number of shares increases over time.
+
+## Performance
+
+Not achieving the performance you expected?
+
+* By default SILENTARMY mines with only one device/GPU; make sure to specify
+  all the GPUs in the `--use` option, for example `silentarmy --use 0,1,2`
+  if the host has three devices with IDs 0, 1, and 2.
+* If a GPU has less than 2 GB of GPU memory, run `silentarmy --instances 1`
+  (1 instance uses ~0.8 GB of memory, 2 instances use ~1.6 GB of memory.)
+* If 1 instance still requires too much memory, edit `param.h` and set
+  `NR_ROWS_LOG` to `19` (this reduces the per-instance memory usage to ~670 MB)
+  and run with `--instances 1`.
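
The memory guidance in the Performance checklist above can be summarized as a small decision function. This is a sketch under stated assumptions, not part of SILENTARMY: "2 GB" is taken to mean 2048 MB, the per-instance figure is rounded up from the 805.3 MB shown in the sample output, and `mem_advice_for` plus the `ADVICE_UNSUPPORTED` case are hypothetical names for situations the text does not cover. Real usable GPU memory is typically lower than the advertised total.

```c
#include <assert.h>

/* Approximate figures quoted in TROUBLESHOOTING.md, in MB. */
#define MB_TWO_INSTANCES 2048u /* "less than 2 GB": assumed to mean 2048 MB */
#define MB_ONE_INSTANCE   806u /* ~0.8 GB per instance, rounded up */
#define MB_ROWS_LOG_19    670u /* ~670 MB with NR_ROWS_LOG set to 19 */

static_assert(MB_TWO_INSTANCES > MB_ONE_INSTANCE &&
              MB_ONE_INSTANCE > MB_ROWS_LOG_19,
              "thresholds must be strictly decreasing");

enum mem_advice {
    ADVICE_DEFAULT,        /* enough memory: no extra option needed */
    ADVICE_ONE_INSTANCE,   /* run with --instances 1 */
    ADVICE_NR_ROWS_LOG_19, /* set NR_ROWS_LOG to 19, run with --instances 1 */
    ADVICE_UNSUPPORTED     /* hypothetical: below every figure in the text */
};

/* Map a GPU memory size in MB to the checklist item that applies. */
static enum mem_advice mem_advice_for(unsigned gpu_mem_mb)
{
    if (gpu_mem_mb >= MB_TWO_INSTANCES) return ADVICE_DEFAULT;
    if (gpu_mem_mb >= MB_ONE_INSTANCE)  return ADVICE_ONE_INSTANCE;
    if (gpu_mem_mb >= MB_ROWS_LOG_19)   return ADVICE_NR_ROWS_LOG_19;
    return ADVICE_UNSUPPORTED;
}
```

For example, a 1 GB card falls into the `--instances 1` case, while a 768 MB card would need the `NR_ROWS_LOG` change as well.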