
optimize numerator() #14

Open
DeDeRon opened this issue Apr 12, 2024 · 4 comments
DeDeRon commented Apr 12, 2024

Hello deluan,

First, thank you for this very useful library.

I rewrote the numerator() function. Usually one would send a pull request, but I know nothing about git (I'm a Mercurial user), so I'll just paste my version here:

func numerator(img *imageBinaryChannel, template *imageBinaryChannel, offsetX int, offsetY int) float64 {
	imgArray := img.zeroMeanImage
	templateArray := template.zeroMeanImage

	// iidx walks the image and tidx walks the template; idelta is the
	// stride needed to jump iidx to the start of the next image row.
	iidx := offsetY*img.width + offsetX
	idelta := img.width - template.width
	tidx := 0
	sum := 0.0
	for y := 0; y < template.height; y++ {
		for x := 0; x < template.width; x++ {
			sum += imgArray[iidx] * templateArray[tidx]
			iidx++
			tidx++
		}
		iidx += idelta
	}
	return sum
}

Two big changes:

1. I pulled constant computations out of the loop, as they don't change.
2. I swapped the loop order: y is now the outer loop and x the inner loop, which makes the traversal much more cache friendly.

The new code is not only faster but also simpler and easier to understand. On my machine I got a speed improvement of about 20%. I hope you find this useful!

regards,

dederon

deluan (Owner) commented Apr 12, 2024

Thanks for looking into this. I agree your version is easier to read, but I could not reproduce the performance gains in my environment. I modified the BenchmarkNumerator to test both numerator functions, and the results are basically the same (yours is numerator2):

BenchmarkNumerator
BenchmarkNumerator/numerator
BenchmarkNumerator/numerator-8         	 1251135	       938.0 ns/op
BenchmarkNumerator/numerator2
BenchmarkNumerator/numerator2-8        	 1283866	       935.1 ns/op

This is obviously using the benchmark image, not your images. I also don't see how swapping the loops would make any difference, as it would depend on the width being smaller than the height, which we can't control and which is not always the case.

What Go version and architecture are you using? My tests were done on a MacBook M2 with Go 1.22.2.

DeDeRon (Author) commented Apr 15, 2024

Hi deluan,

Thanks for your quick reply. I could not answer earlier, as there were issues with the GitHub login.

First, as requested, your benchmarks on my machine:

goos: linux
goarch: amd64
pkg: github.com/deluan/lookup
cpu: AMD Ryzen 7 7735HS with Radeon Graphics
BenchmarkNewIntegralImage-16             7547535               164.9 ns/op           336 B/op          3 allocs/op
BenchmarkLookupAll-16                        388           3118168 ns/op              48 B/op          2 allocs/op
BenchmarkNumerator-16                    2062993               578.1 ns/op             0 B/op          0 allocs/op
BenchmarkOCR-16                              274           4434511 ns/op          128565 B/op        121 allocs/op
BenchmarkOCRParallel-16                      882           1336423 ns/op          129252 B/op        133 allocs/op
PASS
ok      github.com/deluan/lookup        7.714s

I have three patches in lookup.zip, which build on each other, so they have to be applied in order:

  1. 00_speedup_numerator.patch
     This is the code I posted in my first message.

Benchmark:

BenchmarkNewIntegralImage-16             7562330               165.6 ns/op           336 B/op          3 allocs/op
BenchmarkLookupAll-16                        397           3030824 ns/op              48 B/op          2 allocs/op
BenchmarkNumerator-16                    2132992               556.7 ns/op             0 B/op          0 allocs/op
BenchmarkOCR-16                              292           4101976 ns/op          128508 B/op        121 allocs/op
BenchmarkOCRParallel-16                      975           1165681 ns/op          129243 B/op        133 allocs/op

Compared to the current version on GitHub:

benchmark                        old ns/op     new ns/op     delta
BenchmarkNewIntegralImage-16     165           166           +0.42%
BenchmarkLookupAll-16            3118168       3030824       -2.80%
BenchmarkNumerator-16            578           557           -3.70%
BenchmarkOCR-16                  4434511       4101976       -7.50%
BenchmarkOCRParallel-16          1336423       1165681       -12.78%

So my estimate of a 20% speed improvement was too optimistic.

  2. 01_pullup_channel_check.patch
     I pulled the check for channel compatibility up from lookup() into lookupAll(), because lookup() would otherwise repeat this test for every pixel, which is not necessary. I don't expect a speedup, as the branch predictor will optimize this branch away; what's more important is that lookup() no longer returns an error. This brings me to

  3. 02_concurrent_lookup.patch
     lookup() now does the search concurrently. I wrote a func lookupCol(), which runs in its own goroutine for every image column. As expected, this leads to a tremendous speedup (compared to the GitHub version):

BenchmarkNewIntegralImage-16             7221274               166.4 ns/op           336 B/op          3 allocs/op
BenchmarkLookupAll-16                       3542            352004 ns/op            5889 B/op         73 allocs/op
BenchmarkNumerator-16                    2136513               559.8 ns/op             0 B/op          0 allocs/op
BenchmarkOCR-16                              970           1230779 ns/op          211939 B/op       1140 allocs/op
BenchmarkOCRParallel-16                     2125            576892 ns/op          212741 B/op       1153 allocs/op
benchmark                        old ns/op     new ns/op     delta
BenchmarkNewIntegralImage-16     165           166           +0.91%
BenchmarkLookupAll-16            3118168       352004        -88.71%
BenchmarkNumerator-16            578           560           -3.17%
BenchmarkOCR-16                  4434511       1230779       -72.25%
BenchmarkOCRParallel-16          1336423       576892        -56.83%

The 8-fold speed improvement of LookupAll() reflects the fact that my machine has 8 cores, so it scales linearly. Even the OCR routines benefit from this improvement. What bothers me is that lookupCol() has 9 (!) input parameters, which is awful.

That's all for now.

One last thing: when running go test on the lookup package, there seems to be an error, even though the test suite is flagged as PASS:

......
6 total assertions

..
8 total assertions

..
10 total assertions

........
18 total assertions

....incompatible channels 0 <> 1
.
23 total assertions

.
24 total assertions

........
32 total assertions

PASS
ok      github.com/deluan/lookup        0.032s

regards,

heiko

deluan (Owner) commented Apr 26, 2024

Thanks and sorry for the late reply. I'll take a look at the patches.

Usually one would send a pull request, but I know nothing about git (I'm a Mercurial user).

You could use GitHub's own interface to create a PR. Just browse to the file you want to change and edit it:

[Screenshot 2024-04-26 at 5:39:16 PM]

If you use VSCode or IntelliJ, it should also be very easy to create PRs from their UI.

Anyway, I'll check the patches and see if I can make lookupCol() better (fewer parameters). Thanks!

DeDeRon (Author) commented Apr 29, 2024

Hi deluan,

Thanks for taking the time to educate me!

Please ignore my patch 02; I have something better in mind. That patch makes the code slower when using a small image in a small search window (because of the goroutine/channel overhead). I will report back soon(tm) with a much better approach.

regards,

heiko
