Use muladd in matmul and improve operation order. #818
base: master
Conversation
Overall it is an improvement. But if you actually want it to be fast, it needs SIMD intrinsics.
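For concreteness, a minimal sketch of what an explicit-SIMD microkernel could look like with SIMD.jl; the 4×4 shape, the `mul4x4!` name, and plain-`Matrix` storage are my assumptions, not anything in this PR:

```julia
using SIMD

# Hypothetical sketch: a 4×4 Float64 matmul microkernel with explicit
# SIMD vectors. Column j of C is accumulated as a sum of columns of A
# scaled by scalars from B, each step a fused multiply-add.
function mul4x4!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    for j in 1:4
        acc = Vec{4,Float64}(0.0)
        for k in 1:4
            a = vload(Vec{4,Float64}, A, (k - 1) * 4 + 1)   # column k of A
            acc = muladd(a, Vec{4,Float64}(B[k, j]), acc)   # acc += A[:,k] * B[k,j]
        end
        vstore(acc, C, (j - 1) * 4 + 1)                     # column j of C
    end
    return C
end
```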
This partially conflicts with #814. Did you look at my PR? It is more or less complete now; I'm just working on slightly better heuristics for picking the right multiplication method. Now I'm wondering if I should integrate this PR first.
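For illustration, the kind of size heuristic being discussed might look like the sketch below; the cutoffs and the `pick_mul_method` name are placeholders, not #814's actual logic:

```julia
using StaticArrays

# Hypothetical size heuristic for picking a multiplication method.
# The cutoffs are made up for illustration.
function pick_mul_method(sa::Size, sb::Size)
    n = sa[1] * sb[2]            # rough proxy: number of output elements
    if n <= 16
        :mul_unrolled            # fully unrolled, tiny matrices
    elseif n <= 196
        :mul_unrolled_chunks     # chunked unrolling, medium sizes
    else
        :mul_loop                # plain loops, larger matrices
    end
end
```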
I can definitely change my PR to use muladd. In any case I will merge your branch into mine.
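As a sketch of the core idea (hand-written, not the PR's actual generated code): the unrolled expressions accumulate each output element with `muladd` instead of separate `*` and `+`, e.g. for a 2×2 case:

```julia
using StaticArrays

# Hand-written 2×2 illustration of muladd accumulation (hypothetical;
# the real code generates such expressions programmatically).
# Elements are listed in column-major order.
@inline function mul2x2(A::SMatrix{2,2,Float64}, B::SMatrix{2,2,Float64})
    @inbounds SMatrix{2,2,Float64}(
        muladd(A[1,2], B[2,1], A[1,1] * B[1,1]),  # C[1,1]
        muladd(A[2,2], B[2,1], A[2,1] * B[1,1]),  # C[2,1]
        muladd(A[1,2], B[2,2], A[1,1] * B[1,2]),  # C[1,2]
        muladd(A[2,2], B[2,2], A[2,1] * B[1,2]),  # C[2,2]
    )
end

mul2x2(@SMatrix(rand(2,2)), @SMatrix(rand(2,2)))
```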
The reordering was supposed to make things a little more similar to PaddedMatrices, which is far faster at most sizes: https://github.com/chriselrod/PaddedMatrices.jl

But good idea to test the three different implementations.

Skylake-X (AVX512): [benchmark plot: GFLOPS vs matrix dimension]

Worth pointing out that all of these are done on the same computer, by starting Julia with different CPU targets.

Script:

```julia
using StaticArrays, BenchmarkTools, Plots; plotly();

function bench_muls(sizerange)
    res = Matrix{Float64}(undef, length(sizerange), 3)
    for (i, r) ∈ enumerate(sizerange)
        A = @SMatrix rand(r, r)
        B = @SMatrix rand(r, r)
        # The $(Ref(...))[] trick keeps @belapsed from constant-folding
        # the interpolated matrices.
        res[i, 1] = @belapsed StaticArrays.mul_unrolled($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
        res[i, 2] = @belapsed StaticArrays.mul_unrolled_chunks($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
        res[i, 3] = @belapsed StaticArrays.mul_loop($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
    end
    res
end

benchtimes = bench_muls(1:24);

# An n×n matmul is 2n^3 flops; convert elapsed seconds to GFLOPS.
plot(1:24, 2e-9 .* (1:24) .^ 3 ./ benchtimes,
     labels = ["mul_unrolled" "mul_unrolled_chunks" "mul_loop"]);
xlabel!("Matrix Dimensions"); ylabel!("GFLOPS")
```
Hmm... I'll have to re-run my benchmarks with muladd.
Also, worth pointing out that there is a difference between running a recent CPU with a Haswell CPU target and real Haswell. On Skylake, addition's throughput was doubled, so you only have half instead of 1/3. Meaning real Haswell should benefit a lot from muladd.
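A toy way to see the effect (my sketch, not from this PR): the same reduction written with separate multiply/add versus `muladd`. On FMA-capable hardware the `muladd` version can compile to fused multiply-adds:

```julia
using BenchmarkTools

function dot_mul_add(x, y)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]              # separate multiply, then add
    end
    return s
end

function dot_muladd(x, y)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)     # single fused multiply-add
    end
    return s
end

x = rand(1024); y = rand(1024);
@btime dot_mul_add($x, $y);
@btime dot_muladd($x, $y);
```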
Performance should generally be better under this PR for `Float64` inputs given AVX(2). Here, I ran the benchmark in different Julia sessions and then `println`ed the results from one. Benefits are minor with AVX512, but I'll run AVX2 benchmarks too: