Use muladd in matmul and improve operation order. #818
base: master
Conversation
Overall it is an improvement. But if you actually want it to be fast, it needs SIMD intrinsics.
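For concreteness, a minimal sketch of what an explicit-SIMD microkernel could look like with SIMD.jl; the 4×4 shape, the `mul4x4!` name, and plain-`Matrix` storage are my assumptions, not anything in this PR:

```julia
using SIMD

# Hypothetical sketch: a 4×4 Float64 matmul microkernel with explicit
# SIMD vectors. Column j of C is accumulated as a sum of columns of A
# scaled by scalars from B, each step a fused multiply-add.
function mul4x4!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    for j in 1:4
        acc = Vec{4,Float64}(0.0)
        for k in 1:4
            a = vload(Vec{4,Float64}, A, (k - 1) * 4 + 1)   # column k of A
            acc = muladd(a, Vec{4,Float64}(B[k, j]), acc)   # acc += A[:,k] * B[k,j]
        end
        vstore(acc, C, (j - 1) * 4 + 1)                     # column j of C
    end
    return C
end
```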
This partially conflicts with #814. Did you look at my PR? It is more or less complete now; I'm just working on slightly better heuristics for picking the right multiplication method. Now I'm wondering if I should integrate this PR first.
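For illustration, the kind of size heuristic being discussed might look like the sketch below; the cutoffs and the `pick_mul_method` name are placeholders, not #814's actual logic:

```julia
using StaticArrays

# Hypothetical size heuristic for picking a multiplication method.
# The cutoffs are made up for illustration.
function pick_mul_method(sa::Size, sb::Size)
    n = sa[1] * sb[2]            # rough proxy: number of output elements
    if n <= 16
        :mul_unrolled            # fully unrolled, tiny matrices
    elseif n <= 196
        :mul_unrolled_chunks     # chunked unrolling, medium sizes
    else
        :mul_loop                # plain loops, larger matrices
    end
end
```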
I can definitely change my PR to use muladd. In any case I will merge your branch into mine.
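As a sketch of the core idea (hand-written, not the PR's actual generated code): the unrolled expressions accumulate each output element with `muladd` instead of separate `*` and `+`, e.g. for a 2×2 case:

```julia
using StaticArrays

# Hand-written 2×2 illustration of muladd accumulation (hypothetical;
# the real code generates such expressions programmatically).
# Elements are listed in column-major order.
@inline function mul2x2(A::SMatrix{2,2,Float64}, B::SMatrix{2,2,Float64})
    @inbounds SMatrix{2,2,Float64}(
        muladd(A[1,2], B[2,1], A[1,1] * B[1,1]),  # C[1,1]
        muladd(A[2,2], B[2,1], A[2,1] * B[1,1]),  # C[2,1]
        muladd(A[1,2], B[2,2], A[1,1] * B[1,2]),  # C[1,2]
        muladd(A[2,2], B[2,2], A[2,1] * B[1,2]),  # C[2,2]
    )
end

mul2x2(@SMatrix(rand(2,2)), @SMatrix(rand(2,2)))
```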
The reordering was supposed to make things a little more similar to PaddedMatrices, which is far faster at most sizes: https://github.com/chriselrod/PaddedMatrices.jl

But good idea to test the three different implementations.

Skylake-X (AVX512): [benchmark plot: GFLOPS vs matrix dimension]

Worth pointing out that all of these are done on the same computer, by starting Julia with different CPU targets.

Script:

```julia
using StaticArrays, BenchmarkTools, Plots; plotly();

function bench_muls(sizerange)
    res = Matrix{Float64}(undef, length(sizerange), 3)
    for (i, r) ∈ enumerate(sizerange)
        A = @SMatrix rand(r, r)
        B = @SMatrix rand(r, r)
        # The $(Ref(...))[] trick keeps @belapsed from constant-folding
        # the interpolated matrices.
        res[i, 1] = @belapsed StaticArrays.mul_unrolled($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
        res[i, 2] = @belapsed StaticArrays.mul_unrolled_chunks($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
        res[i, 3] = @belapsed StaticArrays.mul_loop($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
    end
    res
end

benchtimes = bench_muls(1:24);

# An n×n matmul is 2n^3 flops; convert elapsed seconds to GFLOPS.
plot(1:24, 2e-9 .* (1:24) .^ 3 ./ benchtimes,
     labels = ["mul_unrolled" "mul_unrolled_chunks" "mul_loop"]);
xlabel!("Matrix Dimensions"); ylabel!("GFLOPS")
```
Hmm... I'll have to re-run my benchmarks with muladd.
Also, worth pointing out that there is a difference between running a recent CPU with a Haswell CPU target and real Haswell. On Skylake, addition's throughput was doubled, so you only have half instead of 1/3. Meaning real Haswell should benefit a lot from muladd.
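A toy way to see the effect (my sketch, not from this PR): the same reduction written with separate multiply/add versus `muladd`. On FMA-capable hardware the `muladd` version can compile to fused multiply-adds:

```julia
using BenchmarkTools

function dot_mul_add(x, y)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]              # separate multiply, then add
    end
    return s
end

function dot_muladd(x, y)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)     # single fused multiply-add
    end
    return s
end

x = rand(1024); y = rand(1024);
@btime dot_mul_add($x, $y);
@btime dot_muladd($x, $y);
```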
Performance should generally be better under this PR for `Float64` inputs given AVX(2). Here, I ran the benchmark in different Julia sessions and then `println`ed the results from one. Benefits are minor with AVX512, but I'll run AVX2 benchmarks too: