Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simple 3- and 4-arg * methods #919

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mcabbott
Copy link
Collaborator

@mcabbott mcabbott commented Jun 7, 2021

This PR repairs a small slowdown due to JuliaLang/julia#37898. When all matrices are the same size, that wastes a few ns thinking about their sizes, while this PR restores the left-to-right behaviour:

julia> s1,s2,s3,s4,s5 = fill(3,5);

julia> a=@SMatrix rand(s1,s2); b=@SMatrix rand(s2,s3); c=@SMatrix rand(s3,s4); d=@SMatrix rand(s4,s5);

julia> @btime *($(Ref(a))[],$(Ref(b))[],$(Ref(c))[],$(Ref(d))[]);  # after 37898, this thinks about the sizes
  9.291 ns (0 allocations: 0 bytes)

julia> @btime (($(Ref(a))[] * $(Ref(b))[]) * $(Ref(c))[]) * $(Ref(d))[];  # simply left to right
  5.208 ns (0 allocations: 0 bytes)

When the matrices are of different sizes, things aren't so clear-cut. Sometimes the re-ordering does speed things up, sometimes it doesn't.

@mateuszbaran
Copy link
Collaborator

I think ideally 3- and 4-arg methods of * would decide during compilation what the best order is, using generated functions and your cost model from that PR. Could you add that or do you need some help?

@@ -15,11 +15,15 @@ import LinearAlgebra: BlasFloat, matprod, mul!
@inline *(A::StaticArray{Tuple{N,1},<:Any,2}, B::Adjoint{<:Any,<:StaticVector}) where {N} = vec(A) * B
@inline *(A::StaticArray{Tuple{N,1},<:Any,2}, B::Transpose{<:Any,<:StaticVector}) where {N} = vec(A) * B

# Avoid LinearAlgebra._quad_matmul's order calculation on equal sizes
@inline *(A::StaticMatrix{N,N}, B::StaticMatrix{N,N}, C::StaticMatrix{N,N}) where {N} = (A*B)*C
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also use StaticMatMulLike here to be a bit more generic, although currently StaticMatMulLike doesn't swap sizes for adjoints and transposes so that would have to be fixed when working on the cost model.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that StaticArrays.StaticMatMulLike{3,10} is a union of three different sizes, is that deliberate or an oversight? Seems like it really ought to be StaticSquareLike{3} for the all the square ones, and StaticMatLike{3,10} allowing non-square but always size(A) == (3,10).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately I ended up not using the size part of StaticMatMulLike so the fact that s1 and s2 don't correspond to sizes is essentially an oversight.

@mateuszbaran
Copy link
Collaborator

BTW, I think you should reduce the cost of multiplying by diagonal matrices (and probably also things like bidiagonal or tridiagonal) in that PR to base.

@mcabbott
Copy link
Collaborator Author

mcabbott commented Jun 8, 2021

The base PR doesn't do anything for special matrices, since they are all square. But the cost of checking their sizes unnecessarily is usually negligible -- only tiny SMatrices are fast enough that it matters, which is the case this PR wants to fix. I agree that things like Diagonal * Diagonal * Diagonal could be further optimised, there's quite a large set of possibilities and I haven't given any thought to how that might work.

I didn't know about this, but StaticArrays.StaticMatMulLike{N,N} looks like a union of static square things which could be used in place of StaticMatrix{N,N} here.

For non-square matrices, I agree you could pull the cost calculation into a generated function. I think I'm going to leave that for another PR, or for someone else if they need it. This one just wants to avoid slowing down a common case.

@dkarrasch
Copy link
Contributor

I'm a bit surprised the cost computation isn't compiled away, actually, given that sizes of SMatrices are known at compile time and that this is all that is required to compute the costs.

@mateuszbaran
Copy link
Collaborator

Yes, right, optimization of computation of products of two small matrices was complicated enough, and developing a reliable set of heuristics for products of more matrices would be even harder.

Anyway, it's indeed interesting that constant propagation didn't manage to simplify size checks here.

@mcabbott
Copy link
Collaborator Author

mcabbott commented Jun 8, 2021

I was also surprised it didn't get compiled away. Maybe it just needs more things inlined, if I try @eval LinearAlgebra @inline function _quad_matmul(A,B,C,D) ... then the time difference in the first message above goes away. Maybe that's a better idea than this PR!

Even with that change, the re-ordering doesn't always speed up SMatrix chains. Here's my test, sometimes one is faster, sometimes the other. Unless the cost calculation is taking as long as the matrix multiplication, it seems that this shows that the optimal order isn't in fact optimal -- so a different cost model would be better for SMatrices?

julia> using StaticArrays, Random

julia> s1,s2,s3,s4,s5 = shuffle([2,3,5,7,11])
5-element Vector{Int64}:
  2
  3
  7
  5
 11

julia> a=@SMatrix rand(s1,s2); b=@SMatrix rand(s2,s3); c=@SMatrix rand(s3,s4); d=@SMatrix rand(s4,s5);

julia> @btime *($(Ref(a))[],$(Ref(b))[],$(Ref(c))[],$(Ref(d))[]);  # after 37898, this thinks about the sizes
  29.648 ns (0 allocations: 0 bytes)

julia> @btime (($(Ref(a))[] * $(Ref(b))[]) * $(Ref(c))[]) * $(Ref(d))[];  # simply left to right
  14.640 ns (0 allocations: 0 bytes)

@mateuszbaran
Copy link
Collaborator

So maybe let's no worry too much now about reordering of multiplication for static matrices? Just eliminating run-time size check would definitely be enough.

if I try @eval LinearAlgebra @inline function _quad_matmul(A,B,C,D) ... then the time difference in the first message above goes away. Maybe that's a better idea than this PR!

That indicates that the overhead may be just from the single additional call, hard to tell without looking at @code_typed. I think _quad_matmul can be safely @inline since it's just called in one place in a non-@inline function.

@mcabbott
Copy link
Collaborator Author

mcabbott commented Jun 8, 2021

The @inline seems safe, but besides solving the above benchmark, I wonder whether it would solve real-world equivalents? Or whether * not inlining would reproduce the problem?

Just eliminating run-time size check

Whether the existing size check is a good idea overall or not will depend on the distribution of non-square cases in "overall". My shuffle([2,3,5,7,11]) above is sort-of designed to produce marginal cases (and I picked a bad run); if we think the real world has more 10x1 matrices etc. then probably the check will pay for itself. (It would of course be easy to remove it entirely, e.g. by overloading _quad_matmul)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants