Improve reduce performance by passing CartesianIndices and length statically #100

maxwindiff · 2023-02-21T07:57:00Z

Improve indexing performance by passing CartesianIndices statically, using a trick similar to JuliaGPU/GPUArrays.jl#454. Still slow, but not as bad as before. Helps with #46.
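For readers unfamiliar with the trick, here is a minimal CPU-only sketch of the idea, assuming plain Julia without Metal.jl. `sum_dynamic` and `sum_static` are hypothetical names for illustration, not functions from this PR. The only difference between them is whether the `CartesianIndices` object arrives as a runtime value or as a type parameter via `Val`.

```julia
# Hypothetical illustration only; not the kernel code changed in this PR.

# Dynamic variant: R is an ordinary argument, so its sizes are runtime values
# and the linear-to-Cartesian conversion in R[i] divides by values the
# compiler does not know.
function sum_dynamic(a::AbstractArray, R::CartesianIndices)
    s = zero(eltype(a))
    for i in 1:length(R)
        s += a[R[i]]
    end
    return s
end

# Static variant: R is carried in the type domain through Val{R}, so
# length(R) and the index conversion are compile-time constants within
# this specialization.
function sum_static(a::AbstractArray, ::Val{R}) where {R}
    s = zero(eltype(a))
    for i in 1:length(R)
        s += a[R[i]]
    end
    return s
end

b = fill(1.0f0, 64, 64)
R = CartesianIndices(b)   # isbits, so it can be used as a type parameter

# Both variants compute the same result.
@assert sum_dynamic(b, R) == sum_static(b, Val(R))
```

Note that every distinct `R` yields a new specialization of `sum_static`, so the static variant trades extra compilation for faster indexing.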

Before:

julia> a = fill(Float32(1.0), 4096 * 4096);
julia> da = MtlArray(a);
julia> b = fill(Float32(1.0), 4096, 4096);
julia> db = MtlArray(b);

julia> @btime sum(a)
  1.393 ms (1 allocation: 16 bytes)
1.6777216f7

julia> @btime sum(b)
  1.392 ms (1 allocation: 16 bytes)
1.6777216f7

julia> @btime sum(da)
  4.026 ms (868 allocations: 23.95 KiB)
1.6777216f7

julia> @btime sum(db)
  11.196 ms (873 allocations: 25.23 KiB)
1.6777216f7

After:

julia> @btime sum(da)
  1.811 ms (754 allocations: 20.80 KiB)
1.6777216f7

julia> @btime sum(db)
  2.181 ms (759 allocations: 21.33 KiB)
1.6777216f7

Passing length(Rother) as Rlen may look redundant, but the 2D case (sum(db)) runs 3x slower without it.

julia> @btime sum(db)
  6.648 ms (759 allocations: 21.33 KiB)
1.6777216f7

There were some test failures, but they also happen on main (complaining about a symbol not being found) and seem unrelated to this PR: https://gist.github.com/maxwindiff/fe0480dcfd1bcd4cb28e91f2c1a0cfa6


maleadt · 2023-02-21T08:43:59Z

LGTM, for now at least. This isn't something we want to apply everywhere because of the increased compile times; it would be better to figure out a way to encode dynamic Cartesian indices so that Metal can handle them somewhat performantly.

maleadt · 2023-02-21T08:49:54Z

Did you explore adding back some of the information that gets lost by @inbounds? That improved performance significantly in JuliaGPU/GPUArrays.jl#454.

maxwindiff · 2023-02-22T06:08:39Z

The linear indexing operations at https://github.com/JuliaGPU/Metal.jl/blob/main/src/mapreduce.jl#L105 and https://github.com/JuliaGPU/Metal.jl/blob/main/src/mapreduce.jl#L120 are already guarded by range checks. What other bounds info should I try?

I tried this but there's no improvement:

diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index e878851..29010ae 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -96,6 +96,7 @@ function partial_mapreduce_device(f, op, neutral, maxthreads, ::Val{Rreduce},
     # and possibly groups if it doesn't fit) and other elements (remaining groups)
     localIdx_reduce = thread_position_in_threadgroup_1d()
     localDim_reduce = threads_per_threadgroup_1d()
+    assume(1 <= Rlen)
     groupIdx_reduce, groupIdx_other = fldmod1(threadgroup_position_in_grid_1d(), Rlen)
     groupDim_reduce = threadgroups_per_grid_1d() ÷ Rlen
 
@@ -103,6 +104,7 @@ function partial_mapreduce_device(f, op, neutral, maxthreads, ::Val{Rreduce},
     # (that means we can safely synchronize items within this group)
     iother = groupIdx_other
     @inbounds if iother <= length(Rother)
+        assume(1 <= iother <= length(Rother))
         Iother = Rother[iother]
 
         # load the neutral value
@@ -118,6 +120,7 @@ function partial_mapreduce_device(f, op, neutral, maxthreads, ::Val{Rreduce},
         # reduce serially across chunks of input vector that don't fit in a group
         ireduce = localIdx_reduce + (groupIdx_reduce - 1) * localDim_reduce
         while ireduce <= length(Rreduce)
+            assume(1 <= ireduce <= length(Rreduce))
             Ireduce = Rreduce[ireduce]
             J = max(Iother, Ireduce)
             val = op(val, f(_map_getindex(As, J)...))

maleadt · 2023-02-22T08:00:54Z

Yeah, I guess that covers all of them already.

Improve reduce performance by passing CartesianIndices and length statically

634e5f9

maleadt reviewed Feb 21, 2023


maleadt added the performance label Feb 21, 2023

maleadt mentioned this pull request Feb 21, 2023

Improve performance of Cartesian indexing #101

Open

Remove stray log

2a02d7a

Co-authored-by: Tim Besard <[email protected]>

maleadt merged commit 25a7930 into JuliaGPU:main Feb 22, 2023

maxwindiff deleted the reduce branch February 26, 2023 07:09
