Poor performance of mapreduce #46

FuZhiyu · 2022-08-24T14:59:35Z

Probably already known to the devs, but just for the record:

using Metal, BenchmarkTools
N = 10_000_000
a = rand(Float32, N)
Ma = MtlArray(a)
@btime sum($a)
# 757.209 μs
@btime sum($Ma)
# 3.173 ms

An in-place operation is even slower:

r = Metal.zeros(Float32, 1)
@btime Metal.@sync sum!($r, $Ma)
# 1.603 s (167108 allocations: 4.20 MiB)

Platform: Mac Studio with Apple M1 Max, v1.8.0.

I just realized I'm on Monterey rather than Ventura. I don't know whether that is the cause of the poor performance. Other matrix operations are pretty fast, though.

mchitre · 2022-11-07T17:19:30Z

Similar results on Ventura as well, so that's not the cause.

maxwindiff · 2023-02-07T07:15:22Z

On my computer:

julia> a = fill(Float32(1.0), 10*1024*1024);
julia> da = MtlArray(a);
julia> @btime sum(a)
  844.500 μs (1 allocation: 16 bytes)
1.048576f7
julia> @btime sum(da)
  2.707 ms (857 allocations: 23.66 KiB)
1.048576f7

Now, if we do this:

diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 1d84d78..900f21d 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -123,7 +123,7 @@ function partial_mapreduce_device(f, op, neutral, maxthreads, Rreduce, Rother, s
             ireduce += localDim_reduce * groupDim_reduce
         end
 
-        val = reduce_group(op, val, neutral, shuffle, maxthreads)
+        val = 1 # reduce_group(op, val, neutral, shuffle, maxthreads)
 
         # write back to memory
         if localIdx_reduce == 1

It still takes 2 ms just to loop over the input/output arrays!

julia> @btime sum(da)
  2.015 ms (857 allocations: 23.66 KiB)
1.0f0

My guess is that the slowdown comes from all the indexing calculations (same as #41). But it's even harder to eliminate the Cartesian indexing here, because the reduction process itself can add additional dimensions...
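To make the indexing cost concrete, here is a minimal CPU-only sketch (no GPU or Metal.jl involved). The function names `sum_linear` and `sum_cartesian` are made up for illustration and are not part of Metal.jl. The second variant converts each linear index into a Cartesian one, which takes integer division and remainder per dimension; that is the kind of per-element work the guess above refers to.

```julia
# CPU-only illustration; hypothetical helper names, not Metal.jl code.

# Sum using plain linear indexing: one integer index per element.
function sum_linear(A::AbstractArray)
    acc = zero(eltype(A))
    for i in 1:length(A)
        acc += A[i]
    end
    return acc
end

# Sum by first converting each linear index to a CartesianIndex.
# The lookup `cis[i]` performs the linear -> N-dimensional conversion,
# which needs integer div/rem operations for every element.
function sum_cartesian(A::AbstractArray)
    cis = CartesianIndices(A)
    acc = zero(eltype(A))
    for i in 1:length(A)
        I = cis[i]
        acc += A[I]
    end
    return acc
end

A = reshape(Float32.(1:24), 2, 3, 4)
```

Both functions return the same value; they differ only in how much index arithmetic is done per element, which tends to matter far more on a GPU than on a CPU.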

maxwindiff · 2023-02-26T07:17:32Z

I tried writing a reduction kernel which only supports 1d arrays, and it's about 4x as fast as the current implementation. I'll try to see if the generic implementation can be further improved.
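For readers unfamiliar with the shape of such a kernel, the following is a rough CPU emulation of a 1-D parallel reduction, not the kernel mentioned above and not Metal.jl code. The name `blocked_reduce` and the `nthreads` parameter are assumptions for illustration: each emulated thread first accumulates a strided slice of the input, then the per-thread partial results are combined in a pairwise tree.

```julia
# CPU emulation of a typical 1-D GPU reduction; hypothetical sketch only.
function blocked_reduce(op, neutral, x::AbstractVector; nthreads::Int = 8)
    @assert ispow2(nthreads)  # the tree phase below assumes a power of two

    # Phase 1: each emulated thread folds a strided slice of the input
    # into its own accumulator, starting from the neutral element.
    partial = fill(neutral, nthreads)
    for t in 1:nthreads
        for i in t:nthreads:length(x)
            partial[t] = op(partial[t], x[i])
        end
    end

    # Phase 2: pairwise tree reduction over the per-thread accumulators,
    # halving the number of active "threads" at each step.
    stride = nthreads ÷ 2
    while stride >= 1
        for t in 1:stride
            partial[t] = op(partial[t], partial[t + stride])
        end
        stride ÷= 2
    end
    return partial[1]
end
```

Because the input is known to be one-dimensional, the only index arithmetic is a strided linear index, with no Cartesian conversion.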

maxwindiff · 2023-03-10T06:37:39Z

Reductions are generally faster now; however, in-place reduction is still very slow:

julia> @btime sum($a)
  760.000 μs (0 allocations: 0 bytes)
5.001241f6

julia> @btime sum($Ma)
  708.083 μs (1197 allocations: 27.76 KiB)
5.001241f6

julia> @btime Metal.@sync sum!($r, $Ma)
  376.325 ms (101199 allocations: 2.00 MiB)
1-element MtlVector{Float32}:
 5.001241f6

maxwindiff · 2023-03-10T07:32:56Z

In-place is slow because it's hitting the init === nothing code path: https://github.com/JuliaGPU/Metal.jl/blob/main/src/mapreduce.jl#L230-L237

If GPUArrays.neutral_element() returned nothing by default, we may be able to do something like:

-Base.mapreducedim!(f, op, R::AnyGPUArray, A::AbstractArray) = mapreducedim!(f, op, R, A)
+Base.mapreducedim!(f, op, R::AnyGPUArray{T}, A::AbstractArray) where {T} =
+  mapreducedim!(f, op, R, A; init=neutral_element(op, T))

With my limited Julia fundamentals knowledge, I don't know how to extend neutral_element without breaking compatibility. Let me try other ways of initializing the partial reduction array...
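As a sketch of the idea only: the helper below, `my_neutral`, is hypothetical and is not `GPUArrays.neutral_element`. It shows how dispatch can supply a neutral element for known operators and fall back to `nothing` otherwise, so that a caller (here the made-up `reduce_with_init`) can pick a fast path whenever an `init` value is available. It runs on the CPU and assumes nothing about Metal.jl internals.

```julia
# Hypothetical stand-in for a neutral-element lookup; not the GPUArrays API.
my_neutral(::typeof(+),   ::Type{T}) where {T} = zero(T)
my_neutral(::typeof(*),   ::Type{T}) where {T} = one(T)
my_neutral(::typeof(max), ::Type{T}) where {T} = typemin(T)
my_neutral(::typeof(min), ::Type{T}) where {T} = typemax(T)
my_neutral(op, ::Type) = nothing  # unknown operator: no neutral element

# Made-up caller that branches on whether a neutral element is known.
function reduce_with_init(op, A::AbstractVector{T}) where {T}
    init = my_neutral(op, T)
    if init === nothing
        # No neutral element known: seed the accumulator from the data itself.
        return foldl(op, A)
    else
        # Neutral element known: the accumulator can be initialized up front.
        return foldl(op, A; init = init)
    end
end
```

The fallback method is the piece that would have to change in the real library, which is where the compatibility concern above comes from.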

maleadt · 2023-11-24T11:02:23Z

Good article: https://betterprogramming.pub/optimizing-parallel-reduction-in-metal-for-apple-m1-8e8677b49b01

rveltz · 2023-11-24T16:25:43Z

It would be good to write a similar blog post using Metal.jl.

maleadt · 2023-12-18T17:33:39Z

Another good source: https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/reduce.metal + https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/reduce.cpp

This was referenced Jan 19, 2023

Add basic SIMD shuffle up/down #73

Merged

Optimize warp reduction for mapreduce #75

Merged

maleadt added the performance Gotta go fast. label Feb 8, 2023

maleadt mentioned this issue Feb 8, 2023

mapreduce has poor performance #87

Closed

maxwindiff mentioned this issue Feb 21, 2023

Improve reduce performance by passing CartesianIndices and length statically #100

Merged

maxwindiff mentioned this issue Mar 4, 2023

Reduce multiple consecutive values in each thread to improve efficiency #112

Merged

maxwindiff mentioned this issue Mar 10, 2023

Faster in-place reduction by using broadcasting to initialize partial… #123

Merged

maleadt changed the title ~~low performance of mapreduce over MtlArray~~ Poor performance of mapreduce May 22, 2023

maleadt added the arrays Things about the array abstraction. label May 22, 2023

maleadt mentioned this issue Jun 22, 2023

mapreduce allocates a lot on the CPU #211

Closed

maleadt mentioned this issue Mar 5, 2024

Minor mapreduce improvements #303

Merged

maleadt closed this as completed in #303 Mar 5, 2024
