Poor performance of mapreduce #46
Similar results on Ventura as well, so that's not the cause.

---
On my computer:

```julia
julia> a = fill(Float32(1.0), 10*1024*1024);

julia> da = MtlArray(a);

julia> @btime sum(a)
  844.500 μs (1 allocation: 16 bytes)
1.048576f7

julia> @btime sum(da)
  2.707 ms (857 allocations: 23.66 KiB)
1.048576f7
```

Now, if we do this:

```diff
diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 1d84d78..900f21d 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -123,7 +123,7 @@ function partial_mapreduce_device(f, op, neutral, maxthreads, Rreduce, Rother, s
         ireduce += localDim_reduce * groupDim_reduce
     end
-    val = reduce_group(op, val, neutral, shuffle, maxthreads)
+    val = 1 # reduce_group(op, val, neutral, shuffle, maxthreads)
     # write back to memory
     if localIdx_reduce == 1
```

it still takes 2 ms to simply loop over the input/output arrays:

```julia
julia> @btime sum(da)
  2.015 ms (857 allocations: 23.66 KiB)
1.0f0
```

My guess is that the slowdown is from all the indexing calculations (same as #41). But it's even harder to eliminate the cartesian indexing because the reduction process itself can add additional dimensions...
---

I tried writing a reduction kernel which only supports 1d arrays, and it's about 4x as fast as the current implementation. I'll try to see if the generic implementation can be further improved.
---

Reductions are generally faster now; however, in-place is still very slow:

```julia
julia> @btime sum($a)
  760.000 μs (0 allocations: 0 bytes)
5.001241f6

julia> @btime sum($Ma)
  708.083 μs (1197 allocations: 27.76 KiB)
5.001241f6

julia> @btime Metal.@sync sum!($r, $Ma)
  376.325 ms (101199 allocations: 2.00 MiB)
1-element MtlVector{Float32}:
 5.001241f6
```
---

In-place is slow because it's hitting the generic `Base.mapreducedim!` path. If

```diff
-Base.mapreducedim!(f, op, R::AnyGPUArray, A::AbstractArray) = mapreducedim!(f, op, R, A)
+Base.mapreducedim!(f, op, R::AnyGPUArray{T}, A::AbstractArray) where {T} =
+    mapreducedim!(f, op, R, A; init=neutral_element(op, T))
```

With my limited Julia fundamentals knowledge, I don't know how to extend
---

It would be good to write a similar blog using …
---

Probably a known issue by the devs, but just for the record: an in-place operation will yield even slower performance.

Platform: Mac Studio with Apple M1 Max, v1.8.0. I just realized I'm not on Ventura but Monterey; I don't know whether that is the cause of the performance. Other matrix operations are pretty fast, though.