Metal GPUs suffer from the way we encode Cartesian indices, presumably because of the integer division that happens when mapping a linear index to a Cartesian one, though there may be other causes. In #100 and JuliaGPU/GPUArrays.jl#454, we worked around some of the more egregious performance issues by putting the indices in the type domain so that they are known to LLVM, allowing the back-end compiler to optimize the code (again, presumably avoiding the division by a constant integer by mapping it onto a bunch of bit operations).
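To illustrate what the two encodings boil down to for a 3D array (a minimal sketch with hypothetical names, not the actual GPUArrays.jl code):

```julia
# Dynamic dims: each divrem is a real hardware integer division,
# which is what hurts on Metal.
@inline function lin2cart(i::Int, dims::NTuple{3,Int})
    i0 = i - 1
    q1, r1 = divrem(i0, dims[1])
    q2, r2 = divrem(q1, dims[2])
    return CartesianIndex(r1 + 1, r2 + 1, q2 + 1)
end

# Type-domain dims: moving the sizes into the type (here via Val) makes
# each divisor a compile-time constant, so the back-end compiler can
# strength-reduce the divisions into multiplies and shifts.
@inline function lin2cart(i::Int, ::Val{dims}) where {dims}
    i0 = i - 1
    q1, r1 = divrem(i0, dims[1])
    q2, r2 = divrem(q1, dims[2])
    return CartesianIndex(r1 + 1, r2 + 1, q2 + 1)
end
```

The catch is that the `Val` version specializes on every array size, which is exactly the kernel-compilation explosion described below.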
This isn't ideal because it results in significantly more kernels being compiled. Ideally we'd figure out a better way to encode Cartesian indices, although it's obviously hard to avoid the integer division entirely.
Alternatively, we might want to improve https://github.com/maleadt/StaticCartesian.jl, or something similar, so that we can perform this optimization ourselves instead of relying on the Metal back-end compiler; relying on such an optimization can be fragile (as observed in JuliaGPU/GPUArrays.jl#454, where additional bounds information was needed for the optimization to trigger).
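For reference, the back-end optimization we're relying on can be demonstrated on the host: LLVM strength-reduces division by a compile-time constant into a high-half multiply plus shifts, but has to emit a real `sdiv` when the divisor is a runtime value (and, as JuliaGPU/GPUArrays.jl#454 showed, may need value-range information to fire at all):

```julia
julia> f(x) = x ÷ 7;

julia> @code_llvm debuginfo=:none f(42)
# emits a multiply-high + shift sequence; no `sdiv` instruction

julia> g(x, d) = x ÷ d;

julia> @code_llvm debuginfo=:none g(42, 7)
# emits an actual `sdiv i64` instruction
```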
Does Base.MultiplicativeInverses.SignedMultiplicativeInverse help here?
If so, we could construct the vec(::CartesianIndices) on the host side and let the GPU do the magic divrem.
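A minimal host-side sketch of the idea (2D only; the helper name is hypothetical):

```julia
using Base.MultiplicativeInverses: SignedMultiplicativeInverse

# Host side: precompute one multiplicative inverse per dimension.
dims = (25, 16)
invs = map(SignedMultiplicativeInverse, dims)

# Device side (conceptually): divrem against a precomputed inverse
# lowers to a high-half multiply plus shifts, not a hardware division.
@inline function lin2cart2d(i::Int, invs)
    q, r = divrem(i - 1, invs[1])
    return CartesianIndex(r + 1, q + 1)
end

lin2cart2d(38, invs)  # CartesianIndex(13, 2)
```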
This breaks pretty badly on non-1.10 CI.
My fault, a careless typo.
Further testing shows that SignedMultiplicativeInverse{Int64} is slow on my 1660.
Perhaps the Int128 multiplication it needs (a widened high-half multiply for Int64 divisors) is not a good idea here.
I tried forcing every index (edit: and the array length) to UInt32, and then the runtime performance is similar to that of static CartesianIndices.
But I guess that's not acceptable?
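For concreteness, a sketch of that UInt32 variant (hypothetical helper, 2D only); the catch is that it silently truncates once the linear index no longer fits in 32 bits:

```julia
# Do the divrem in UInt32: avoids both the 64-bit hardware division
# and the Int128 high-half multiply, but is only valid when the
# linear index and the dims fit in 32 bits.
@inline function lin2cart2d_u32(i::Integer, dims::NTuple{2,<:Integer})
    i0 = (i - 1) % UInt32
    q, r = divrem(i0, dims[1] % UInt32)
    return CartesianIndex(Int(r) + 1, Int(q) + 1)
end

lin2cart2d_u32(38, (25, 16))  # CartesianIndex(13, 2)
```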