Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of Cartesian indexing #101

Open
maleadt opened this issue Feb 21, 2023 · 3 comments
Open

Improve performance of Cartesian indexing #101

maleadt opened this issue Feb 21, 2023 · 3 comments
Labels
kernels Things about kernels and how they are compiled. performance Gotta go fast.

Comments

@maleadt
Copy link
Member

maleadt commented Feb 21, 2023

Metal GPUs suffer from the way we encode Cartesian indices, presumably because of the integer division that happens when mapping a linear index to a Cartesian, but there may be other causes. In #100 and JuliaGPU/GPUArrays.jl#454, we worked around some of the more egregious performance issues by putting the indices in the type domain such that are known to LLVM, allowing the back-end compiler to optimize code (again, presumably avoiding the division by a constant integer by mapping them onto a bunch of bit operations).

This isn't ideal because it results in significantly more kernels being compiled. Ideally we figure out a way to better encode Cartesian indices, although it's obviously hard to avoid the integer division at all.

Alternatively, we might want to improve https://github.com/maleadt/StaticCartesian.jl, or something similar, so that we can perform this optimization ourselves instead of relying on the Metal back-end compiler, because relying on such an optimization might be fragile (as observed in JuliaGPU/GPUArrays.jl#454 where we needed additional bounds information for the optimization to trigger).

@maleadt maleadt added the performance Gotta go fast. label Feb 21, 2023
@N5N3
Copy link

N5N3 commented Feb 28, 2023

Does Base.MultiplicativeInverses.SignedMultiplicativeInverse helps here?
If so we can form the vec(::CartesianIndices) on host side and let GPU do magic divrem

@maleadt
Copy link
Member Author

maleadt commented Mar 2, 2023

Does Base.MultiplicativeInverses.SignedMultiplicativeInverse helps here?

I'm not sure, I don't have experience with that tool. Can you elaborate? Your PR looked pretty breaking on non-1.10 CI.

@N5N3
Copy link

N5N3 commented Mar 2, 2023

pretty breaking on non-1.10 CI.
My fault, a careless typo.

My further trial shows that SignedMultiplicativeInverse{Int64} is slow on my 1660.
Perhaps do Int128 Multiplication is not a good idea here.

I tried to force every index (Edit: And the array length) to UInt32, then it has similar runtime performance as static CartesianIndices.
But I guess that not acceptable?

@maleadt maleadt added the kernels Things about kernels and how they are compiled. label May 22, 2023
@maleadt maleadt changed the title Systemetically improve performance of Cartesian indexing Improve performance of Cartesian indexing Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kernels Things about kernels and how they are compiled. performance Gotta go fast.
Projects
None yet
Development

No branches or pull requests

2 participants