
Auto-tuning workgroupsize when localmem consumption depends on it #215

Open · tkf opened this issue Feb 21, 2021 · 7 comments

@tkf (Collaborator) commented Feb 21, 2021

Does KernelAbstractions.jl support auto-setting the workgroupsize when the kernel's local memory size depends on the groupsize? For example, CUDA.launch_configuration takes a shmem callback that maps a number of threads to the shared memory used. This is used for implementing mapreduce in CUDA.jl. Since the shmem argument of CUDA.launch_configuration is not used in Kernel{CUDADevice}, I guess it's not implemented yet? Is it related to #19?
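For reference, here is roughly how that callback is used on the CUDA.jl side (a minimal sketch with an illustrative kernel, using current CUDA.jl names; this is not the actual mapreduce implementation):

```julia
using CUDA

# Illustrative kernel: dynamic shared memory use is one element per thread,
# so the required bytes depend on the (not yet known) launch size.
function scale_via_shmem!(out, xs)
    shared = CuDynamicSharedArray(eltype(xs), blockDim().x)
    li = threadIdx().x
    gi = (blockIdx().x - 1) * blockDim().x + li
    if gi <= length(xs)
        shared[li] = xs[gi]
        out[gi] = 2 * shared[li]
    end
    return
end

xs = CUDA.rand(Float32, 10_000)
out = similar(xs)
k = @cuda launch=false scale_via_shmem!(out, xs)

# `shmem` can be a callback from thread count to bytes of dynamic shared
# memory, which is exactly what groupsize-dependent localmem needs.
config = launch_configuration(k.fun; shmem = t -> t * sizeof(Float32))
threads = min(length(xs), config.threads)
k(out, xs; threads, blocks = cld(length(xs), threads),
  shmem = threads * sizeof(Float32))
```

The key point is that the shmem callback lets the occupancy calculation account for shared memory that grows with the thread count, rather than a fixed byte count.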

@vchuravy (Member)

This is #11. KA doesn't support dynamic shared memory.
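For contrast, what KA does support is statically sized local memory, where the allocation size is fixed when the kernel is written rather than derived from an auto-tuned groupsize. A minimal sketch, assuming a recent KernelAbstractions version:

```julia
using KernelAbstractions

@kernel function copy_via_localmem!(out, @Const(xs))
    li = @index(Local)
    gi = @index(Global)
    # Static local memory: the size is fixed up front and cannot follow a
    # backend-chosen workgroup size.
    tmp = @localmem eltype(xs) (64,)
    tmp[li] = xs[gi]
    @synchronize
    out[gi] = tmp[li]
end

xs = rand(Float32, 256)
out = similar(xs)
# The workgroup size (64) has to match the hard-coded @localmem size above.
copy_via_localmem!(CPU(), 64)(out, xs; ndrange = length(xs))
KernelAbstractions.synchronize(CPU())
```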

@tkf (Collaborator, Author) commented Feb 21, 2021

Does #11 have auto-tuning? I skimmed the code but couldn't find any. Or is it planned but not implemented?

@vchuravy (Member)

No, #11 was started before we added auto-tuning, and it stalled since no one had a clear need for it.

@tkf (Collaborator, Author) commented Feb 22, 2021

Oh, that sounds like I need to give it a shot if I want it 😂

I'm still not clear on how to implement auto-tuning with #11, though. If I write @dynamic_localmem T (workgroupsize) -> expression_with(T, workgroupsize), I also need a way to compute T from the arguments to the kernel, which can be arbitrarily complex. Since Cassette operates on untyped IR, isn't it impossible to get T given the kernel arguments (types)? Doing this at the macro level is even more hopeless. Also, what about a @dynamic_localmem behind an inlinable function call?

If these concerns are legit, maybe we still need the explicit shmem callback-like approach?
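A small plain-Julia illustration of why the callback approach sidesteps the typed-IR problem (the API shape here is hypothetical, not anything in KA or #11): the caller constructs the callback, so T is captured by a closure at the call site and never has to be recovered from the kernel's untyped IR.

```julia
# Hypothetical shape of a caller-supplied shmem callback: the element type T
# is known where the callback is built, so no IR analysis is required.
shmem_for(::Type{T}) where {T} = workgroupsize -> workgroupsize * sizeof(T)

cb = shmem_for(Float32)  # T captured here, at the call site
cb(256)                  # 256 * 4 = 1024 bytes for a workgroup of 256
```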

@tkf (Collaborator, Author) commented Feb 22, 2021

I'm particularly interested in this use case combined with pre-launch workgroupsize auto-tuning (#216).

@bjarthur (Contributor) commented Jun 7, 2024

> before we added auto-tuning...

Is auto-tuning documented? If so, I can't find it.

@vchuravy (Member) commented Jun 7, 2024

When workgroupsize=nothing, the backend is free to pick a size. Most of the GPU backends have a way to ask for the appropriate size for a compiled kernel (née auto-tuning), and the CPU backend picks 1024.
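For anyone else looking: a minimal sketch of what that looks like in practice, assuming a recent KernelAbstractions version. Leaving the workgroupsize unset when instantiating the kernel is what triggers the backend's choice:

```julia
using KernelAbstractions

@kernel function add_one!(xs)
    i = @index(Global)
    xs[i] += 1
end

xs = zeros(Float32, 10_000)
backend = CPU()  # a GPU backend (e.g. CUDABackend()) works the same way
# No workgroupsize passed when instantiating the kernel, i.e.
# workgroupsize = nothing: the backend is free to choose. GPU backends query
# the compiled kernel for a good size; the CPU backend picks 1024.
add_one!(backend)(xs; ndrange = length(xs))
KernelAbstractions.synchronize(backend)
```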
