
Auto-tuning workgroupsize when localmem consumption depends on it #215

Open · tkf opened this issue Feb 21, 2021 · 7 comments

@tkf (Collaborator) commented Feb 21, 2021

Does KernelAbstractions.jl support auto-setting the workgroupsize when the kernel's local memory size depends on the groupsize? For example, CUDA.launch_configuration takes a shmem callback that maps a number of threads to the shared memory used. This is used for implementing mapreduce in CUDA.jl. Since the shmem argument of CUDA.launch_configuration is not used in Kernel{CUDADevice}, I guess it's not implemented yet? Is it related to #19?
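For reference, here is roughly how that callback is used on the CUDA.jl side (a minimal sketch with an illustrative kernel, using current CUDA.jl names; this is not the actual mapreduce implementation):

```julia
using CUDA

# Illustrative kernel: dynamic shared memory use is one element per thread,
# so the required bytes depend on the (not yet known) launch size.
function scale_via_shmem!(out, xs)
    shared = CuDynamicSharedArray(eltype(xs), blockDim().x)
    li = threadIdx().x
    gi = (blockIdx().x - 1) * blockDim().x + li
    if gi <= length(xs)
        shared[li] = xs[gi]
        out[gi] = 2 * shared[li]
    end
    return
end

xs = CUDA.rand(Float32, 10_000)
out = similar(xs)
k = @cuda launch=false scale_via_shmem!(out, xs)

# `shmem` can be a callback from thread count to bytes of dynamic shared
# memory, which is exactly what groupsize-dependent localmem needs.
config = launch_configuration(k.fun; shmem = t -> t * sizeof(Float32))
threads = min(length(xs), config.threads)
k(out, xs; threads, blocks = cld(length(xs), threads),
  shmem = threads * sizeof(Float32))
```

The key point is that the shmem callback lets the occupancy calculation account for shared memory that grows with the thread count, rather than a fixed byte count.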

@vchuravy (Member)

This is #11. KA doesn't support dynamic shared memory.
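For contrast, what KA does support is statically sized local memory, where the allocation size is fixed when the kernel is written rather than derived from an auto-tuned groupsize. A minimal sketch, assuming a recent KernelAbstractions version:

```julia
using KernelAbstractions

@kernel function copy_via_localmem!(out, @Const(xs))
    li = @index(Local)
    gi = @index(Global)
    # Static local memory: the size is fixed up front and cannot follow a
    # backend-chosen workgroup size.
    tmp = @localmem eltype(xs) (64,)
    tmp[li] = xs[gi]
    @synchronize
    out[gi] = tmp[li]
end

xs = rand(Float32, 256)
out = similar(xs)
# The workgroup size (64) has to match the hard-coded @localmem size above.
copy_via_localmem!(CPU(), 64)(out, xs; ndrange = length(xs))
KernelAbstractions.synchronize(CPU())
```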

@tkf (Collaborator, Author) commented Feb 21, 2021

Does #11 have auto-tuning? I skimmed the code but couldn't find any. Or is it planned but not implemented?

@vchuravy (Member)

No, #11 was started before we added auto-tuning, and it stalled since no one had a clear need for it.

@tkf (Collaborator, Author) commented Feb 22, 2021

Oh, that sounds like I need to give it a shot if I want it 😂

I'm still not clear on how to implement auto-tuning with #11, though. If I write @dynamic_localmem T (workgroupsize) -> expression_with(T, workgroupsize), I also need a way to compute T from the arguments to the kernel, which can be arbitrarily complex. Since Cassette operates on untyped IR, isn't it impossible to get T given the kernel arguments (types)? Doing this at the macro level is even more hopeless. Also, what about a @dynamic_localmem behind an inlinable function call?

If these concerns are legit, maybe we still need the explicit shmem callback-like approach?
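A small plain-Julia illustration of why the callback approach sidesteps the typed-IR problem (the API shape here is hypothetical, not anything in KA or #11): the caller constructs the callback, so T is captured by a closure at the call site and never has to be recovered from the kernel's untyped IR.

```julia
# Hypothetical shape of a caller-supplied shmem callback: the element type T
# is known where the callback is built, so no IR analysis is required.
shmem_for(::Type{T}) where {T} = workgroupsize -> workgroupsize * sizeof(T)

cb = shmem_for(Float32)  # T captured here, at the call site
cb(256)                  # 256 * 4 = 1024 bytes for a workgroup of 256
```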

@tkf (Collaborator, Author) commented Feb 22, 2021

I'm particularly interested in this use case combined with pre-launch workgroupsize auto-tuning (#216).

@bjarthur (Contributor) commented Jun 7, 2024

> before we added auto-tuning...

Is auto-tuning documented? If so, I can't find it.

@vchuravy (Member) commented Jun 7, 2024

When workgroupsize=nothing, the backend is free to pick a size. Most of the GPU backends have a way to ask for the appropriate size for a compiled kernel (née auto-tuning), and the CPU backend picks 1024.
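For anyone else looking: a minimal sketch of what that looks like in practice, assuming a recent KernelAbstractions version. Leaving the workgroupsize unset when instantiating the kernel is what triggers the backend's choice:

```julia
using KernelAbstractions

@kernel function add_one!(xs)
    i = @index(Global)
    xs[i] += 1
end

xs = zeros(Float32, 10_000)
backend = CPU()  # a GPU backend (e.g. CUDABackend()) works the same way
# No workgroupsize passed when instantiating the kernel, i.e.
# workgroupsize = nothing: the backend is free to choose. GPU backends query
# the compiled kernel for a good size; the CPU backend picks 1024.
add_one!(backend)(xs; ndrange = length(xs))
KernelAbstractions.synchronize(backend)
```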
