
exposing warp-level semantics #420

Open
leios opened this issue Sep 8, 2023 · 12 comments
Comments

@leios
Contributor

leios commented Sep 8, 2023

I had a request from a user to use warp-level semantics from CUDA: sync_warp, warpsize, and the other functions listed here: https://cuda.juliagpu.org/stable/api/kernel/#Warp-level-functions.

They seem to be available here: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html, but I don't know where they exist in AMDGPU.jl or how to use them in KA.

They might be available, but I couldn't find "warp" or "wavefront" or anything else in either the AMDGPU or KernelAbstractions docs. I mean, there was this page: https://amdgpu.juliagpu.org/stable/wavefront_ops/ ... but it's a bit sparse ^^

If this is already available in KA, I'm happy to add a bit to the docs explaining how they are used. If it is not available, I guess I need to put some PRs forward for CUDA(kernels), ROC(kernels), and here with the new syntax.

Related discussion: JuliaMolSim/Molly.jl#147

Putting it here because I think I found kinda what I was looking for for AMDGPU here: https://github.com/JuliaGPU/AMDGPU.jl/blob/master/test/device/wavefront.jl

  • wavefrontsize = warpsize
  • wfred = wavefront reduce
  • wfscan = wavefront scan
  • wfany = ???
  • wfall = ???
  • wfsame = ???
  • ??? = warp_sync
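For concreteness, the CUDA.jl side of that mapping supports the usual shuffle-based warp reduction. A device-side sketch (to be called from inside a kernel on an actual GPU; `shfl_down_sync`, `sync_warp`, and `warpsize` are CUDA.jl's documented warp intrinsics, the helper function itself is illustrative):

```julia
using CUDA

# Illustrative warp-level sum via shuffle-down: each lane starts with
# its own `val`; after the loop, lane 0 holds the warp-wide sum.
# 0xffffffff masks in all 32 lanes.
function warp_reduce_sum(val)
    offset = warpsize() ÷ 2
    while offset > 0
        val += shfl_down_sync(0xffffffff, val, offset)
        offset >>= 1
    end
    return val
end
```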
@vchuravy
Member

vchuravy commented Sep 8, 2023

There currently is no support in KA for wavefront/warp level programming.

Two immediate questions:

  1. What would the semantics be on the CPU level?
  2. Do Intel and Metal also have such primitives?

If the goal is to expose warp level reduce operations, maybe we can get away with defining a workgroup @reduce and leave it up to the backends to implement that reduction efficiently?

x-ref: #419
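One possible lowering for such a workgroup `@reduce` (the macro itself is hypothetical) on backends without warp intrinsics is a plain shared-memory tree reduction, which is already expressible in KernelAbstractions today. A minimal sketch with a fixed workgroup size of 8, so every `@synchronize` stays at top level (which the CPU backend requires); `ndrange` is assumed to be a multiple of 8:

```julia
using KernelAbstractions

# Workgroup-level sum: each workgroup writes one partial sum to `out`.
@kernel function groupreduce_sum!(out, @Const(x))
    i  = @index(Global, Linear)
    li = @index(Local, Linear)

    tmp = @localmem eltype(x) (8,)   # shared scratch, one slot per workitem
    tmp[li] = x[i]
    @synchronize
    li <= 4 && (tmp[li] += tmp[li + 4])
    @synchronize
    li <= 2 && (tmp[li] += tmp[li + 2])
    @synchronize
    li == 1 && (out[@index(Group, Linear)] = tmp[1] + tmp[2])
end
```

Launched e.g. as `groupreduce_sum!(CPU(), 8)(out, x; ndrange=length(x))`; a real implementation would handle arbitrary group sizes and let each backend substitute warp/wavefront intrinsics internally.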

@leios
Contributor Author

leios commented Sep 9, 2023

I'm struggling to find much at all on warp-level semantics for Metal or even oneAPI.

It seems like OpenCL just ignores it(?): https://stackoverflow.com/questions/42259118/is-there-any-guarantee-that-all-of-threads-in-wavefront-opencl-always-synchron

To be honest, I haven't seen an application that really needs warp_sync, and warpsize is often just inferred from the host and passed in.

Here's a question I don't have an answer to: do other (non-NVIDIA) cards even need sync_warps? I think this comes from the fact that the Volta architecture deals with warp divergence differently: https://forums.developer.nvidia.com/t/why-syncwarp-is-necessary-in-undivergent-warp-reduction/209893

Do other architectures (Intel, AMD, Metal), even allow this? I guess they might in the future if they don't already.

That means that, for the short term, we would need CUDA-specific tooling where sync_warps() does what it's supposed to do for CUDA but does nothing for other architectures.
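One way to prototype that short-term shim is plain dispatch with a no-op fallback. A toy sketch (none of these names are KA API; `FakeCUDABackend` is a stand-in for a real backend type):

```julia
# Hypothetical portable shim: sync_warp really syncs only on backends
# that need it and is a no-op everywhere else (CPU, Metal, oneAPI).
sync_warp(backend) = nothing            # generic fallback: do nothing

struct FakeCUDABackend end              # illustrative stand-in type
# On a real CUDA backend this method would call CUDA.sync_warp();
# here it just returns a marker so the dispatch is visible.
sync_warp(::FakeCUDABackend) = :synced
```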

@vchuravy
Member

vchuravy commented Sep 9, 2023

Coming back to my question: What's the reason you want to access this functionality?

Generally speaking I don't think warpsize is something we should expose in KA, but there are of course workgroup operations we are missing. Reduction is the core one.

#421 is introducing the notion of a subgroup, but I want to understand the reasoning behind that better.

Exposing functionality for one backend only has the risk that the user writes a kernel that is actually not portable.

@ToucheSir

Reading through pages such as https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-0/sub-groups-and-simd-vectorization.html and https://intel.github.io/llvm-docs/cuda/opencl-subgroup-vs-cuda-crosslane-op.html, I was under the impression that subgroups do map pretty closely to warps/wavefronts? If that's the case, then having a cross-platform abstraction for working with them seems useful.

@vchuravy
Member

vchuravy commented Sep 9, 2023

KernelAbstractions is not oneAPI, so the meaning of subgroup needs to be defined clearly and independently.

It often comes down to: can we expose these semantics without too much of a performance loss on other hardware? Users are always free to use CUDA.jl directly, but writing a kernel should come with a reasonable expectation of performance across all backends.

KernelAbstractions is a common denominator not a superset of behavior.

@leios
Contributor Author

leios commented Sep 9, 2023

I would guess the outlier here is Metal (and parallel CPU) then? I think AMD (wavefronts), CUDA (warps), and Intel (subgroups) all have some concept of warp-level operations; however, I agree with @vchuravy here. None of the warp-level semantics seem standardized enough to put them into KA at this time.

What is the plan with #421, though? I mean, if it's already introducing a subgroup, I guess we can use that for the other backends?

On my end, I was trying to do a simple port of: JuliaMolSim/Molly.jl#133 so we could completely remove the CUDA backend.

@vchuravy
Member

vchuravy commented Sep 9, 2023

For the CPU I had long hoped to use SIMD.jl or a compiler pass to perform vectorization.

Would a subgroupsize of 1 be legal?

@ToucheSir

KernelAbstractions is not oneAPI, so the meaning of subgroup needs to be defined clearly and independently.

Yes, which is why I found the second link interesting. Digging around a bit more turned up some pages from the SYCL spec (1, 2, 3) which appears to be trying to standardize this. I have no idea how integration on the AMD and Nvidia side works (if at all), but perhaps it could serve as inspiration for creating a common denominator interface in KA.

@leios
Contributor Author

leios commented Sep 10, 2023

It also looks like Vulkan is trying to standardize the terminology: https://www.khronos.org/blog/vulkan-subgroup-tutorial. Their API is supposed to be similar to OpenCL for compute, but I cannot find such topics in OpenCL.

For me, I can obviously see a use for warp reduce, scan, etc. I also find myself wanting to get warp_size a bunch because (in general) AMD has a warpsize of 64 while NVIDIA has a warpsize of 32. For the CPU, a warpsize of 1 sounds correct, right?

It's just that sync_warp would come with caveats:

  1. It probably doesn't work on a Mac
  2. It probably doesn't do anything on parallel CPU
  3. It is probably useless pre-Volta

But I don't think it will produce any wrong results on any of these platforms if there are dummy calls. I mean, the sync_warp functionality was (is) the default for most GPUs. reduce, scan, etc. are all functions that have existed at that level for a while, so... I guess it is fine?
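A toy encoding of the "query on the host, pass into the kernel" pattern with the sizes discussed above (the hard-coded numbers and the `host_warpsize` helper are illustrative; on a real host you would query the device instead, e.g. `warpsize(device())` in CUDA.jl):

```julia
# Pick the warp/wavefront size once on the host and hand it to the
# kernel as an ordinary argument.
function host_warpsize(backend::Symbol)
    backend === :cuda && return 32   # NVIDIA warps
    backend === :amd  && return 64   # GCN/CDNA wavefronts (RDNA uses 32)
    return 1                         # CPU: one "lane" per workitem
end
```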

@ToucheSir

Their API is supposed to be similar to OpenCL for compute, but I cannot find such topics in OpenCL.

It's kind of hidden away and took a while for me to find, but the OpenCL spec does touch on sub-groups (they like the hyphen) in a few places. https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_mapping_work_items_onto_an_ndrange introduces them and subsequent sections like https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#execution-model-sync look relevant. There's also some more info about the actual kernel-level API in the OpenCL C spec, e.g. https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#subgroup-functions.

@simonbyrne
Collaborator

I think having some "subgroup sync" op would be helpful (it could fall back on a full sync if not)

@eschnett
Contributor

I tend to think of "warps on a CPU" as the SIMD vector size. The semantics are quite similar:

  • SIMD instructions execute all or nothing, with individual SIMD lanes possibly disabled (e.g. via masking on AVX-512)
  • all SIMD lanes always execute in sync
  • there are certain special instructions for SIMD-wide reductions (e.g. "horizontal add")

Thus, when using SIMD instructions on a CPU, it can be useful to know the SIMD vector size used in a kernel.

It might also make sense to let the caller specify (statically) the SIMD vector size for a kernel and pass this as optimization hint to the compiler. Alternatively, the compiler could choose a SIMD vector size statically (depending on the target CPU capabilities).

I understand that the match to "warp size" isn't perfect, since the SIMD vector size depends on the element type (64-bit vs 32-bit).
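A toy model of this "warp on a CPU = SIMD vector" correspondence in plain Julia (the width and names are illustrative; the actual SIMD width depends on the target ISA and element type, as noted above):

```julia
# A fixed-width tuple stands in for the SIMD register: the "warp size"
# is the vector width, and a horizontal add reduces across all lanes.
const SIMD_WIDTH = 8                       # e.g. 8 × Float32 in a 256-bit register
lanes = ntuple(i -> Float32(i), SIMD_WIDTH)
horizontal_add(v) = reduce(+, v)           # lane-wise sum, like a "horizontal add"
```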
