Clarify semantics of urKernelSuggestMaxCooperativeGroupCountExp
#1687
For Intel devices they seem to call SMs "Xe cores" or "sub-slices": https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2024-1/gpu-offload-flow.html
@JackAKirk, thanks for pointing this out. I agree that what we have now is confusing. For CUDA, the documentation says the following:
"The maximum number of groups depends on the maximum number of groups per multiprocessor and the number of multiprocessors."
I think it's possible to implement
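As a rough sketch of the multiplication that documentation describes (not the adapter's actual code; myCoopKernel, blockSize, and dynSmem are hypothetical placeholders):

```cpp
// Minimal sketch: device-wide group count = per-SM occupancy * SM count.
#include <cuda_runtime.h>

__global__ void myCoopKernel() { /* cooperative kernel body */ }

int maxCooperativeGroups(int device, int blockSize, size_t dynSmem) {
    cudaSetDevice(device);  // the occupancy query applies to the current device

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    int blocksPerSM = 0;
    // Per-SM occupancy for this kernel and launch configuration.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myCoopKernel, blockSize, dynSmem);

    // Device-wide maximum = per-SM maximum * number of SMs.
    return blocksPerSM * prop.multiProcessorCount;
}
```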
I understood that both query results are device-dependent. The call to
I don't think these are the intended semantics. I'm not sure if L0 has any query for getting the occupancy information at that granularity.
Sure, we could work out the device that was last used from the CUcontext that is currently set, but is this really the semantics of the query? This assumes that the device the user wants to execute the kernel on is the last-set CUcontext/CUdevice, but the user could (and it seems reasonable to expect that they generally will) choose to execute the kernel on a different device. It isn't how the corresponding query from the CUDA runtime would be used. If you really want to, we could implement it as you suggest, but you'd have to make it clear that the query will only make sense if the user ensures that their program has executed such that the last-set CUcontext/CUdevice corresponds to the SYCL device on which they actually want to execute the kernel. They would then have to make sure that they perform some kind of GPU operation immediately before this that uses the desired GPU for the device-wide kernel sync. This seems extremely awkward and undesirable to me.
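A sketch of the inference being questioned here, assuming the CUDA driver API: without an explicit device parameter, the only device visible to the adapter is whatever the current CUcontext points at, i.e. the last-set one.

```cpp
#include <cuda.h>

CUresult smCountOfCurrentContextDevice(int* smCount) {
    CUdevice dev;
    // Device of the *current* context -- not necessarily the device the
    // user intends to launch the cooperative kernel on.
    CUresult res = cuCtxGetDevice(&dev);
    if (res != CUDA_SUCCESS)
        return res;
    return cuDeviceGetAttribute(
        smCount, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
}
```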
This documentation might be useful to you:
How does the
I agree that this makes more sense, but I didn't think that's how
I don't have a strong preference here. I agree that using the device from the current context is awkward for the user, but I'm not sure how to fix that. If
Good point: apparently it also uses the last-set device: https://stackoverflow.com/questions/68982996/why-is-cudaoccupancymaxactiveblockspermultiprocessor-independent-of-device Because SYCL doesn't expose a means of setting the device in a manner similar to the CUDA runtime, I think you are right that, irrespective of whether the semantics are "max blocks per SM" or "max blocks per device", we need to pass a device parameter to this interface.
Yeah, I think this is the way to go.
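To make that conclusion concrete, one possible shape for an interface that takes the device explicitly might look like the following. This is a hypothetical sketch only, not the actual Unified Runtime entry point; the function name and parameter list are illustrative.

```cpp
#include <ur_api.h>

// Hypothetical sketch -- NOT the real UR signature. An explicit device
// handle removes the dependence on ambient (last-set) context state.
ur_result_t suggestMaxCooperativeGroupCountWithDevice(
    ur_kernel_handle_t hKernel,         // kernel to be launched cooperatively
    ur_device_handle_t hDevice,         // the missing piece: target device
    size_t localWorkSize,               // work-group size the launch will use
    size_t dynamicSharedMemorySize,     // per-group dynamic local memory
    uint32_t* pGroupCountRet);          // out: max cooperative group count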
In the CUDA and HIP adapters a
OK great, in that case the interface is fine as it is. Closing the issue. |
CC @0x12CC @nrspruit @JackAKirk Result granularity of
A question for the L0 developers is whether the L0 driver API for this function has taken into account the use cases of this functionality (e.g. use cases similar to the ones using
@0x12CC @nrspruit I have updated my comment now after recent findings invalidated one part of it, but the semantics of the query, whether it is per compute unit or per device, is the important matter; the other was a trivial one. I would appreciate feedback. Thanks! After thinking about this more, this may be a different use case, to be honest. It is correct that the maximum cooperative group count is device-wide, which for CUDA means per-SM occupancy multiplied by the number of SMs, so it is valid that retrieving the max actively executing group count per SM would have to be implemented separately from this API. However, this API can aid the implementation by giving us the total number, which for CUDA we just have to divide by the number of SMs to get the per-SM occupancy.
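Taking that arithmetic literally, a minimal sketch, assuming the CUDA adapter computes the device-wide total as per-SM occupancy times SM count:

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Recover per-SM occupancy from a device-wide total returned by the query.
int perSmOccupancy(uint32_t deviceWideCount, int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    // Assumes the device-wide total was computed as perSM * numSMs.
    return static_cast<int>(deviceWideCount) / prop.multiProcessorCount;
}
```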
CC @0x12CC @nrspruit
In the discussion from here: #1246 (comment)
it was described that urKernelSuggestMaxCooperativeGroupCountExp maps to cudaOccupancyMaxActiveBlocksPerMultiprocessor, which takes a kernel and other params and returns the maximum number of blocks that can be simultaneously executed on a streaming multiprocessor (SM).
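For reference, the CUDA runtime's C declaration (from cuda_runtime_api.h) is shown below; note that it has no device parameter and is evaluated against the currently selected device:

```cpp
// No device parameter; the query applies to the current device.
cudaError_t cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    int*        numBlocks,        // out: max resident blocks per SM
    const void* func,             // device function (kernel)
    int         blockSize,        // threads per block for the launch
    size_t      dynamicSMemSize); // per-block dynamic shared memory in bytes
```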
However, I found this in the L0 documentation:
"Use zeKernelSuggestMaxCooperativeGroupCount to recommend max group count for device for cooperative functions that device supports."
The "device" word implies that the semantics of of
urKernelSuggestMaxCooperativeGroupCountExp
is the maximum number of blocks that can be simultaneously executed in a device. A device consists of multiple streaming multiprocessors. In such a case you need to multiply the max number of blocks that can be simultanously executed in a SM by the number of SMs in a device.The number of SMs can only be retrieved by querying the device the kernel is to be run on. This information (the device to be run on) is not passed to
urKernelSuggestMaxCooperativeGroupCountExp
, nor can it be inferred from any of the other parameters.Therefore, there are two possibilities: