[RFC] cl_exp_tensor #1006
Conversation
The idea of an 'implementation-defined memory layout' is a fallacy. The optimal layout is always contextual. As an example, here is a high-performance OpenCL SGEMM kernel for Intel GPUs which can get 75% of peak (and a good candidate for a 'built-in'). The key thing, however, is that the optimal layout for a parameter depends on the parameter itself. Unless you know exactly what you want to do with a tensor/matrix, you cannot put it into an optimal format ahead of time. Thus, this extra complexity solves nobody's problem (a GEMM will still likely need to rearrange some of the data before operating on it) and comes at an extreme cost (none of my other kernels can operate on these tensors without going through import/export operations). Plus, even if the language were extended to enable kernels to operate on these tensors, it would not help much: since they are a black box, how can I construct my kernel to ensure things like coalesced access?
Thanks for the feedback. I think the extension needs to be improved by illustrating further how drivers may gather context and use it to optimize memory layouts. A driver should be able to gather context from kernels and the tensor arguments given to them, and use it to optimize the memory layouts of the tensors. For example, in the command queue code sample at the following point:
The driver would at least know the
This could be true with command queues. With command buffers, however, we should already be able to tell drivers precisely what we want to do with the tensors, and that should give the drivers an opportunity to optimize the layouts ahead of time.
The cost of using import/export tensor commands shouldn't be extreme - at least I’m unaware of the reasoning that led to the “extreme cost” conclusion. Could you elaborate? The worst case I can see is that a kernel compiler is unable to analyze the kernel code to conclude an optimal layout (with respect to the context it can muster). In that case the driver may fall back to a linear memory layout and
I think I’m missing something here. AFAIK, OpenCL does not give guarantees for coalesced accesses, and they are very hardware-dependent. Are you hoping that the tensor extension would bring coalesced-access guarantees for software-defined (OpenCL C / SPIR-V) kernels? I’m not aware of doors being closed for a possible language extension with regard to improving coalescing opportunities. Can you provide insight on this matter?
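To make the command-buffer argument above concrete, here is a hedged host-code sketch. Only the cl_khr_command_buffer calls are existing API; the cl_tensor placeholder type, clCreateTensorExp, and binding tensors through clSetKernelArg are assumptions, since the cl_exp_tensor draft does not pin these down in this thread.

```c
// Sketch only: how a command buffer gives the driver the whole graph before
// finalization, so it has the context to pick per-tensor layouts.
#include <CL/cl.h>
#include <CL/cl_ext.h>   // cl_khr_command_buffer prototypes (SDK-dependent)

typedef struct _cl_tensor* cl_tensor;                    // placeholder type
extern cl_tensor clCreateTensorExp(cl_context, cl_uint rank,
                                   const size_t* shape, cl_int* err); // name assumed
extern cl_int clReleaseTensorObject(cl_tensor);  // name from the draft, signature assumed

void record_graph(cl_context ctx, cl_command_queue queue,
                  cl_kernel producer, cl_kernel consumer)
{
    cl_int err;
    size_t shape[2] = {1024, 1024};
    cl_tensor t = clCreateTensorExp(ctx, 2, shape, &err);

    // Assumption: tensors bind like other object handles.
    clSetKernelArg(producer, 0, sizeof(cl_tensor), &t);
    clSetKernelArg(consumer, 0, sizeof(cl_tensor), &t);

    cl_command_buffer_khr cmdbuf = clCreateCommandBufferKHR(1, &queue, NULL, &err);

    size_t gws[2] = {1024, 1024};
    cl_sync_point_khr produced;

    // The producer writes 't'...
    clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, producer, 2, NULL, gws, NULL,
                              0, NULL, &produced, NULL);
    // ...and the consumer reads it. Both uses are recorded before finalize.
    clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, consumer, 2, NULL, gws, NULL,
                              1, &produced, NULL, NULL);

    // Finalization is the point where a driver sees every command touching
    // 't' and could commit to (or reject) a non-linear layout.
    clFinalizeCommandBufferKHR(cmdbuf);
    clEnqueueCommandBufferKHR(0, NULL, cmdbuf, 0, NULL, NULL);
    clFinish(queue);

    clReleaseCommandBufferKHR(cmdbuf);
    clReleaseTensorObject(t);
}
```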
Or any other built-in kernel. Or, in the future, any user-defined kernel (assuming the OpenCL kernel language is extended to support tensor arguments). It provides no useful information. Moreover, it is highly likely that a single tensor will take part in multiple matrix multiplications of varying sizes throughout its life, with each multiplication wanting a different layout.
Giving opportunities in theory is very different from being useful in practice. This would add a large amount of additional complexity to the driver that goes well beyond what any of them currently do. Moreover, compare it to the alternative: pick a fixed layout for the tensors (or a couple of layouts) and have built-in kernels for them. This would not only make it easier for vendors (just a case of wiring up their existing BLAS library kernels) but also for users (no need to wait for their implementation to support command buffers and for their application to be ported over to them to get optimal performance). Fixed layouts are exactly what NVIDIA and AMD do in CUDA and HIP, respectively. It works well.
Superfluous memory copies to/from device memory. A fixed layout allows the decision about whether to make a copy to be deferred until the last minute (and made by the kernel itself, which has the most context). In contrast, having two operations requires driver magic. Further, if the stars align it may be possible for the matmul to use a so-called direct kernel which makes no temporary copies at all.
The data for matrix multiplications has to come from somewhere. Most of the time that is an OpenCL kernel. Moreover, if you don't want superfluous copies then kernels need support for tensor arguments. But if the layout is a black box, then how do you write the kernel efficiently? As a simple example, consider a matrix stored in row-major vs. column-major order. If you tell me (the kernel author) what you want, I can ensure that my kernel is mostly doing stride-1 reads and writes. If the layout is abstracted, then a lot of my accesses could end up as gather/scatter operations. This is why fixed layouts make so much more sense and are the norm in BLAS libraries.
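As an illustration of the stride-1 point (my own toy example, not from the proposal), here is the same trivial "scale a matrix" kernel written with knowledge of the layout versus indexing that fights the layout:

```c
// Row-major layout known to the kernel author: get_global_id(0) is the
// fastest-varying dimension on most GPU implementations, so neighbouring
// work-items touch consecutive addresses (stride-1, coalescing-friendly).
__kernel void scale_row_major(__global float* A, const int N, const float s)
{
    const int col = (int)get_global_id(0);
    const int row = (int)get_global_id(1);
    A[row * N + col] *= s;   // consecutive work-items, consecutive floats
}

// If the data is actually column-major (or some opaque tiled layout) but the
// kernel indexes it the same way, neighbouring work-items end up M floats
// apart and the hardware sees strided/gather-like accesses instead.
__kernel void scale_col_major(__global float* A, const int M, const float s)
{
    const int col = (int)get_global_id(0);
    const int row = (int)get_global_id(1);
    A[col * M + row] *= s;   // stride of M floats between neighbouring work-items
}
```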
Thanks for the input. Fair enough - this extension, in conjunction with the defined built-in kernel extension, was designed to be used with command buffers. To get the most benefit out of the "opaque layouts", the compiler/runtime needs to know how a kernel connects to others, which may only be accomplished reasonably with command buffers. For usability without command buffers we could add a feature that lets OpenCL programmers set the tensor layouts and tune their kernels with respect to them, and have a way to "manually" eliminate the need for import/export tensor commands. We’ll draft a new version which should also cover the scenario you described.
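Purely as an illustration of what such a "programmer-set layout" could look like, here is a small sketch; every identifier in it is hypothetical and not part of the current draft:

```c
/* Hypothetical: pin a documented layout at tensor-creation time.
 * cl_tensor_properties_exp, CL_TENSOR_LAYOUT_EXP,
 * CL_TENSOR_LAYOUT_ROW_MAJOR_EXP and clCreateTensorExp are all made up
 * for illustration. */
cl_tensor_properties_exp props[] = {
    CL_TENSOR_LAYOUT_EXP, CL_TENSOR_LAYOUT_ROW_MAJOR_EXP,  /* pin the layout */
    0                                                      /* terminator */
};
size_t shape[2] = {1024, 1024};
cl_int err;
/* With an explicitly pinned layout the host could copy data in and out
 * directly, and hand-written kernels could index the tensor without a
 * translate step, trading the driver's layout freedom for predictability. */
cl_tensor t = clCreateTensorExp(ctx, props, 2, shape, &err);
```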
Co-authored-by: Ben Ashbaugh <[email protected]> Co-authored-by: Pekka Jääskeläinen <[email protected]>
* cl_khr_tensor -> cl_exp_tensor. * Remove cl_khr_command_buffer requirement.
* Fix signed -> unsigned. * Single line cl{Retain,Release}TensorObject declaration.
…Tensor. * Clarify in clEnqueue{TranslateFrom,TranslateTo}Tensor that data is read from / written to the tensor in an opaque manner.
* new section for tensor data type * add origin, region and pitch parameters for clEnqueueTranslate*Tensor. * Update code samples. * Add take on accessing tensors in OpenCL C.
Co-authored-by: Pekka Jääskeläinen <[email protected]>
* Add error codes for tensor translation commands. * Tweaked mem_pitch semantics.
* Fix typos.
Force-pushed from d59e19f to a9c5402
Thanks! Mainly just grammar/typos found so far.
. Should we define OpenCL C language features for accessing tensors?
+
--
*RESOLVED*: OpenCL C support for tensors can be introduced later in a separate extension.
I'd just add those built-ins to this extension right away, to avoid increasing extension fragmentation.
. Should image types be extended instead of adding a separate tensor type?
+
--
*UNRESOLVED*
I think it logically might be the other way around: images are a special case of tensors. Perhaps something to consider in the longer run if this one takes off.
Co-authored-by: Pekka Jääskeläinen <[email protected]>
This draft introduces an extension for a tensor data type (cl_tensor) for storing N-dimensional data in an implementation-defined memory layout. This extension is initially designed to be used with the defined built-in kernel extension. In the future we may extend tensor data type support to OpenCL C / SPIR-V. Any feedback is appreciated.
HTML is rendered here. (Last update: 8/14/2024)
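For orientation, a rough end-to-end host flow under the proposal might look like the sketch below. clCreateProgramWithBuiltInKernels, clSetKernelArg and clEnqueueNDRangeKernel are core OpenCL; the built-in kernel name "exp_matmul", the tensor-creation call and the translate-command parameters are assumptions, so the translate steps are only indicated in comments where the draft defines them.

```c
// Rough end-to-end flow under the proposal, for orientation only.
#include <CL/cl.h>

typedef struct _cl_tensor* cl_tensor;                    // placeholder type
extern cl_tensor clCreateTensorExp(cl_context, cl_uint rank,
                                   const size_t* shape, cl_int* err); // name assumed

void run_builtin_matmul(cl_context ctx, cl_device_id dev, cl_command_queue q,
                        const float* host_a, const float* host_b, float* host_c)
{
    cl_int err;
    size_t shape[2] = {512, 512};
    cl_tensor A = clCreateTensorExp(ctx, 2, shape, &err);
    cl_tensor B = clCreateTensorExp(ctx, 2, shape, &err);
    cl_tensor C = clCreateTensorExp(ctx, 2, shape, &err);
    (void)host_a; (void)host_b; (void)host_c;  // only used at the commented steps

    // 1. Import host data; the driver may re-lay it out here.
    //    clEnqueueTranslateToTensor(q, A, host_a, /*origin, region, pitch, ...*/);
    //    clEnqueueTranslateToTensor(q, B, host_b, ...);

    // 2. Run a built-in kernel that takes tensor arguments (name assumed).
    cl_program prog =
        clCreateProgramWithBuiltInKernels(ctx, 1, &dev, "exp_matmul", &err);
    cl_kernel k = clCreateKernel(prog, "exp_matmul", &err);
    clSetKernelArg(k, 0, sizeof(cl_tensor), &A);
    clSetKernelArg(k, 1, sizeof(cl_tensor), &B);
    clSetKernelArg(k, 2, sizeof(cl_tensor), &C);
    size_t gws[2] = {512, 512};
    clEnqueueNDRangeKernel(q, k, 2, NULL, gws, NULL, 0, NULL, NULL);

    // 3. Export the result back to a plain, linear host buffer.
    //    clEnqueueTranslateFromTensor(q, C, host_c, ...);
    clFinish(q);

    clReleaseKernel(k);
    clReleaseProgram(prog);
}
```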