[RFC] cl_exp_tensor #1006

Draft · wants to merge 26 commits into main

Conversation


@linehill commented Nov 23, 2023

This draft introduces an extension for a tensor data type (cl_tensor) for storing N-dimensional data in an implementation-defined memory layout. For starters, the extension is designed to be used with the defined built-in kernel extension. In the future we may extend tensor data type support to OpenCL C / SPIR-V.

Any feedback is appreciated.

HTML is rendered here. (Last update: 8/14/2024)

@FreddieWitherden

The idea of an 'implementation-defined memory layout' is a fallacy. The optimal layout is always contextual.

As an example, here is a high-performance OpenCL SGEMM kernel for Intel GPUs which can get 75% of peak (and a good candidate for a 'built-in'). However, to perform C = A*B it wants C and B to be row-major and A to be in an obscure tiled format with column-major tiles of size 8 by 4 packed in a row-major fashion. This is no problem in practice, as you run a precursor kernel to take A from row-major and copy it into a temporary tiled array. For large matrices which are FLOP-bound the overhead here is a few percent.
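
For concreteness, a precursor packing kernel for that kind of layout could look roughly like the sketch below (illustrative only, not the actual kernel referenced above; it assumes a dense row-major M x K input with M a multiple of 8 and K a multiple of 4):

__kernel void pack_a_tiled(__global const float* A,      /* row-major, M x K */
                           __global float* A_tiled,      /* tiled output */
                           const int M, const int K)
{
    const int row = get_global_id(0);       /* 0 .. M-1 */
    const int col = get_global_id(1);       /* 0 .. K-1 */

    const int tile_r = row / 8, in_r = row % 8;
    const int tile_c = col / 4, in_c = col % 4;
    const int tiles_per_row = K / 4;

    /* Tiles are laid out row-major; elements inside each 8x4 tile are
       stored column-major. */
    const int tile_base = (tile_r * tiles_per_row + tile_c) * (8 * 4);
    A_tiled[tile_base + in_c * 8 + in_r] = A[row * K + col];
}

Launched over an M x K global range this copies A exactly once; the GEMM then reads A_tiled with its preferred access pattern.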

The key thing, however, is that the optimal layout for a parameter depends on the parameter itself. Unless you know exactly what you want to do with a tensor/matrix you cannot put it into an optimal format ahead of time. Thus, this extra complexity solves nobody's problem (a GEMM will still likely need to rearrange some of the data before operating on it) and comes at an extreme cost (none of my other kernels can operate on these tensors without going through import/export operations). Plus, even if the language were extended to enable kernels to operate on these tensors it would not help much: since they are a black box, how can I construct my kernel to ensure things like coalesced access?

@linehill
Author

Thanks for the feedback. I think the extension needs to be improved by further illustrating how drivers may gather context and use it to optimize memory layouts.

A driver should be able to gather context from kernels and the tensor arguments given to them and use it to optimize memory layouts of the tensors. For example, in the command queue code sample at the following point:

cl_mem_properties props[] = {CL_MEM_BIND_TO_TENSOR, (cl_mem_properties)in0, 0};
cl_mem in0_mem = clCreateBufferWithProperties(
  ctx, props, CL_MEM_READ_ONLY,
  0 /* must be zero for CL_MEM_BIND_TO_TENSOR. */, nullptr, &err);

The driver would at least know the in0 tensor’s shape (64, 100, 200) and know it may be passed to the matmul kernel. The driver may discover what memory layout gives the matmul kernel the best performance (especially if it is a built-in kernel) and use the best layout for the in0 tensor. Drivers may also defer the layout selection up until kernel execution, which may provide more context for the layout optimization.
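
To illustrate, the association is visible to the driver through ordinary argument binding. The snippet below is only a sketch; the program/kernel creation and the names dev and matmul are placeholders for however the built-in kernel is actually obtained:

/* The driver now sees that the tensor-backed buffer feeds this kernel and
   can choose, or defer choosing, a layout accordingly. */
cl_program prog = clCreateProgramWithBuiltInKernels(ctx, 1, &dev, "matmul", &err);
cl_kernel matmul = clCreateKernel(prog, "matmul", &err);
err = clSetKernelArg(matmul, 0, sizeof(cl_mem), &in0_mem);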

Unless you know exactly what you want to do with a tensor/matrix you can not put it into an optimal format ahead of time.

This could be true with command queues. With command buffers, however, we should already be able to tell drivers precisely what we want to do with the tensors, and that should give the drivers an opportunity to optimize the layouts ahead of time.
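
For instance, with cl_khr_command_buffer the whole sequence of kernels touching a tensor can be recorded up front, making finalization a natural point for layout selection. A rough sketch reusing the names from the snippet above (global_size is a placeholder; exact parameter lists depend on the command-buffer extension revision):

cl_command_buffer_khr cmdbuf = clCreateCommandBufferKHR(1, &queue, nullptr, &err);

/* Record every command that uses the tensor before anything executes. */
clCommandNDRangeKernelKHR(cmdbuf, queue, nullptr, matmul, 3, nullptr,
                          global_size, nullptr, 0, nullptr, nullptr, nullptr);

/* At finalize time the driver has seen all uses of in0_mem and can pick
   the tensor layout for the whole recorded graph. */
clFinalizeCommandBufferKHR(cmdbuf);
clEnqueueCommandBufferKHR(0, nullptr, cmdbuf, 0, nullptr, nullptr);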

Thus, this extra complexity solves nobodies problem (a GEMM will still likely need to rearrange some of the data before operating on it) and comes at an extreme cost (none of my other kernels can operate on these tensors without going through import/export operations).

The cost of using import/export tensor commands shouldn't be extreme - at least I’m unaware of the reasoning that led to the “extreme cost” conclusion. Could you elaborate?

The worst case I can see is that a kernel compiler is unable to analyze the kernel code to conclude an optimal layout (with respect to the context it can muster). In that case the driver may fall back to a linear memory layout, and clEnqueueExportToTensor and clEnqueueImportFromTensor would then behave similarly to clEnqueueWriteBufferRect and clEnqueueReadBufferRect, respectively.
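
For comparison, that worst case is essentially a rectangular copy. Writing one 100x200 slice of a dense (64, 100, 200) float array into an ordinary linear-layout buffer with the existing API looks like this (buf, host_data, queue and err are placeholders):

size_t elem = sizeof(float);
size_t buffer_origin[3] = {0, 0, 5};             /* x offset in bytes, then y, z */
size_t host_origin[3]   = {0, 0, 0};
size_t region[3]        = {200 * elem, 100, 1};  /* row width in bytes, rows, slices */
err = clEnqueueWriteBufferRect(queue, buf, CL_TRUE,
                               buffer_origin, host_origin, region,
                               200 * elem,        /* buffer_row_pitch   */
                               200 * 100 * elem,  /* buffer_slice_pitch */
                               200 * elem,        /* host_row_pitch     */
                               200 * 100 * elem,  /* host_slice_pitch   */
                               host_data, 0, nullptr, nullptr);

With a linear fallback layout the tensor export command would reduce to this kind of strided copy; with an opaque layout the driver would additionally shuffle elements into its chosen order.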

Plus, even if the language were extended to enable kernels to operate on these tensors it would not help much: since they are a black box how can I construct my kernel to ensure things like coalesced access?

I think I’m missing something here. AFAIK, OpenCL does not give guarantees for coalesced accesses and they are very hardware-dependent. Are you hoping that the tensor extension would bring coalesced access guarantees for software-defined (OpenCL C / SPIR-V) kernels?

I’m not aware of any doors being closed for a possible language extension with regard to improving coalescing opportunities. Can you provide insight on this matter?

@FreddieWitherden

The driver would at least know the in0 tensor’s shape (64, 100, 200) and know it may be passed to the matmul kernel.

Or any other built-in kernel. Or, in the future, any user-defined kernel (assuming the OpenCL kernel language is extended to support tensor arguments). It provides no useful information. Moreover, it is highly likely that a single tensor will partake in multiple matrix multiplications of varying sizes throughout its life, with each multiplication wanting a different layout.

Unless you know exactly what you want to do with a tensor/matrix you can not put it into an optimal format ahead of time.

This could be true with command queues. With command buffers, however, we should already be able to tell drivers in precision what we want to do with the tensors and that should give the drivers an opportunity to optimize the layouts ahead of time.

Giving opportunities in theory is very different from being useful in practice. This would add a large amount of additional complexity to the driver that goes well beyond what any of them currently do. Moreover, compare it to the alternative: pick a fixed layout for the tensors (or a couple of layouts) and have built-in kernels for them. This would not only make it easier for vendors (just a case of wiring up their existing BLAS library kernels) but also for users (no need to wait for their implementation to support command buffers and for their application to be ported over to them to get optimal performance). Fixed layouts are exactly what NVIDIA and AMD do in CUDA and HIP, respectively. It works well.

Thus, this extra complexity solves nobodies problem (a GEMM will still likely need to rearrange some of the data before operating on it) and comes at an extreme cost (none of my other kernels can operate on these tensors without going through import/export operations).

The cost of using import/export tensor commands shouldn't be extreme - at least I’m unaware of the reasoning that led to the “extreme cost” conclusion. Could you elaborate?

Superfluous memory copies to/from device memory. In contrast, a fixed layout allows the decision about whether to make a copy to be deferred until the last minute (and made by the kernel itself, which has the most context); having two separate operations instead requires driver magic. Further, if the stars align it may be possible for the matmul to use a so-called direct kernel which makes no temporary copies at all.

Plus, even if the language were extended to enable kernels to operate on these tensors it would not help much: since they are a black box how can I construct my kernel to ensure things like coalesced access?

I think I’m missing something here. AFAIK, OpenCL does not give guarantees for coalesced accesses and they are very hardware dependent. Are you hoping that tensor extension would bring coalesced access guarantees for software defined (OpenCL C / SPIR-V) kernels?

The data for matrix multiplications has to come from somewhere. Most of the time that is an OpenCL kernel. Moreover, if you don't want superfluous copies then kernels need support for tensor arguments. But if the layout is a black box, then how do you write the kernel efficiently? As a simple example, consider a matrix stored in row-major vs. column-major order. If you tell me (the kernel author) what you want, I can ensure that my kernel is mostly doing stride-1 reads and writes. If the layout is abstracted then a lot of my accesses could end up as gather/scatter operations.
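
A toy sketch of the point (hypothetical kernel, not from the discussion): the same column-sum kernel over an N x N float matrix coalesces for one layout and degenerates into strided reads for the other, and only knowledge of the layout lets the author pick the right indexing.

/* Each work-item sums one column. With row-major storage, adjacent
   work-items read adjacent addresses on every iteration, so accesses
   coalesce. With column-major storage, adjacent work-items are N floats
   apart each iteration and the same code turns into strided reads. */
__kernel void column_sums(__global const float* M,
                          __global float* sums,
                          const int N, const int row_major)
{
    const int col = get_global_id(0);
    float acc = 0.0f;
    for (int row = 0; row < N; ++row)
        acc += row_major ? M[row * N + col]    /* coalesced across work-items */
                         : M[col * N + row];   /* stride-N across work-items */
    sums[col] = acc;
}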

This is why fixed layouts make so much more sense and are the norm in BLAS libraries; cublasSgemmBatched and cublasSgemmStridedBatched are a nice example of how these APIs can work. Notice how two APIs are needed (the first is useful for saving memory and improving cache hit rates if you have a batch of B matrices and want to multiply them all by the same A matrix, since you only need to allocate one A matrix).
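
For reference, the strided-batched variant with its fixed column-major layout looks roughly like this (sketch only; handle, device pointers and sizes are assumed to be set up elsewhere):

/* Batched C_i = alpha * A_i * B_i + beta * C_i on column-major matrices.
   Problem i starts strideA/strideB/strideC elements after problem i-1;
   the layout and leading dimensions are fixed by the API rather than
   inferred by the driver. */
const float alpha = 1.0f, beta = 0.0f;
cublasSgemmStridedBatched(handle,
                          CUBLAS_OP_N, CUBLAS_OP_N,
                          m, n, k,
                          &alpha,
                          dA, m, (long long)m * k,   /* A_i: m x k, lda = m */
                          dB, k, (long long)k * n,   /* B_i: k x n, ldb = k */
                          &beta,
                          dC, m, (long long)m * n,   /* C_i: m x n, ldc = m */
                          batch_count);

The pointer-array variant, cublasSgemmBatched, covers the shared-A case simply by repeating the same A pointer for every batch entry.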

@linehill
Author

Thanks for the input. Fair enough - this extension, in conjunction with the defined built-in kernel extension, was designed to be used with command buffers. To get the most benefit out of the "opaque layouts", the compiler/runtime needs to know how a kernel connects to others, which may only be accomplished reasonably with command buffers.

For usability without command buffers we could add a feature that lets OpenCL programmers set the tensor layouts and tune their kernels with respect to them, and have a way to "manually" eliminate the need for import/export tensor commands.

We’ll draft a new version which should also cover the scenario you described.

linehill and others added 20 commits August 8, 2024 14:14
Co-authored-by: Ben Ashbaugh <[email protected]>
Co-authored-by: Pekka Jääskeläinen <[email protected]>
* cl_khr_tensor -> cl_exp_tensor.

* Remove cl_khr_command_buffer requirement.
* Fix signed -> unsigned.

* Single line cl{Retain,Release}TensorObject declaration.
…Tensor.

* Clarify in clEnqueue{TranslateFrom,TranslateTo}Tensor that data read
  from / written to the tensor in opaque manner.
* new section for tensor data type

* add origin, region and pitch parameters for clEnqueueTranslate*Tensor.

* Update code samples.

* Add take on accessing tensors in OpenCL C.
Co-authored-by: Pekka Jääskeläinen <[email protected]>
* Add error codes for tensor translation commands.

* Tweaked mem_pitch semantics.

@pjaaskel left a comment


Thanks! Mainly just grammar/typos found so far.

. Should we define OpenCL C language features for accessing tensors?
+
--
*RESOLVED*: OpenCL C support for tensors can be introduced later in a

I'd just add those built-ins to this extension right away to avoid expanding the extension fragmentation.

. Should image types be extended instead of adding a separate tensor type?
+
--
*UNRESOLVED*

I think it logically might be the other way around: images are a special case of tensors. Perhaps something to consider in the longer run if this one takes off.

Co-authored-by: Pekka Jääskeläinen <[email protected]>
Co-authored-by: Pekka Jääskeläinen <[email protected]>