[RFC] cl_exp_tensor #1006
Conversation
The idea of an 'implementation-defined memory layout' is a fallacy. The optimal layout is always contextual. As an example, here is a high-performance OpenCL SGEMM kernel for Intel GPUs which can get 75% of peak (and a good candidate for a 'built-in'). The key thing, however, is that the optimal layout for a parameter depends on the parameter itself. Unless you know exactly what you want to do with a tensor/matrix, you cannot put it into an optimal format ahead of time. Thus, this extra complexity solves nobody's problem (a GEMM will still likely need to rearrange some of the data before operating on it) and comes at an extreme cost (none of my other kernels can operate on these tensors without going through import/export operations). Plus, even if the language were extended to enable kernels to operate on these tensors, it would not help much: since they are a black box, how can I construct my kernel to ensure things like coalesced access?
Thanks for the feedback. I think the extension needs to be improved by illustrating further how drivers may gather context and use it to optimize memory layouts. A driver should be able to gather context from kernels and the tensor arguments given to them, and use it to optimize the memory layouts of the tensors. For example, in the command queue code sample at the following point:
The driver would at least know the
This could be true with command queues. With command buffers, however, we should already be able to tell drivers precisely what we want to do with the tensors, and that should give the drivers an opportunity to optimize the layouts ahead of time.
The cost of using import/export tensor commands shouldn't be extreme - at least I’m unaware of the reasoning that led to the “extreme cost” conclusion. Could you elaborate? The worst case I can see is that a kernel compiler is unable to analyze the kernel code to conclude an optimal layout (with respect to the context it can muster). In that case the driver may fall back to a linear memory layout and
I think I’m missing something here. AFAIK, OpenCL does not give guarantees for coalesced accesses, and they are very hardware-dependent. Are you hoping that the tensor extension would bring coalesced-access guarantees for software-defined (OpenCL C / SPIR-V) kernels? I’m not aware of doors being closed for a possible language extension with regard to improving coalescing opportunities. Can you provide insight on this matter?
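To make the command-buffer argument above concrete, here is a hedged host-code sketch. Only the cl_khr_command_buffer calls are existing API; the cl_tensor placeholder type, clCreateTensorExp, and binding tensors through clSetKernelArg are assumptions, since the cl_exp_tensor draft does not pin these down in this thread.

```c
// Sketch only: how a command buffer gives the driver the whole graph before
// finalization, so it has the context to pick per-tensor layouts.
#include <CL/cl.h>
#include <CL/cl_ext.h>   // cl_khr_command_buffer prototypes (SDK-dependent)

typedef struct _cl_tensor* cl_tensor;                    // placeholder type
extern cl_tensor clCreateTensorExp(cl_context, cl_uint rank,
                                   const size_t* shape, cl_int* err); // name assumed
extern cl_int clReleaseTensorObject(cl_tensor);  // name from the draft, signature assumed

void record_graph(cl_context ctx, cl_command_queue queue,
                  cl_kernel producer, cl_kernel consumer)
{
    cl_int err;
    size_t shape[2] = {1024, 1024};
    cl_tensor t = clCreateTensorExp(ctx, 2, shape, &err);

    // Assumption: tensors bind like other object handles.
    clSetKernelArg(producer, 0, sizeof(cl_tensor), &t);
    clSetKernelArg(consumer, 0, sizeof(cl_tensor), &t);

    cl_command_buffer_khr cmdbuf = clCreateCommandBufferKHR(1, &queue, NULL, &err);

    size_t gws[2] = {1024, 1024};
    cl_sync_point_khr produced;

    // The producer writes 't'...
    clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, producer, 2, NULL, gws, NULL,
                              0, NULL, &produced, NULL);
    // ...and the consumer reads it. Both uses are recorded before finalize.
    clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, consumer, 2, NULL, gws, NULL,
                              1, &produced, NULL, NULL);

    // Finalization is the point where a driver sees every command touching
    // 't' and could commit to (or reject) a non-linear layout.
    clFinalizeCommandBufferKHR(cmdbuf);
    clEnqueueCommandBufferKHR(0, NULL, cmdbuf, 0, NULL, NULL);
    clFinish(queue);

    clReleaseCommandBufferKHR(cmdbuf);
    clReleaseTensorObject(t);
}
```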
Or any other built-in kernel. Or, in the future, any user-defined kernel (assuming the OpenCL kernel language is extended to support tensor arguments). It provides no useful information. Moreover, it is highly likely that a single tensor will take part in multiple matrix multiplications of varying sizes throughout its life, with each multiplication wanting a different layout.
Giving opportunities in theory is very different from being useful in practice. This would add a large amount of additional complexity to the driver that goes well beyond what any of them currently do. Moreover, compare it to the alternative: pick a fixed layout for the tensors (or a couple of layouts) and have built-in kernels for them. This would not only make it easier for vendors (just a case of wiring up their existing BLAS library kernels) but also for users (no need to wait for their implementation to support command buffers and for their application to be ported over to them to get optimal performance). Fixed layouts are exactly what NVIDIA and AMD do in CUDA and HIP, respectively. It works well.
Superfluous memory copies to/from device memory. A fixed layout allows the decision about whether to make a copy to be deferred until the last minute (and made by the kernel itself, which has the most context). In contrast, having two operations requires driver magic. Further, if the stars align it may be possible for the matmul to use a so-called direct kernel which makes no temporary copies at all.
The data for matrix multiplications has to come from somewhere. Most of the time that is an OpenCL kernel. Moreover, if you don't want superfluous copies then kernels need support for tensor arguments. But if the layout is a black box, then how do you write the kernel efficiently? As a simple example, consider a matrix stored in row-major vs. column-major order. If you tell me (the kernel author) what you want, I can ensure that my kernel is mostly doing stride-1 reads and writes. If the layout is abstracted, then a lot of my accesses could end up as gather/scatter operations. This is why fixed layouts make so much more sense and are the norm in BLAS libraries.
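As an illustration of the stride-1 point (my own toy example, not from the proposal), here is the same trivial "scale a matrix" kernel written with knowledge of the layout versus indexing that fights the layout:

```c
// Row-major layout known to the kernel author: get_global_id(0) is the
// fastest-varying dimension on most GPU implementations, so neighbouring
// work-items touch consecutive addresses (stride-1, coalescing-friendly).
__kernel void scale_row_major(__global float* A, const int N, const float s)
{
    const int col = (int)get_global_id(0);
    const int row = (int)get_global_id(1);
    A[row * N + col] *= s;   // consecutive work-items, consecutive floats
}

// If the data is actually column-major (or some opaque tiled layout) but the
// kernel indexes it the same way, neighbouring work-items end up M floats
// apart and the hardware sees strided/gather-like accesses instead.
__kernel void scale_col_major(__global float* A, const int M, const float s)
{
    const int col = (int)get_global_id(0);
    const int row = (int)get_global_id(1);
    A[col * M + row] *= s;   // stride of M floats between neighbouring work-items
}
```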
Thanks for the input. Fair enough - this extension, in conjunction with the defined built-in kernel extension, was designed to be used with command buffers. To get the most benefit out of the "opaque layouts", the compiler/runtime needs to know how a kernel connects to others, which may only be accomplished reasonably with command buffers. For usability without command buffers we could add a feature that lets OpenCL programmers set the tensor layouts and tune their kernels with respect to them, and have a way to "manually" eliminate the need for import/export tensor commands. We’ll draft a new version which should also cover the scenario you described.
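Purely as an illustration of what such a "programmer-set layout" could look like, here is a small sketch; every identifier in it is hypothetical and not part of the current draft:

```c
/* Hypothetical: pin a documented layout at tensor-creation time.
 * cl_tensor_properties_exp, CL_TENSOR_LAYOUT_EXP,
 * CL_TENSOR_LAYOUT_ROW_MAJOR_EXP and clCreateTensorExp are all made up
 * for illustration. */
cl_tensor_properties_exp props[] = {
    CL_TENSOR_LAYOUT_EXP, CL_TENSOR_LAYOUT_ROW_MAJOR_EXP,  /* pin the layout */
    0                                                      /* terminator */
};
size_t shape[2] = {1024, 1024};
cl_int err;
/* With an explicitly pinned layout the host could copy data in and out
 * directly, and hand-written kernels could index the tensor without a
 * translate step, trading the driver's layout freedom for predictability. */
cl_tensor t = clCreateTensorExp(ctx, props, 2, shape, &err);
```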
Co-authored-by: Ben Ashbaugh <[email protected]> Co-authored-by: Pekka Jääskeläinen <[email protected]>
* cl_khr_tensor -> cl_exp_tensor. * Remove cl_khr_command_buffer requirement.
* Fix signed -> unsigned. * Single line cl{Retain,Release}TensorObject declaration.
…Tensor. * Clarify in clEnqueue{TranslateFrom,TranslateTo}Tensor that data is read from / written to the tensor in an opaque manner.
* new section for tensor data type * add origin, region and pitch parameters for clEnqueueTranslate*Tensor. * Update code samples. * Add take on accessing tensors in OpenCL C.
Co-authored-by: Pekka Jääskeläinen <[email protected]>
* Add error codes for tensor translation commands. * Tweaked mem_pitch semantics.
* Fix typos.
Force-pushed from d59e19f to a9c5402
Thanks! Mainly just grammar/typos found so far.
. Should we define OpenCL C language features for accessing tensors?
+
--
*RESOLVED*: OpenCL C support for tensors can be introduced later in a separate extension.
I'd just add those built-ins to this extension right away, to avoid increasing extension fragmentation.
. Should image types be extended instead of adding a separate tensor type?
+
--
*UNRESOLVED*
I think it logically might be the other way around: images are a special case of tensors. Perhaps something to consider in the longer run if this one takes off.
Co-authored-by: Pekka Jääskeläinen <[email protected]>
This draft introduces an extension for a tensor data type (cl_tensor) for storing N-dimensional data in an implementation-defined memory layout. This extension is initially designed to be used with the defined built-in kernel extension. In the future we may extend tensor data type support to OpenCL C / SPIR-V. Any feedback is appreciated.
HTML is rendered here. (Last update: 8/14/2024)
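For orientation, a rough end-to-end host flow under the proposal might look like the sketch below. clCreateProgramWithBuiltInKernels, clSetKernelArg and clEnqueueNDRangeKernel are core OpenCL; the built-in kernel name "exp_matmul", the tensor-creation call and the translate-command parameters are assumptions, so the translate steps are only indicated in comments where the draft defines them.

```c
// Rough end-to-end flow under the proposal, for orientation only.
#include <CL/cl.h>

typedef struct _cl_tensor* cl_tensor;                    // placeholder type
extern cl_tensor clCreateTensorExp(cl_context, cl_uint rank,
                                   const size_t* shape, cl_int* err); // name assumed

void run_builtin_matmul(cl_context ctx, cl_device_id dev, cl_command_queue q,
                        const float* host_a, const float* host_b, float* host_c)
{
    cl_int err;
    size_t shape[2] = {512, 512};
    cl_tensor A = clCreateTensorExp(ctx, 2, shape, &err);
    cl_tensor B = clCreateTensorExp(ctx, 2, shape, &err);
    cl_tensor C = clCreateTensorExp(ctx, 2, shape, &err);
    (void)host_a; (void)host_b; (void)host_c;  // only used at the commented steps

    // 1. Import host data; the driver may re-lay it out here.
    //    clEnqueueTranslateToTensor(q, A, host_a, /*origin, region, pitch, ...*/);
    //    clEnqueueTranslateToTensor(q, B, host_b, ...);

    // 2. Run a built-in kernel that takes tensor arguments (name assumed).
    cl_program prog =
        clCreateProgramWithBuiltInKernels(ctx, 1, &dev, "exp_matmul", &err);
    cl_kernel k = clCreateKernel(prog, "exp_matmul", &err);
    clSetKernelArg(k, 0, sizeof(cl_tensor), &A);
    clSetKernelArg(k, 1, sizeof(cl_tensor), &B);
    clSetKernelArg(k, 2, sizeof(cl_tensor), &C);
    size_t gws[2] = {512, 512};
    clEnqueueNDRangeKernel(q, k, 2, NULL, gws, NULL, 0, NULL, NULL);

    // 3. Export the result back to a plain, linear host buffer.
    //    clEnqueueTranslateFromTensor(q, C, host_c, ...);
    clFinish(q);

    clReleaseKernel(k);
    clReleaseProgram(prog);
}
```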