From 7d5e350f1a196c193dac7b383ffb000b50caa95d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Wed, 25 Oct 2023 08:34:30 +0300 Subject: [PATCH 01/26] Sketch something for cl_khr_tensor --- ext/cl_khr_tensor.asciidoc | 547 ++++++++++++++++ ext/cl_khr_tensor.html | 1228 ++++++++++++++++++++++++++++++++++++ 2 files changed, 1775 insertions(+) create mode 100644 ext/cl_khr_tensor.asciidoc create mode 100644 ext/cl_khr_tensor.html diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc new file mode 100644 index 00000000..cd17a42b --- /dev/null +++ b/ext/cl_khr_tensor.asciidoc @@ -0,0 +1,547 @@ +// Copyright 2023 The Khronos Group. This work is licensed under a +// Creative Commons Attribution 4.0 International License; see +// http://creativecommons.org/licenses/by/4.0/ += cl_khr_tensor + +:source-highlighter: coreray + +[[cl_khr_tensor]] +== Tensor Data Type + +Purpose of this extension is to provide ... + +=== General information + +==== Name Strings + +`cl_khr_tensor` + +==== Version history + +[cols="1,1,3",options="header",] +|==== +| *Date* | *Version* | *Description* +| 2023-10-XX | 0.1.0 | First assigned version. +|==== + +==== Dependencies + +This extension is written against the OpenCL Specification version 3.0.14. + +This extension requires OpenCL 1.2 or later. + +This extension requires cl_khr_command_buffer. + +==== Contributors + +Henry Linjamäki, Intel. + + +=== Overview + + +=== Modifications to OpenCL + +==== New OpenCL Functions + +To create a tensor use: + +[source,c] +---- +cl_tensor clCreateTensor( + cl_context context, + const cl_tensor_peoperties *properties, + size_t rank, + size_t shape, + cl_tensor_type dtype, + cl_int *errcode_ret); +---- + +* _context_ is a valid OpenCL context used to create the tensor object. + +* _properties_ is an optional list of properties for the tensor object + and their corresponding values. The list is terminated with the + special property 0. If no properties are required, properties may be + NULL. + +* _rank_ is the number of dimensions. Zero value creates a "scalar" + tensor which has no dimensions but has storage for one element. + +* _shape_ is a list of sizes of the dimensions. The length of the list + must be _rank_ elements. _shape_ can be NULL if _rank_ value is + zero. All the first _rank_ values in the list must be non-zero. + +* _dtype_ is the element type of _tensor_. Refer to the + <> table for the types. + +* _errcode_ret_ may return an appropriate error code. If errcode_ret + is NULL, no error code is returned. + +clCreateTensor function creates a `rank`-dimensional tensor with +`shape[0] * shape[1] * ... * shape[rank-1]` elements of _dtype_ +type. At the creation time of the tensor, it does not have +storage. The storage is assigned to the tensor either by: + +* calling clCreateBufferWithProperties() with CL_MEM_BIND_TO_TENSOR or + +* automatically by command buffers - possibly on-demand basis - if the + tensor is created with CL_TENSOR_COMMAND_BUFFER_TEMPORARY property + set on. + +A command that refers to a tensor must be bound to a valid buffer +object before enqueuing the command into a command queue unless the +command is recorded in a command buffer and +CL_TENSOR_COMMAND_BUFFER_TEMPORARY is set to true. + +*clCreateTensor* returns a valid non-zero tensor object and errcode_ret +is set to CL_SUCCESS if the tensor object is created +successfully. 
Otherwise, they return a NULL value with one of the +following error values returned in errcode_ret: + +* CL_INVALID_CONTEXT if context is not a valid context. + +* CL_INVALID_PROPERTY if a property name in properties is not a + supported property name, if the value specified for a supported + property name is not valid, or if the same property name is + specified more than once. + +* CL_INVALID_VALUE if a value specified in dtype is invalid. + +* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources + required by the OpenCL implementation on the host. + +.Tensor element types +[cols="1,2",stripes=odd] +[#TensorDtypes] +|=== +| *Tensor element data type* | *Description* + +| CL_TENSOR_BOOL | 1-bit signedless integer. +| CL_TENSOR_INT8 | 8-bit signed integer. +| CL_TENSOR_INT16 | 16-bit signed integer. +| CL_TENSOR_INT32 | 32-bit signed integer. +| CL_TENSOR_INT64 | 64-bit signed integer. +| CL_TENSOR_UINT8 | 8-bit signed integer. +| CL_TENSOR_UINT16 | 16-bit signed integer. +| CL_TENSOR_UINT32 | 32-bit signed integer. +| CL_TENSOR_UINT64 | 64-bit signed integer. +| CL_TENSOR_HALF | Half precision floating-point value. +| CL_TENSOR_BFLOAT16 | 16-bit brain floating-point value. +| CL_TENSOR_FLOAT | Single precision floating-point value. +| CL_TENSOR_DOUBLE | Double precision floating-point value. +| CL_TENSOR_COMPLEX64 | 64-bit complex floating point value with + 32-bit real and imaginary part. +| CL_TENSOR_COMPLEX128 | 128-bit complex floating point value with + 64-bit real and imaginary part. +|=== + +.Tensor properties +[cols="2,1,2",stripes=odd] +|=== +| *Tensor Property* | *Property Value* | *Description* + +| CL_TENSOR_COMMAND_BUFFER_TEMPORARY | cl_bool + +a| If the value is true, create a "temporary" tensor that only can be +used on commands recorded in command buffers. The storage of the +temporary tensors are managed by command buffers. When a temporary +tensor is used by multiple command buffer, the tensor receive separate +storage for each command buffer. + +// IOW, Data may not be exchanged between command buffers through +// temporary tensors. + +Temporary tensors may not be bound to buffer objects. + +Data stored in temporary tensors are not preserved across command +buffer executions. +|=== + +To retain a tensor object, call the function + +[source,c] +---- +cl_int clRetainTensorObject( + cl_tensor tensor); +---- + +* _tensor_ is the tensor object to be retained. + +The _tensor_ reference count is incremented. + +*clRetainTensor* returns CL_SUCCESS if the function is executed +successfully. Otherwise, it returns one of the following errors: + +* CL_INVALID_TENSOR if tensor is not a valid tensor object. + +To release a tensor object, call the function + +[source,c] +---- +cl_int clReleaseTensorObject( + cl_tensor tensor); +---- + +* _tensor_ is the tensor object to be released. + +The _tensor_ reference count is decremented. + +The tensor object is deleted once the number of instances that are +retained to tensor become zero and the tensor object is no longer +needed by any enqueued or recorded commands that use _tensor_. Using +this function to release a reference that was not obtained by creating +the object or by calling *clRetainTensor* causes undefined behavior. + +*clReleaseTensor* returns CL_SUCCESS if the function is executed +successfully. Otherwise, it returns one of the following errors: + +* CL_INVALID_TENSOR if tensor is not a valid tensor object. + +// TODO: add clSetTensorObjectDestructorCallback? 
+ +To return information about a tensor object, call the function + +[source,c] +---- +cl_int clGetTensorInfo( + cl_tensor tensor, + cl_tensor_info param_name, + size_t param_value_size, + void* param_value, + size_t* param_value_size_ret); +---- + +* _tensor_ specifies the tensor object being queried. + +* _param_name_ specifies the information to query. The list of + supported param_name types and the information returned in + _param_value_ by clGetTensorInfo is described in the <> table. + +* _param_value_ is a pointer to memory where the appropriate result + being queried is returned. If _param_value_ is NULL, it is ignored. + +* _param_value_size_ is used to specify the size in bytes of memory + pointed to by _param_value_. This size must be ≥ size of return type + as described in the <> table. + +* _param_value_size_ret_ returns the actual size in bytes of data + being queried by _param_name_. If _param_value_size_ret_ is NULL, it is + ignored. + +*clGetTensorInfo* returns CL_SUCCESS if the function is executed + succesfully. Otherwise, it returns one of the following errors: + +* CL_INVALID_TENSOR if _tensor_ is not a valid tensor object. + +[#Tensor Object Quaries] +.List of supported param_names by clGetTensorInfo +[cols="2,1,2",stripes=odd] +|=== +| CL_TENSOR_RANK | size_t | Return the tensor rank. +| CL_TENSOR_SHAPE | size_t[] | Return the tensor shape. +| CL_TENSOR_DTYPE | cl_tensor_type | Return the tensor data type. + +| CL_TENSOR_COMMAND_BUFFER_TEMPORARY | cl_bool | Return true if the +tensor is temporary tensor for command buffers. + +| CL_TENSOR_BOUND_TO_BUFFER | cl_bool | Return true if the tensor is +bound to a buffer. If CL_TENSOR_COMMAND_BUFFER_TEMPORARY is true, then +CL_TENSOR_BOUND_TO_BUFFER must return false. + +| CL_TENSOR_BUFFER | cl_mem a| If CL_TENSOR_BOUND_TO_BUFFER is true, +return the buffer object the tensor is bound to. Otherwise, +clGetTensorInfo call returns: + +* CL_INVALID_MEM_OBJECT if the tensor is not bound to a buffer object. + +* CL_INVALID_PROPERTY otherwise. + +| CL_TENSOR_CONTEXT | cl_context | Return the context specified when + the tensor object is created. + +| CL_TENSOR_REFERENCE_COUNT | cl_uint | Return the tensor reference +count. +|=== + +To read from a tensor to host memory / buffer object or to write to a +tensor object from host memory / buffer object call one of the functions. + +[source,c] +---- +cl_int clEnqueueReadTensor( + cl_command_queue command_queue, + cl_tensor tensor, + cl_bool blocking_command, + cl_mem buffer, + void* host_ptr, + cl_uint num_events_in_wait_list, + const cl_event* event_wait_list, + cl_event* event); +---- + +[source,c] +---- +cl_int clEnqueueWriteTensor( + cl_command_queue command_queue, + cl_tensor tensor, + cl_bool blocking_command, + cl_mem buffer, + void* host_ptr, + cl_uint num_events_in_wait_list, + const cl_event* event_wait_list, + cl_event* event); +---- + +* _command_queue_ is a valid host command-queue in which the read / + write command will be queued. _command_queue_ and _tensor_ must be + created with the same OpenCL context. + +* _tensor_ refers to a valid tensor object which is bound to a buffer. + +* _blocking_command_ indicate if the read and write operations are + blocking or non-blocking (see below). + +* _buffer_ refers to a valid buffer object where data is to be + read into or to be written from when the value of _host_ptr_ is + NULL. If _host_ptr_ is non-NULL then value of _buffer_ is ignored. 
+ +* _host_ptr_ is the pointer to buffer in host memory where data is to + be read into or to be written from when the value is non-NULL. + +* _event_wait_list_ and _num_events_in_wait_list_ specify events that + need to complete before this particular command can be executed. If + _event_wait_list_ is NULL, then this particular command does not + wait on any event to complete. If _event_wait_list_ is NULL, + _num_events_in_wait_list_ must be 0. If _event_wait_list_ is not + NULL, the list of events pointed to by _event_wait_list_ must be + valid and _num_events_in_wait_list_ must be greater than 0. The + events specified in _event_wait_list_ act as synchronization + points. The context associated with events in _event_wait_list_ and + _command_queue_ must be the same. The memory associated with + _event_wait_list_ can be reused or freed after the function returns. + +* _event_ returns an event object that identifies this read / write + command and can be used to query or queue a wait for this command to + complete. If _event_ is NULL or the enqueue is unsuccessful, no + event will be created and therefore it will not be possible to query + the status of this command or to wait for this command to + complete. If _event_wait_list_ and _event_ are not NULL, _event_ + must not refer to an element of the _event_wait_list_ array. + +For a read and write operation, the elements of N-dimensional tensor are +related to host memory / buffer object as followed: + +---- +tensor.element(i0, i1, ..., i, i)) == (tensor.dtype)buffer_or_host_ptr[ + i0 * tensor.shape[1] * tensor.shape[2] * ... * tensor.shape[N-1] + + i1 * tensor.shape[2] * tensor.shape[3] * ... * tensor.shape[N-1] + + ... + + i * tensor.shape[i(N-1)] + + i] +---- + +Where `iX` is a tensor coordinate index with inclusive range of `0..`. + +// TODO: add clEnqueueCopyTensor + +// TODO: add clEnqueueFillTensor? + +// TODO: add command buffer variants for clEnqueue{copy,read,write}Tensor. + + +==== Add New Buffer Property in Section 5.2.1 + +[cols="2,1,2",stripes=odd] +|=== +| CL_MEM_BIND_TO_TENSOR | cl_tensor a| Use the created buffer as +storage for the given valid tensor. To succeed creating the buffer, +the target tensor may not have storage already, must not have +CL_TENSOR_COMMAND_BUFFER_TEMPORARY property set on and _size_ argument +of the clCreateBufferWithProperties() must be zero. + +Size of the memory buffer is implementation-defined and it can be +queried with clGetTensorInfo(). + +Memory layout of the tensor in the created memory buffer is +implementation-defined and opaque to the applications and it may +change at unspecified points. Implementation may store auxiliary data +in the memory buffer for the tensor. Therefore, writing data into the +memory buffer directly using the cl_mem handle leads to undefined +behavior. + +If the tensor is already bound to a buffer object, +clCreateBufferWithProperties call returns CL_TENSOR_BOUND_TO_BUFFER +error code. +|=== + +=== Sample Codes + +Helper functions used in the follow up tensor code samples: + +[source,c] +---- +cl_kernel create_matmul_kernel( + cl_context ctx, std::span device_span, + cl_tensor lhs, cl_tensor rhs, cl_tensor out) { + // A hypothetical matmul kernel signature in pseudo OpenCL C for + // illustrative purposes: + // + // kernel void matmul( + // global read_only tensor_t, + // global read_only tensor_t, + // global write_only tensor_t); + + cl_kernel matmul_kernel = /* Omitted. 
*/; + clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor), &lhs); + clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor), &rhs); + clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor), &out); + return matmul_kernel; +} + +cl_kernel create_matmul_kernel( + cl_context ctx, std::span device_span, + cl_tensor lhs, cl_tensor rhs, cl_tensor out) { + // A hypothetical add kernel signature in pseudo OpenCL C for illustrative + // purposes: + // + // kernel void add( + // global read_only tensor_t, + // global read_only tensor_t, + // global write_only tensor_t); + + cl_tensor add_kernel = /* Omitted. */; + clSetKernelArg(add_kernel, 0, sizeof(cl_tensor), &lhs); + clSetKernelArg(add_kernel, 1, sizeof(cl_tensor), &rhs); + clSetKernelArg(add_kernel, 2, sizeof(cl_tensor), &out); + return add_kernel; +} +---- +An example usage of tensors on a command queue: + +[source,c] +---- +constexpr size_t b = 64, m = 100, n = 200, k = 50; + +cl_tensor in0 = clCreateTensor(ctx, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, err); +cl_tensor in1 = clCreateTensor(ctx, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, err); +cl_tensor in2 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); +cl_tensor t0 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); +cl_tensor out = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); + +cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0); +cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out); + +// Allocate storage for the tensors. The buffer size must be set to zero +// when the buffer is bound to a tensor. OpenCL implementation may +// determine optimal data layout and the storage needed for it, based +// on the tensor's uses (matmul kernel in this sample) so far. +cl_int err; +cl_mem in0_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_BIND_TO_TENSOR, in0, 0}, CL_MEM_READ_ONLY, + 0 /* must be zero for CL_MEM_BIND_TO_TENSOR. */, nullptr, &err); +cl_mem in1_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_BIND_TO_TENSOR, in1, 0}, CL_MEM_READ_ONLY, + 0, nullptr, &err); +cl_mem in2_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_BIND_TO_TENSOR, in2, 0}, CL_MEM_READ_ONLY, + 0, nullptr, &err); +cl_mem t0_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE, + 0, nullptr, &err); +cl_mem out_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_BIND_TO_TENSOR, out, 0}, CL_MEM_WRITE_ONLY, + 0, nullptr, &err); + +std::vector in0_data = ...; +std::vector in1_data = ...; +std::vector out_data(b * m * n); + +// Copies data into in0 tensor while possibly rearranging the data to the +// optimal data layout. +clEnqueueWriteTensor( + cmd_q, in0, false, nullptr, nullptr, {b, m, k}, nullptr, in0_data.data(), + 0, nullptr, nullptr); + +clEnqueueWriteTensor( + cmd_q, in1, false, nullptr, nullptr, {b, k, n}, nullptr, in1_data.data(), + 0, nullptr, nullptr); +clEnqueueNDRangeKernel( + cmd_q, matmul_kernel, 0, nullptr, nullptr, nullptr, 0, nullptr, nullptr); +clEnqueueNDRangeKernel( + cmd_q, add_kernel, 0, nullptr, nullptr, nullptr, 0, nullptr, nullptr); +clEnqueueReadTensor( + cmd_q, out, false, nullptr, nullptr, {b, m, n}, nullptr, out_data.data(), + 0, nullptr, nullptr); +---- + +An example use of tensors in a command buffer when cl_khr_command_buffer +extension is supported: + +[source,c] +---- +constexpr size_t b = 64, m = 100, n = 200, k = 50; + +cl_int err; +// Create tensors which are used as temporaries in a command buffer. +// Command buffers allocate space for them as needed. 
+//
+// NOTE: same temporary tensor handle used in multiple command buffers
+//       will have separate storage. IOW, command buffers may not exchange
+//       data via temporary buffers between them.
+cl_tensor in0 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0},
+  3, {b, m, k}, CL_TENSOR_FLOAT, err);
+cl_tensor in1 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0},
+  3, {b, k, n}, CL_TENSOR_FLOAT, err);
+cl_tensor in2 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0},
+  3, {b, m, n}, CL_TENSOR_FLOAT, err);
+cl_tensor t0  = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0},
+  3, {b, m, n}, CL_TENSOR_FLOAT, err);
+cl_tensor out = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0},
+  3, {b, m, n}, CL_TENSOR_FLOAT, err);
+
+cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0);
+cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out);
+
+// Binding a buffer to temporary tensor is not allowed.
+auto ignored = clCreateBufferWithProperties(
+  ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE, 0, nullptr, &err);
+assert(err == CL_TENSOR_IS_TEMPORARY)
+
+std::vector<float> in0_data = ...;
+std::vector<float> in1_data = ...;
+std::vector<float> out_data(b * m * n);
+
+cl_command_buffer_khr cmd_b =
+  clCreateCommandBufferKHR(num_queues, queue_list, nullptr, &err);
+
+cl_sync_point_khr in0_syncp, in1_syncp, matmul_syncp, add_syncp;
+clCommandWriteTensorKHR(
+  cmd_b, cmd_q, in0, false, nullptr, nullptr, {b, m, k}, nullptr,
+  in0_data.data(), 0, nullptr, &in0_syncp);
+clCommandWriteTensorKHR(
+  cmd_b, cmd_q, in1, false, nullptr, nullptr, {b, k, n}, nullptr,
+  in1_data.data(), 0, nullptr, &in1_syncp);
+clCommandNDRangeKernelKHR(
+  cmd_b, cmd_q, nullptr, matmul_kernel, 0, nullptr, nullptr, nullptr,
+  2, {in0_syncp, in1_syncp}, &matmul_syncp, nullptr);
+clCommandNDRangeKernelKHR(
+  cmd_b, cmd_q, nullptr, add_kernel, 0, nullptr, nullptr, nullptr,
+  1, {matmul_syncp}, &add_syncp, nullptr);
+clCommandReadTensorKHR(
+  cmd_b, cmd_q, out, false, nullptr, nullptr, {b, m, n}, nullptr,
+  out_data.data(), 1, {add_syncp}, nullptr);
+
+// Finalize the command buffer. At this point the OpenCL
+// implementation may reserve enough storage for all the tensor
+// temporaries. Temporary tensors might be eliminated - for example,
+// the OpenCL implementation could use the 'out' tensor to store the
+// result of matmul_kernel, thus eliminating the need for the 't0' tensor.
+clFinalizeCommandBufferKHR(cmd_b);
+
+// Temporary tensors used in a command buffer can't be read or written
+// outside of it. One reason is that the finalized command buffer
+// might not use some of the tensors at all.
+assert(clEnqueueReadTensor(..., t0, ...) == CL_INVALID_OPERATION);
+----
+
+=== Open Questions ===
diff --git a/ext/cl_khr_tensor.html b/ext/cl_khr_tensor.html
new file mode 100644
index 00000000..87892548
--- /dev/null
+++ b/ext/cl_khr_tensor.html
@@ -0,0 +1,1228 @@
+ + + \ No newline at end of file From 1f0be1eb7b6ac4a0f6131569708940e1b6b87544 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 14:16:43 +0200 Subject: [PATCH 02/26] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Ben Ashbaugh Co-authored-by: Pekka Jääskeläinen --- ext/cl_khr_tensor.asciidoc | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index cd17a42b..1df37e9e 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -51,7 +51,7 @@ cl_tensor clCreateTensor( cl_context context, const cl_tensor_peoperties *properties, size_t rank, - size_t shape, + const size_t* shape, cl_tensor_type dtype, cl_int *errcode_ret); ---- @@ -88,7 +88,7 @@ storage. The storage is assigned to the tensor either by: set on. A command that refers to a tensor must be bound to a valid buffer -object before enqueuing the command into a command queue unless the +object before enqueuing the command that refers to the tensor into a command queue unless the command is recorded in a command buffer and CL_TENSOR_COMMAND_BUFFER_TEMPORARY is set to true. @@ -124,7 +124,7 @@ following error values returned in errcode_ret: | CL_TENSOR_UINT16 | 16-bit signed integer. | CL_TENSOR_UINT32 | 32-bit signed integer. | CL_TENSOR_UINT64 | 64-bit signed integer. -| CL_TENSOR_HALF | Half precision floating-point value. +| CL_TENSOR_HALF | Half precision floating-point. | CL_TENSOR_BFLOAT16 | 16-bit brain floating-point value. | CL_TENSOR_FLOAT | Single precision floating-point value. | CL_TENSOR_DOUBLE | Double precision floating-point value. @@ -144,7 +144,7 @@ following error values returned in errcode_ret: a| If the value is true, create a "temporary" tensor that only can be used on commands recorded in command buffers. The storage of the temporary tensors are managed by command buffers. When a temporary -tensor is used by multiple command buffer, the tensor receive separate +tensor is used by multiple command buffers, the tensor receives separate storage for each command buffer. // IOW, Data may not be exchanged between command buffers through @@ -171,7 +171,7 @@ The _tensor_ reference count is incremented. *clRetainTensor* returns CL_SUCCESS if the function is executed successfully. Otherwise, it returns one of the following errors: -* CL_INVALID_TENSOR if tensor is not a valid tensor object. +* CL_INVALID_TENSOR if the tensor is not a valid tensor object. To release a tensor object, call the function @@ -242,7 +242,7 @@ cl_int clGetTensorInfo( | CL_TENSOR_DTYPE | cl_tensor_type | Return the tensor data type. | CL_TENSOR_COMMAND_BUFFER_TEMPORARY | cl_bool | Return true if the -tensor is temporary tensor for command buffers. +tensor is a temporary tensor for command buffers. | CL_TENSOR_BOUND_TO_BUFFER | cl_bool | Return true if the tensor is bound to a buffer. If CL_TENSOR_COMMAND_BUFFER_TEMPORARY is true, then @@ -263,8 +263,8 @@ clGetTensorInfo call returns: count. |=== -To read from a tensor to host memory / buffer object or to write to a -tensor object from host memory / buffer object call one of the functions. +The following functions are for reading from a tensor to host memory / buffer object or to write to a +tensor object from host memory / buffer object. 
[source,c] ---- @@ -286,7 +286,7 @@ cl_int clEnqueueWriteTensor( cl_tensor tensor, cl_bool blocking_command, cl_mem buffer, - void* host_ptr, + const void* host_ptr, cl_uint num_events_in_wait_list, const cl_event* event_wait_list, cl_event* event); @@ -329,10 +329,10 @@ cl_int clEnqueueWriteTensor( must not refer to an element of the _event_wait_list_ array. For a read and write operation, the elements of N-dimensional tensor are -related to host memory / buffer object as followed: +related to host memory / buffer object as follows: ---- -tensor.element(i0, i1, ..., i, i)) == (tensor.dtype)buffer_or_host_ptr[ +tensor.element(i0, i1, ..., i, i) == (tensor.dtype)buffer_or_host_ptr[ i0 * tensor.shape[1] * tensor.shape[2] * ... * tensor.shape[N-1] + i1 * tensor.shape[2] * tensor.shape[3] * ... * tensor.shape[N-1] + ... + @@ -505,7 +505,7 @@ cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out); // Binding a buffer to temporary tensor is not allowed. auto ignored = clCreateBufferWithProperties( ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE, 0, nullptr, &err); -assert(err == CL_TENSOR_IS_TEMPORARY) +assert(err == CL_TENSOR_IS_TEMPORARY); std::vector in0_data = ...; std::vector in1_data = ...; From a801aaf4fcec40d31203799de9a3390bf427f957 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 09:15:20 +0200 Subject: [PATCH 03/26] * Add brief introduction. * cl_khr_tensor -> cl_exp_tensor. * Remove cl_khr_command_buffer requirement. --- ext/cl_khr_tensor.asciidoc | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 1df37e9e..05c7ad52 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -1,20 +1,25 @@ // Copyright 2023 The Khronos Group. This work is licensed under a // Creative Commons Attribution 4.0 International License; see // http://creativecommons.org/licenses/by/4.0/ -= cl_khr_tensor += cl_exp_tensor :source-highlighter: coreray -[[cl_khr_tensor]] +[[cl_exp_tensor]] == Tensor Data Type -Purpose of this extension is to provide ... +This extension provides a new opaque OpenCL datatype called +`cl_tensor`. It is used for storing N-dimensional tensor data in +implementation-defined memory layout which may be optimized based on +tensor's use cases. The datatype is designed to be efficiently used +within the `cl_khr_command_buffers` extension to capture task graphs +which can utilize tensors as input, output and temporary storage. === General information ==== Name Strings -`cl_khr_tensor` +`cl_exp_tensor` ==== Version history @@ -30,8 +35,6 @@ This extension is written against the OpenCL Specification version 3.0.14. This extension requires OpenCL 1.2 or later. -This extension requires cl_khr_command_buffer. - ==== Contributors Henry Linjamäki, Intel. + From b890c30db0c532169d5133bd5599e1af793214ca Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 09:18:38 +0200 Subject: [PATCH 04/26] Add contributors --- ext/cl_khr_tensor.asciidoc | 2 ++ 1 file changed, 2 insertions(+) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 05c7ad52..5cba054c 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -38,6 +38,8 @@ This extension requires OpenCL 1.2 or later. ==== Contributors Henry Linjamäki, Intel. + +Pekka Jääslkeläinen, Intel and Tampere University. + +Ben Ashbaugh, Intel. 
+ === Overview From fafb30b0dc9ec4381956009a901cd0a57644c9ce Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 09:19:16 +0200 Subject: [PATCH 05/26] * Fix name for add kernel creator --- ext/cl_khr_tensor.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 5cba054c..0115b054 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -403,7 +403,7 @@ cl_kernel create_matmul_kernel( return matmul_kernel; } -cl_kernel create_matmul_kernel( +cl_kernel create_add_kernel( cl_context ctx, std::span device_span, cl_tensor lhs, cl_tensor rhs, cl_tensor out) { // A hypothetical add kernel signature in pseudo OpenCL C for illustrative From 740f3f22d8d043f5acf409cffc8d207932a81558 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 09:21:51 +0200 Subject: [PATCH 06/26] * cl_tensor_type -> cl_tensor _datatype. * Fix signed -> unsigned. * Single line cl{Retain,Release}TensorObject declaration. --- ext/cl_khr_tensor.asciidoc | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 0115b054..bed45d97 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -56,8 +56,8 @@ cl_tensor clCreateTensor( cl_context context, const cl_tensor_peoperties *properties, size_t rank, - const size_t* shape, - cl_tensor_type dtype, + const size_t shape, + cl_tensor_datatype dtype, cl_int *errcode_ret); ---- @@ -125,17 +125,17 @@ following error values returned in errcode_ret: | CL_TENSOR_INT16 | 16-bit signed integer. | CL_TENSOR_INT32 | 32-bit signed integer. | CL_TENSOR_INT64 | 64-bit signed integer. -| CL_TENSOR_UINT8 | 8-bit signed integer. -| CL_TENSOR_UINT16 | 16-bit signed integer. -| CL_TENSOR_UINT32 | 32-bit signed integer. -| CL_TENSOR_UINT64 | 64-bit signed integer. +| CL_TENSOR_UINT8 | 8-bit unsigned integer. +| CL_TENSOR_UINT16 | 16-bit unsigned integer. +| CL_TENSOR_UINT32 | 32-bit unsigned integer. +| CL_TENSOR_UINT64 | 64-bit unsigned integer. | CL_TENSOR_HALF | Half precision floating-point. -| CL_TENSOR_BFLOAT16 | 16-bit brain floating-point value. -| CL_TENSOR_FLOAT | Single precision floating-point value. -| CL_TENSOR_DOUBLE | Double precision floating-point value. -| CL_TENSOR_COMPLEX64 | 64-bit complex floating point value with +| CL_TENSOR_BFLOAT16 | 16-bit brain floating-point. +| CL_TENSOR_FLOAT | Single precision floating-point. +| CL_TENSOR_DOUBLE | Double precision floating-point. +| CL_TENSOR_COMPLEX64 | 64-bit complex floating point with 32-bit real and imaginary part. -| CL_TENSOR_COMPLEX128 | 128-bit complex floating point value with +| CL_TENSOR_COMPLEX128 | 128-bit complex floating point with 64-bit real and imaginary part. |=== @@ -165,8 +165,7 @@ To retain a tensor object, call the function [source,c] ---- -cl_int clRetainTensorObject( - cl_tensor tensor); +cl_int clRetainTensorObject(cl_tensor tensor); ---- * _tensor_ is the tensor object to be retained. @@ -182,8 +181,7 @@ To release a tensor object, call the function [source,c] ---- -cl_int clReleaseTensorObject( - cl_tensor tensor); +cl_int clReleaseTensorObject(cl_tensor tensor); ---- * _tensor_ is the tensor object to be released. 
From 701daa3dc3f65a8d89f4608d75a7adf6f489f26f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 09:59:18 +0200 Subject: [PATCH 07/26] * clEnqueue(Read,Write)Tensor -> clEnqueue(TranslateFrom,TranslateTo)Tensor. * Clarify in clEnqueue{TranslateFrom,TranslateTo}Tensor that data read from / written to the tensor in opaque manner. --- ext/cl_khr_tensor.asciidoc | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index bed45d97..99e65370 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -271,7 +271,7 @@ tensor object from host memory / buffer object. [source,c] ---- -cl_int clEnqueueReadTensor( +cl_int clEnqueueTranslateFromTensor( cl_command_queue command_queue, cl_tensor tensor, cl_bool blocking_command, @@ -284,7 +284,7 @@ cl_int clEnqueueReadTensor( [source,c] ---- -cl_int clEnqueueWriteTensor( +cl_int clEnqueueTranslateToTensor( cl_command_queue command_queue, cl_tensor tensor, cl_bool blocking_command, @@ -331,8 +331,14 @@ cl_int clEnqueueWriteTensor( complete. If _event_wait_list_ and _event_ are not NULL, _event_ must not refer to an element of the _event_wait_list_ array. -For a read and write operation, the elements of N-dimensional tensor are -related to host memory / buffer object as follows: +The *clEnqueueTranslateToTensor* function copies contents of the buffer +object / host allocation to tensor's storage in +implementation-defined, opaque memory layout. The +*clEnqueueTranslateFromTensor* function copies data from tensor's +storage to buffer object / host allocation. + +The elements of buffer object / host allocation are mapped to tensor +coordinates as follows: ---- tensor.element(i0, i1, ..., i, i) == (tensor.dtype)buffer_or_host_ptr[ @@ -343,7 +349,11 @@ tensor.element(i0, i1, ..., i, i) == (tensor.dtype)buffer_or_host_ptr[ i] ---- -Where `iX` is a tensor coordinate index with inclusive range of `0..`. +Where `iX` is a tensor coordinate index with inclusive range of +`0..`. The `tensor.element()` represents an abstract +function that accesses a tensor element in its storage at given +coordinate. The method how the coordinates translate to tensor storage +addresses is unspecified. // TODO: add clEnqueueCopyTensor From aa9ead742ab84740e11c4ff371d51e0ebf94a538 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 13:38:29 +0200 Subject: [PATCH 08/26] Refactor command buffer temporary property out of tensor --- ext/cl_khr_tensor.asciidoc | 139 +++++++++++++++++++++---------------- 1 file changed, 78 insertions(+), 61 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 99e65370..0de088c7 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -66,7 +66,8 @@ cl_tensor clCreateTensor( * _properties_ is an optional list of properties for the tensor object and their corresponding values. The list is terminated with the special property 0. If no properties are required, properties may be - NULL. + NULL. This extension does not define any optional properties for + tensors. * _rank_ is the number of dimensions. Zero value creates a "scalar" tensor which has no dimensions but has storage for one element. @@ -84,18 +85,11 @@ cl_tensor clCreateTensor( clCreateTensor function creates a `rank`-dimensional tensor with `shape[0] * shape[1] * ... * shape[rank-1]` elements of _dtype_ type. At the creation time of the tensor, it does not have -storage. 
The storage is assigned to the tensor either by: - -* calling clCreateBufferWithProperties() with CL_MEM_BIND_TO_TENSOR or - -* automatically by command buffers - possibly on-demand basis - if the - tensor is created with CL_TENSOR_COMMAND_BUFFER_TEMPORARY property - set on. +storage. The storage is assigned to the tensor by calling +clCreateBufferWithProperties() with CL_MEM_BIND_TO_TENSOR. A command that refers to a tensor must be bound to a valid buffer -object before enqueuing the command that refers to the tensor into a command queue unless the -command is recorded in a command buffer and -CL_TENSOR_COMMAND_BUFFER_TEMPORARY is set to true. +object before enqueuing or recording the command. *clCreateTensor* returns a valid non-zero tensor object and errcode_ret is set to CL_SUCCESS if the tensor object is created @@ -139,28 +133,6 @@ following error values returned in errcode_ret: 64-bit real and imaginary part. |=== -.Tensor properties -[cols="2,1,2",stripes=odd] -|=== -| *Tensor Property* | *Property Value* | *Description* - -| CL_TENSOR_COMMAND_BUFFER_TEMPORARY | cl_bool - -a| If the value is true, create a "temporary" tensor that only can be -used on commands recorded in command buffers. The storage of the -temporary tensors are managed by command buffers. When a temporary -tensor is used by multiple command buffers, the tensor receives separate -storage for each command buffer. - -// IOW, Data may not be exchanged between command buffers through -// temporary tensors. - -Temporary tensors may not be bound to buffer objects. - -Data stored in temporary tensors are not preserved across command -buffer executions. -|=== - To retain a tensor object, call the function [source,c] @@ -244,12 +216,8 @@ cl_int clGetTensorInfo( | CL_TENSOR_SHAPE | size_t[] | Return the tensor shape. | CL_TENSOR_DTYPE | cl_tensor_type | Return the tensor data type. -| CL_TENSOR_COMMAND_BUFFER_TEMPORARY | cl_bool | Return true if the -tensor is a temporary tensor for command buffers. - | CL_TENSOR_BOUND_TO_BUFFER | cl_bool | Return true if the tensor is -bound to a buffer. If CL_TENSOR_COMMAND_BUFFER_TEMPORARY is true, then -CL_TENSOR_BOUND_TO_BUFFER must return false. +bound to a buffer. | CL_TENSOR_BUFFER | cl_mem a| If CL_TENSOR_BOUND_TO_BUFFER is true, return the buffer object the tensor is bound to. Otherwise, @@ -366,11 +334,34 @@ addresses is unspecified. [cols="2,1,2",stripes=odd] |=== +| CL_MEM_COMMAND_BUFFER_TEMPORARY | cl_bool + +a| This property can be set if *cl_khr_command_buffer* extension is +supported. + +If the value is true, create a "temporary" buffer object that only can +be used on commands recorded in command buffers. Non-recording +command enqueue functions must return CL_INVALID_OPERATION if the +command refers to a temporary buffer object. + +The temporary buffer objects are managed by command buffers. When a +temporary buffer object is used by multiple command buffer, the object +receives disjoint storage for each command buffer. + +// Consequently, Data may not be exchanged between command buffers through +// temporary buffers. + +Storage of the temporary buffer objects may be allocated on-demand +basis. At the times the buffer is not needed, OpenCL implementations +may reuse storage for other tasks within the command buffer. + +Contents of the temporary buffers are not guaranteed to be preserved +across command buffer executions. + | CL_MEM_BIND_TO_TENSOR | cl_tensor a| Use the created buffer as storage for the given valid tensor. 
To succeed creating the buffer, -the target tensor may not have storage already, must not have -CL_TENSOR_COMMAND_BUFFER_TEMPORARY property set on and _size_ argument -of the clCreateBufferWithProperties() must be zero. +the target tensor may not have storage already and _size_ +argument of the clCreateBufferWithProperties() must be zero. Size of the memory buffer is implementation-defined and it can be queried with clGetTensorInfo(). @@ -387,6 +378,26 @@ clCreateBufferWithProperties call returns CL_TENSOR_BOUND_TO_BUFFER error code. |=== +==== Add New Memory Object Query in Section 5.5.5 + +[cols="2,1,2",stripes=odd] +|=== +| CL_MEM_COMMAND_BUFFER_TEMPORARY | cl_bool | This property can be +queried if *cl_khr_command_buffer* extension is supported. + +Return true if the _memobj_ is temporary buffer object for command +buffers. +|=== + +==== Add New Error Codes in Appendix F + +[cols="2,3", stripes=odd] +|=== +| CL_TENSOR_BOUND_TO_BUFFER | Returned when attempting to bind a + buffer object to a tensor which already has been bound to the same + or another. +|=== + === Sample Codes Helper functions used in the follow up tensor code samples: @@ -495,30 +506,36 @@ extension is supported: constexpr size_t b = 64, m = 100, n = 200, k = 50; cl_int err; -// Create tensors which are used as temporaries in a command buffer. -// Command buffers allocate space for them as needed. -// -// NOTE: same temporary tensor handle used in multiple command buffers -// will have separate storage. IOW, command buffers may not exchange -// data via temporary buffers between them. -cl_tensor in0 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0}, - 3, {b, m, k}, CL_TENSOR_FLOAT, err); -cl_tensor in1 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0}, - 3, {b, k, n}, CL_TENSOR_FLOAT, err); -cl_tensor in2 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0}, - 3, {b, m, n}, CL_TENSOR_FLOAT, err); -cl_tensor t0 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0}, - 3, {b, m, n}, CL_TENSOR_FLOAT, err); -cl_tensor out = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0}, - 3, {b, m, n}, CL_TENSOR_FLOAT, err); +cl_tensor in0 = clCreateTensor(ctx, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, err); +cl_tensor in1 = clCreateTensor(ctx, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, err); +cl_tensor in2 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); +cl_tensor t0 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); +cl_tensor out = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0); cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out); -// Binding a buffer to temporary tensor is not allowed. -auto ignored = clCreateBufferWithProperties( - ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE, 0, nullptr, &err); -assert(err == CL_TENSOR_IS_TEMPORARY); +// Bind command buffer managed storage to tensors. +// +// NOTE: same temporary tensor handle used in multiple command buffers +// will have separate storage. IOW, command buffers may not exchange +// data via temporary buffers between them. +cl_mem in0_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in0, 0}, + CL_MEM_READ_ONLY, 0 /* must be zero for CL_MEM_BIND_TO_TENSOR. 
*/, + nullptr, &err); +cl_mem in1_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in1, 0}, + CL_MEM_READ_ONLY, 0, nullptr, &err); +cl_mem in2_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in2, 0}, + CL_MEM_READ_ONLY, 0, nullptr, &err); +cl_mem t0_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, t0, 0}, + CL_MEM_READ_WRITE, 0, nullptr, &err); +cl_mem out_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, out, 0}, + CL_MEM_WRITE_ONLY, 0, nullptr, &err); std::vector in0_data = ...; std::vector in1_data = ...; From 88a0a84709923d042b4b7dcbf19a36576e0b421f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 13:41:39 +0200 Subject: [PATCH 09/26] Fix cl_tensor_type -> cl_tensor_datatype --- ext/cl_khr_tensor.asciidoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 0de088c7..22a6cd00 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -56,7 +56,7 @@ cl_tensor clCreateTensor( cl_context context, const cl_tensor_peoperties *properties, size_t rank, - const size_t shape, + const size_t* shape, cl_tensor_datatype dtype, cl_int *errcode_ret); ---- @@ -212,9 +212,9 @@ cl_int clGetTensorInfo( .List of supported param_names by clGetTensorInfo [cols="2,1,2",stripes=odd] |=== -| CL_TENSOR_RANK | size_t | Return the tensor rank. -| CL_TENSOR_SHAPE | size_t[] | Return the tensor shape. -| CL_TENSOR_DTYPE | cl_tensor_type | Return the tensor data type. +| CL_TENSOR_RANK | size_t | Return the tensor rank. +| CL_TENSOR_SHAPE | size_t[] | Return the tensor shape. +| CL_TENSOR_DTYPE | cl_tensor_datatype | Return the tensor data type. | CL_TENSOR_BOUND_TO_BUFFER | cl_bool | Return true if the tensor is bound to a buffer. From 37fe00630d7932a5b871eeb56f1288ba207dc583 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 13:42:16 +0200 Subject: [PATCH 10/26] Add an open question --- ext/cl_khr_tensor.asciidoc | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 22a6cd00..e91f81df 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -575,3 +575,10 @@ assert(clEnqueueReadTensor(..., t0, ...) == CL_INVALID_OPERATION); ---- === Open Questions === + +. Should we have support for tensors with undefined shape and tensors + with unknown / symbolic dimension sizes like in ONNX? + +// https://onnx.ai/onnx/repo-docs/ShapeInference.html + +*UNRESOLVED* From 0a43252c1cbc5f5fec745fd83051a38cff822908 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 13:46:58 +0200 Subject: [PATCH 11/26] Add CL_INVALID_TENSOR error code --- ext/cl_khr_tensor.asciidoc | 2 ++ 1 file changed, 2 insertions(+) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index e91f81df..1b2a9686 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -396,6 +396,8 @@ buffers. | CL_TENSOR_BOUND_TO_BUFFER | Returned when attempting to bind a buffer object to a tensor which already has been bound to the same or another. +| CL_INVALID_TENSOR | Returned then the specified tensor is not a + valid tensor object. 
|=== === Sample Codes From d10d149267045549ffb1e16286d04445e3cadc68 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 13:59:25 +0200 Subject: [PATCH 12/26] Require either buffer or host_ptr to be non-NULL --- ext/cl_khr_tensor.asciidoc | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 1b2a9686..f1437dd3 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -272,12 +272,10 @@ cl_int clEnqueueTranslateToTensor( * _blocking_command_ indicate if the read and write operations are blocking or non-blocking (see below). -* _buffer_ refers to a valid buffer object where data is to be - read into or to be written from when the value of _host_ptr_ is - NULL. If _host_ptr_ is non-NULL then value of _buffer_ is ignored. - -* _host_ptr_ is the pointer to buffer in host memory where data is to - be read into or to be written from when the value is non-NULL. +* _buffer_ and _host_ptr_ refer to a valid buffer object / host + allocation where data is to be read into or to be written from. + Either the _buffer_ or _host_ptr_ can be non-NULL in which case the + non-NULL argument is used as the operand for the operation. * _event_wait_list_ and _num_events_in_wait_list_ specify events that need to complete before this particular command can be executed. If From f40eedaa57a2252d4e00ca66cf3a49969b700fb3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 14:27:14 +0200 Subject: [PATCH 13/26] Regenerate html for cl_exp_tensor --- ext/cl_khr_tensor.html | 298 +++++++++++++++++++++++------------------ 1 file changed, 168 insertions(+), 130 deletions(-) diff --git a/ext/cl_khr_tensor.html b/ext/cl_khr_tensor.html index 87892548..c232ddea 100644 --- a/ext/cl_khr_tensor.html +++ b/ext/cl_khr_tensor.html @@ -5,7 +5,7 @@ -cl_khr_tensor +cl_exp_tensor + +
= cl_exp_tensor

== Tensor Data Type

This extension provides a new buffer abstraction - tensor objects - for
managing N-dimensional data.

=== General information

XXX - Not complete yet!!!

==== Name Strings

`cl_exp_tensor`
==== Contact

TODO

==== Contributors

Henry Linjamäki, Intel. +
Pekka Jääskeläinen, Intel and Tampere University. +
Ben Ashbaugh, Intel.

==== Notice

TODO

==== Status

Draft spec, NOT APPROVED!!

==== Version

Built On: 2024-08-14 +
Version: 0.2.0

==== Dependencies

This extension is written against the OpenCL Specification version 3.0.14.

This extension requires OpenCL 1.2 or later.

=== Overview

The extension provides a new tensor object abstraction. Tensor objects
are similar to image types in that they represent N-dimensional data of
an application-chosen data type and may be mapped to dedicated
hardware, except that:

* higher than 3-dimensional data can be supported (limited by the
  devices' capabilities).

* applications may choose how the data elements of the tensors are
  laid out in the buffers using the tensor layout descriptions
  provided in this extension.

Applications may also leave the memory layouts of the tensors
implementation-specified, letting the driver optimize the tensor data
layout for better performance or lay out the data as required by
hardware functions (e.g. exposed via built-in kernels).

The scope of this extension is to provide host APIs for creating
tensor objects and transferring data between tensors, the host and
other memory objects.

A separate extension implemented on top of this extension,
cl_exp_defined_builtin_kernels, provides "defined built-in kernels"
(DBKs) which operate on tensors. It also provides a mechanism for
drivers to create DBKs that are optimized for the tensor arguments
they operate on.
=== New API Functions

[source,c]
----
cl_int clEnqueueImportFromTensorEXP(
  cl_command_queue command_queue,
  cl_tensor tensor,
  cl_bool blocking_command,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  void* host_ptr,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);

cl_int clEnqueueExportToTensorEXP(
  cl_command_queue command_queue,
  cl_tensor tensor,
  cl_bool blocking_command,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  const void* host_ptr,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);

cl_int clEnqueueCopyTensorEXP(
  cl_command_queue command_queue,
  cl_tensor src_tensor,
  cl_tensor dst_tensor,
  const cl_tensor_shape* src_origin,
  const cl_tensor_shape* dst_origin,
  const cl_tensor_shape* region,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);

cl_int clCommandImportFromTensorEXP(
  cl_command_buffer_khr command_buffer,
  cl_command_queue command_queue,
  cl_tensor tensor,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  void* host_ptr,
  cl_uint num_sync_points_in_wait_list,
  const cl_sync_point_khr* sync_point_wait_list,
  cl_sync_point_khr* sync_point,
  cl_mutable_command_khr* mutable_handle);

cl_int clCommandExportToTensorEXP(
  cl_command_buffer_khr command_buffer,
  cl_command_queue command_queue,
  cl_tensor tensor,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  const void* host_ptr,
  cl_uint num_sync_points_in_wait_list,
  const cl_sync_point_khr* sync_point_wait_list,
  cl_sync_point_khr* sync_point,
  cl_mutable_command_khr* mutable_handle);
----
=== New API Types

[source,c]
----
typedef cl_uint cl_tensor_layout_type_exp;
typedef cl_uint cl_tensor_dim_exp;
typedef cl_uint cl_tensor_layout_ml_type_exp;
typedef cl_properties cl_tensor_properties_exp;

#define CL_TENSOR_DESC_MAX_RANK_EXP       20u
#define CL_TENSOR_DESC_MAX_PROPERTIES_EXP 16u

typedef struct cl_tensor_desc_exp {
    cl_uint                   rank;
    cl_tensor_datatype        dtype;
    cl_tensor_properties_exp  properties[CL_TENSOR_DESC_MAX_PROPERTIES_EXP];
    cl_tensor_shape           shape[CL_TENSOR_DESC_MAX_RANK_EXP];
    const void*               layout;
    cl_tensor_layout_type_exp layout_type;
} cl_tensor_desc_exp;

typedef struct cl_tensor_layout_blas_exp {
    cl_tensor_dim_exp    leading_dims[CL_TENSOR_DESC_MAX_RANK_EXP];
} cl_tensor_layout_blas_exp;

typedef struct cl_tensor_layout_blas_pitched_exp {
    cl_tensor_dim_exp    leading_dims[CL_TENSOR_DESC_MAX_RANK_EXP];
    cl_tensor_stride     leading_strides[CL_TENSOR_DESC_MAX_RANK_EXP];
} cl_tensor_layout_blas_pitched_exp;

typedef struct cl_tensor_layout_ml_exp {
    cl_tensor_layout_ml_type_exp ml_type;
} cl_tensor_layout_ml_exp;
----

=== New API Enums

Accepted value for the _properties_ parameter to
*clCreateBufferWithProperties* for creating a tensor object:
[source,c]
----
CL_MEM_TENSOR_EXP               0x????
----

Accepted values for the _param_name_ parameter to *clGetDeviceInfo*:

[source,c]
----
CL_DEVICE_MAX_TENSOR_ARGS_EXP     0x????
CL_DEVICE_MAX_TENSOR_RANK_EXP     0x????
CL_DEVICE_MAX_TENSOR_ELEMENTS_EXP 0x????
CL_DEVICE_MAX_TENSOR_STRIDE_EXP   0x????
----

Accepted values for the cl_tensor_datatype type:

[source,c]
----
CL_TENSOR_DTYPE_BOOL_EXP        0x????

CL_TENSOR_DTYPE_INT4_EXP        0x????
CL_TENSOR_DTYPE_INT8_EXP        0x????
CL_TENSOR_DTYPE_INT16_EXP       0x????
CL_TENSOR_DTYPE_INT32_EXP       0x????
CL_TENSOR_DTYPE_INT64_EXP       0x????

CL_TENSOR_DTYPE_UINT4_EXP       0x????
CL_TENSOR_DTYPE_UINT8_EXP       0x????
CL_TENSOR_DTYPE_UINT16_EXP      0x????
CL_TENSOR_DTYPE_UINT32_EXP      0x????
CL_TENSOR_DTYPE_UINT64_EXP      0x????

CL_TENSOR_DTYPE_FP8_EXP         0x????
CL_TENSOR_DTYPE_FP16_EXP        0x????
CL_TENSOR_DTYPE_FP32_EXP        0x????
CL_TENSOR_DTYPE_FP64_EXP        0x????

CL_TENSOR_DTYPE_BFLOAT16_EXP    0x????

CL_TENSOR_DTYPE_COMPLEX64_EXP   0x????
CL_TENSOR_DTYPE_COMPLEX128_EXP  0x????
----

Accepted values for cl_tensor_layout_type_exp:

[source,c]
----
CL_TENSOR_LAYOUT_OPAQUE_EXP       0x????
CL_TENSOR_LAYOUT_BLAS_EXP         0x????
CL_TENSOR_LAYOUT_BLAS_PITCHED_EXP 0x????
CL_TENSOR_LAYOUT_ML_EXP           0x????
----

Accepted values for cl_tensor_layout_ml_type_exp:

[source,c]
----
CL_TENSOR_LAYOUT_ML_C_EXP       0x????
CL_TENSOR_LAYOUT_ML_NC_EXP      0x????
CL_TENSOR_LAYOUT_ML_CN_EXP      0x????
CL_TENSOR_LAYOUT_ML_HW_EXP      0x????
CL_TENSOR_LAYOUT_ML_CHW_EXP     0x????
CL_TENSOR_LAYOUT_ML_NCHW_EXP    0x????
CL_TENSOR_LAYOUT_ML_NHWC_EXP    0x????
----

New error codes:
[source,c]
----
CL_INVALID_TENSOR_RANK_EXP   0x????
CL_INVALID_TENSOR_DTYPE_EXP  0x????
CL_INVALID_TENSOR_SHAPE_EXP  0x????
CL_INVALID_TENSOR_LAYOUT_EXP 0x????
----

=== Modifications to The OpenCL API Specification

(Modify Section 4.2, Querying Devices)

(Add the following to Table 5., List of supported param_names by clGetDeviceInfo)

[cols="2,1,2",stripes=odd]
|===
| *Device Info* | *Return Type* | *Description*

| CL_DEVICE_MAX_TENSOR_ARGS_EXP | cl_uint
| Maximum number of tensor object arguments that can be specified as
  arguments to a kernel.

| CL_DEVICE_MAX_TENSOR_RANK_EXP | cl_uint
| Maximum tensor rank. The minimum value is 4.

| CL_DEVICE_MAX_TENSOR_ELEMENTS_EXP | size_t
| Maximum number of tensor elements in total. The minimum value is
  65536.

| CL_DEVICE_MAX_TENSOR_PITCH_EXP | size_t
| Maximum pitch value for all pitch components for the
  CL_TENSOR_LAYOUT_BLAS_PITCHED_EXP memory layout. The minimum value is
  65536.
|===
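As an informal illustration of the queries above, the sketch below
reads the new device limits with *clGetDeviceInfo* and checks them
against an intended tensor shape. It assumes `device` is a valid
`cl_device_id` for an implementation that defines the
CL_DEVICE_MAX_TENSOR_*_EXP enumerators (their values are still
placeholders in this draft).

[source,c]
----
/* Sketch: query the tensor-related device limits added by this
 * extension and check a candidate 4-dimensional shape against them. */
cl_uint max_tensor_rank = 0;
size_t max_tensor_elements = 0;
cl_int err;

err = clGetDeviceInfo(device, CL_DEVICE_MAX_TENSOR_RANK_EXP,
                      sizeof(max_tensor_rank), &max_tensor_rank, NULL);
if (err == CL_SUCCESS)
  err = clGetDeviceInfo(device, CL_DEVICE_MAX_TENSOR_ELEMENTS_EXP,
                        sizeof(max_tensor_elements),
                        &max_tensor_elements, NULL);

/* A rank-4, 8 x 3 x 224 x 224 tensor is representable if
 * 4 <= max_tensor_rank and 8*3*224*224 <= max_tensor_elements. */
int shape_ok = (err == CL_SUCCESS) &&
               (4 <= max_tensor_rank) &&
               ((size_t)8 * 3 * 224 * 224 <= max_tensor_elements);
(void)shape_ok;
----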
(Modify Section 5.2.1, Creating Buffer Objects)

(Add the following to Table 18., Buffer creation properties)

[cols="2,1,2",stripes=odd]
|===
| *cl_mem_properties* | *Property Value* | *Description*

| CL_MEM_TENSOR_EXP | cl_tensor_desc_exp
a| Creates a tensor object with properties set in the
cl_tensor_desc_exp tensor description structure.

The _size_ parameter of clCreateBufferWithProperties() is ignored and
may be set to zero. The required storage space is inferred from the
tensor description. The storage size can be queried with
clGetMemObjectInfo(). The storage size may change during runtime
unless constrained by the given tensor description.
|===

(Add to the list of error codes for clCreateBufferWithProperties())

* CL_INVALID_VALUE if the CL_MEM_TENSOR_EXP property is specified and
  the rank member of the cl_tensor_desc_exp structure has an invalid or
  unsupported value.

* CL_INVALID_TENSOR_SHAPE_EXP if the CL_MEM_TENSOR_EXP property is
  specified and the shape member of the cl_tensor_desc_exp structure
  has an invalid or unsupported description.

* CL_INVALID_TENSOR_LAYOUT_TYPE_EXP if the CL_MEM_TENSOR_EXP property
  is specified and the layout_type member of the cl_tensor_desc_exp
  structure has an invalid enumeration constant.

* CL_INVALID_TENSOR_LAYOUT_EXP if the CL_MEM_TENSOR_EXP property is
  specified and the layout member of the cl_tensor_desc_exp has an
  invalid description.
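For illustration, the following non-normative sketch creates an opaque,
driver-laid-out 3-dimensional FP32 tensor through
clCreateBufferWithProperties. It assumes `ctx` is an existing
cl_context for a device that supports this extension. Note that this
draft does not yet pin down how the cl_tensor_desc_exp value is encoded
in the properties list; the sketch assumes a pointer to the descriptor
is passed as the property value, which is an assumption rather than
normative behavior.

[source,c]
----
#include <stdint.h>   /* uintptr_t, for the pointer-as-property cast */

/* Sketch: describe a 64 x 100 x 50 FP32 tensor with an opaque layout. */
cl_int err;
cl_tensor_desc_exp desc = {0};
desc.rank = 3;
desc.dtype = CL_TENSOR_DTYPE_FP32_EXP;
desc.properties[0] = 0;          /* no optional properties            */
desc.shape[0] = 64;
desc.shape[1] = 100;
desc.shape[2] = 50;
desc.layout = NULL;              /* ignored for the opaque layout      */
desc.layout_type = CL_TENSOR_LAYOUT_OPAQUE_EXP;

/* Assumption: the descriptor is passed by pointer as the value of the
 * CL_MEM_TENSOR_EXP property. */
cl_mem_properties props[] = {
  CL_MEM_TENSOR_EXP, (cl_mem_properties)(uintptr_t)&desc,
  0
};

/* size must be zero; the storage size is inferred from the description
 * and can be queried afterwards. */
cl_mem tensor_mem = clCreateBufferWithProperties(
  ctx, props, CL_MEM_READ_WRITE, 0, NULL, &err);

size_t storage_size = 0;
clGetMemObjectInfo(tensor_mem, CL_MEM_SIZE, sizeof(storage_size),
                   &storage_size, NULL);
----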
(Add the following to Section 5.2.2, Reading, Writing and Copying Buffer Objects)

The following functions read data from a tensor into host memory or a
buffer object, or write data to a tensor object from host memory or a
buffer object.

[source,c]
----
cl_int clEnqueueImportFromTensorEXP(
  cl_command_queue command_queue,
  cl_tensor tensor,
  cl_bool blocking_command,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  void* host_ptr,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);

cl_int clEnqueueExportToTensorEXP(
  cl_command_queue command_queue,
  cl_tensor tensor,
  cl_bool blocking_command,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  const void* host_ptr,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);
----

* _command_queue_ is a valid host command-queue in which the read /
  write command will be queued. _command_queue_ and _tensor_ must be
  created with the same OpenCL context.

* _tensor_ refers to a valid tensor object which is bound to a buffer.

* _blocking_command_ indicates if the read and write operations are
  blocking or non-blocking (see below).

* _tensor_origin_ defines the offset coordinates in the tensor for the
  start of the region to read / write tensor data. The length of the
  array must be at least the rank of the tensor.

* _mem_origin_ defines the offset coordinates in the memory region
  pointed to by _buffer_ or _host_ptr_, expressed in elements of the
  tensor data type. The length of the array must be at least the rank
  of the tensor.

* _region_ defines the region being read or written, expressed in
  elements of the tensor data type. The length of the array must be at
  least the rank of the tensor. If _region_ is NULL, the tensor's shape
  is used as the region.

* _mem_pitch_ defines the length of each dimension, in elements, to be
  used for the memory region of _buffer_ or _host_ptr_. The length of
  the array must be at least the rank of the tensor minus one. If
  _mem_pitch_ is NULL or _mem_pitch[i]_ is zero, _mem_pitch[i]_ is
  computed as _region[i + 1]_.

* _buffer_ and _host_ptr_ refer to a valid buffer object / host
  allocation where data is to be read into or to be written from.
  Either _buffer_ or _host_ptr_ can be non-NULL, in which case the
  non-NULL argument is used as the operand for the operation.

* _event_wait_list_ and _num_events_in_wait_list_ specify events that
  need to complete before this particular command can be executed. If
  _event_wait_list_ is NULL, then this particular command does not wait
  on any event to complete. If _event_wait_list_ is NULL,
  _num_events_in_wait_list_ must be 0. If _event_wait_list_ is not
  NULL, the list of events pointed to by _event_wait_list_ must be
  valid and _num_events_in_wait_list_ must be greater than 0. The
  events specified in _event_wait_list_ act as synchronization points.
  The context associated with events in _event_wait_list_ and
  _command_queue_ must be the same. The memory associated with
  _event_wait_list_ can be reused or freed after the function returns.

* _event_ returns an event object that identifies this read / write
  command and can be used to query or queue a wait for this command to
  complete. If _event_ is NULL or the enqueue is unsuccessful, no event
  will be created and therefore it will not be possible to query the
  status of this command or to wait for this command to complete. If
  _event_wait_list_ and _event_ are not NULL, _event_ must not refer to
  an element of the _event_wait_list_ array.

The *clEnqueueExportToTensorEXP* function copies the contents of the
buffer object / host allocation to the tensor's storage in an
implementation-defined, opaque memory layout. The
*clEnqueueImportFromTensorEXP* function copies data from the tensor's
storage to the buffer object / host allocation.

The elements of the buffer object / host allocation are mapped to
tensor coordinates, and vice versa, as follows in pseudo C code:
[source,c]
----
tensor_element(
  tensor,
  tensor_origin[0] + i[0],
  tensor_origin[1] + i[1],
  ...,
  tensor_origin[N-2] + i[N-2],
  tensor_origin[N-1] + i[N-1]) ==
((TENSOR_DATATYPE *)buffer_or_host_ptr)[
  (mem_origin[0] + i[0]) * pitch(0) +
  (mem_origin[1] + i[1]) * pitch(1) +
  ... +
  (mem_origin[N-2] + i[N-2]) * pitch(N-2) +
  (mem_origin[N-1] + i[N-1])];
----

Where N is the tensor rank, i[X] is a tensor coordinate with the
inclusive range 0..<region[X]-1> and the pitch is computed as follows
in pseudo C code:

[source,c]
----
size_t pitch(size_t dim) {
  size_t pitch = 1;
  for (size_t i = dim; i < tensor_rank - 1; i++)
    pitch *=
      (mem_pitch != NULL && mem_pitch[i] != 0) ? mem_pitch[i] : region[i + 1];
  return pitch;
}
----

For dim in 0..(tensor_rank()-1). The tensor_element() represents an
abstract function that accesses a tensor element in its storage at the
given coordinate. The method by which the coordinates translate to
tensor storage addresses is unspecified.

*clEnqueueImportFromTensorEXP* and *clEnqueueExportToTensorEXP* return
CL_SUCCESS if the function is executed successfully. Otherwise, they
return one of the following errors:

* CL_INVALID_COMMAND_QUEUE if _command_queue_ is not a valid host
  command-queue.

* CL_INVALID_CONTEXT if the context associated with _command_queue_
  and _buffer_ are not the same or if the context associated with
  _command_queue_ and events in _event_wait_list_ are not the same.

* CL_INVALID_MEM_OBJECT if _buffer_ is not a valid buffer object.

* CL_INVALID_VALUE if _tensor_origin_ or _mem_origin_ is NULL.

* CL_INVALID_VALUE if the region being read or written specified by
  (_mem_origin_, _region_, _mem_pitch_) is out of bounds.

* CL_INVALID_VALUE if any _region_ array element is 0.

* CL_INVALID_VALUE if _mem_pitch_ is not NULL and _mem_pitch[i]_ is
  not 0 and _mem_pitch[i]_ is less than _region[i]_.

* CL_INVALID_VALUE if _buffer_ and _host_ptr_ are both NULL or both
  non-NULL.

* CL_INVALID_EVENT_WAIT_LIST if _event_wait_list_ is NULL and
  _num_events_in_wait_list_ > 0, or _event_wait_list_ is not NULL and
  _num_events_in_wait_list_ is 0, or if event objects in
  _event_wait_list_ are not valid events.

* CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST if the read and write
  operations are blocking and the execution status of any of the
  events in _event_wait_list_ is a negative integer value.

* CL_OUT_OF_RESOURCES if there is a failure to allocate resources
  required by the OpenCL implementation on the device.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources
  required by the OpenCL implementation on the host.
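The sketch below shows a host round trip with these two functions:
writing a whole tensor from a packed host array and reading it back.
It is non-normative and assumes `queue` is a host command-queue and
`tensor` is the handle of a 64 x 100 x 50 FP32 tensor created as shown
earlier; how the cl_tensor handle relates to the buffer object created
with CL_MEM_TENSOR_EXP is still an open point in this draft, so the
handle is simply assumed to be available.

[source,c]
----
/* Sketch: upload a full 64 x 100 x 50 FP32 tensor from host memory and
 * download it again. host_in and host_out are float[64*100*50]. */
size_t tensor_origin[3] = {0, 0, 0};
size_t mem_origin[3]    = {0, 0, 0};
size_t region[3]        = {64, 100, 50};
cl_int err;

/* Host -> tensor. mem_pitch == NULL means the host region is tightly
 * packed (pitches derived from region). buffer is NULL, so host_ptr is
 * the operand of the operation. */
err = clEnqueueExportToTensorEXP(queue, tensor, CL_TRUE,
                                 tensor_origin, mem_origin, region,
                                 NULL /* mem_pitch */,
                                 NULL /* buffer */, host_in,
                                 0, NULL, NULL);

/* Tensor -> host, e.g. after kernels or DBKs have updated the tensor. */
err = clEnqueueImportFromTensorEXP(queue, tensor, CL_TRUE,
                                   tensor_origin, mem_origin, region,
                                   NULL, NULL, host_out,
                                   0, NULL, NULL);
----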
To copy elements from one tensor to another use:

[source,c]
----
cl_int clEnqueueCopyTensorEXP(
  cl_command_queue command_queue,
  cl_tensor src_tensor,
  cl_tensor dst_tensor,
  const cl_tensor_shape* src_origin,
  const cl_tensor_shape* dst_origin,
  const cl_tensor_shape* region,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);
----

* _command_queue_ is a valid host command-queue in which the copy
  command will be queued. _command_queue_ and the tensors must be
  created with the same OpenCL context.

* _src_tensor_ and _dst_tensor_ refer to valid buffer objects created
  with CL_MEM_TENSOR_EXP. Tensor elements are copied from _src_tensor_
  to _dst_tensor_. The ranks of _src_tensor_ and _dst_tensor_ must
  match.

* _src_origin_ and _dst_origin_ define the origins of the copy region.
  The length of the arrays must be at least the tensors' rank.

* _region_ defines the extents of the slice being copied. The length
  of the arrays must be at least the tensors' rank.

* _event_wait_list_ and _num_events_in_wait_list_ specify events that
  need to complete before this particular command can be executed, as
  described for *clEnqueueImportFromTensorEXP*.

* _event_ returns an event object that identifies this copy command,
  as described for *clEnqueueImportFromTensorEXP*.

Elements are copied from the source tensor to the destination tensor
so that after completion the following condition holds, expressed in
pseudo C:
[source,c]
----
// 'so' and 'do' are aliases for src_origin and dst_origin, respectively.
tensor_element(dst_tensor, do[0] + i[0], do[1] + i[1], ..., do[N-1] + i[N-1])
==
tensor_element(src_tensor, so[0] + i[0], so[1] + i[1], ..., so[N-1] + i[N-1]);
----

Where N is the tensor rank and i[X] is a tensor coordinate with the
inclusive range 0..<region[X]-1>.

*clEnqueueCopyTensorEXP* returns CL_SUCCESS if the function is
executed successfully. Otherwise, it returns one of the following
errors:

* CL_INVALID_COMMAND_QUEUE if _command_queue_ is not a valid host
  command-queue.

* CL_INVALID_CONTEXT if the context associated with _command_queue_
  and the tensors are not the same or if the context associated with
  _command_queue_ and events in _event_wait_list_ are not the same.

* CL_INVALID_MEM_OBJECT if _src_tensor_ or _dst_tensor_ is not a valid
  buffer object created with CL_MEM_TENSOR_EXP.

* CL_INVALID_VALUE if _src_origin_, _dst_origin_ or _region_ is NULL.

* CL_INVALID_VALUE if _region[i]_ is zero for any i in [0, tensor_rank).

* CL_INVALID_VALUE if origin[i] + region[i] > tensor_shape[i] at any
  dimension i in range [0, tensor_rank).

* CL_INVALID_EVENT_WAIT_LIST if _event_wait_list_ is NULL and
  _num_events_in_wait_list_ > 0, or _event_wait_list_ is not NULL and
  _num_events_in_wait_list_ is 0, or if event objects in
  _event_wait_list_ are not valid events.

* CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST if the read and write
  operations are blocking and the execution status of any of the
  events in _event_wait_list_ is a negative integer value.

* CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate
  memory for the data store associated with the memory object the
  tensor is bound to.

* CL_OUT_OF_RESOURCES if there is a failure to allocate resources
  required by the OpenCL implementation on the device.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources
  required by the OpenCL implementation on the host.
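A short non-normative usage sketch of the copy command follows. It
assumes `src_tensor` and `dst_tensor` are handles to two tensors of
identical shape (64 x 100 x 50) and data type, and copies the first
batch slice between them.

[source,c]
----
/* Sketch: copy the 1 x 100 x 50 slice at batch index 0 from src_tensor
 * to dst_tensor. */
cl_tensor_shape src_origin[3] = {0, 0, 0};
cl_tensor_shape dst_origin[3] = {0, 0, 0};
cl_tensor_shape region[3]     = {1, 100, 50};

cl_int err = clEnqueueCopyTensorEXP(queue,
                                    src_tensor, dst_tensor,
                                    src_origin, dst_origin, region,
                                    0, NULL, NULL);
----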
(Add the following to Section 5.17.5, Recording Commands to a Command-Buffer)

If cl_khr_command_buffer is supported, then the following command
buffer counterparts of the *clEnqueueImportFromTensorEXP* and
*clEnqueueExportToTensorEXP* commands are available.

[source,c]
----
cl_int clCommandImportFromTensorEXP(
  cl_command_buffer_khr command_buffer,
  cl_command_queue command_queue,
  cl_tensor tensor,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  void* host_ptr,
  cl_uint num_sync_points_in_wait_list,
  const cl_sync_point_khr* sync_point_wait_list,
  cl_sync_point_khr* sync_point,
  cl_mutable_command_khr* mutable_handle);

cl_int clCommandExportToTensorEXP(
  cl_command_buffer_khr command_buffer,
  cl_command_queue command_queue,
  cl_tensor tensor,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  const void* host_ptr,
  cl_uint num_sync_points_in_wait_list,
  const cl_sync_point_khr* sync_point_wait_list,
  cl_sync_point_khr* sync_point,
  cl_mutable_command_khr* mutable_handle);
----

* _command_buffer_ refers to a valid command-buffer object.

* For the _command_queue_, _tensor_, _tensor_origin_, _mem_origin_,
  _region_, _mem_pitch_, _buffer_ and _host_ptr_ parameters, refer to
  *clEnqueueImportFromTensorEXP*.

* For the _num_sync_points_in_wait_list_, _sync_point_wait_list_,
  _sync_point_ and _mutable_handle_ parameters, refer to
  *clCommandCopyBufferKHR*.

*clCommandImportFromTensorEXP* and *clCommandExportToTensorEXP* return
CL_SUCCESS if the function is executed successfully. Otherwise, they
return one of the following errors:

* CL_INVALID_COMMAND_QUEUE if _command_queue_ is not NULL.

* CL_INVALID_COMMAND_BUFFER_KHR if _command_buffer_ is not a valid
  command-buffer.

* CL_INVALID_CONTEXT if the context associated with _command_queue_
  and _command_buffer_ is not the same.

* CL_INVALID_OPERATION if _command_buffer_ has been finalized.

* CL_INVALID_VALUE if _mutable_handle_ is not NULL.

* CL_INVALID_SYNC_POINT_WAIT_LIST_KHR if _sync_point_wait_list_ is
  NULL and _num_sync_points_in_wait_list_ is > 0, or
  _sync_point_wait_list_ is not NULL and
  _num_sync_points_in_wait_list_ is 0, or if synchronization-point
  objects in _sync_point_wait_list_ are not valid
  synchronization-points.

* CL_OUT_OF_RESOURCES if there is a failure to allocate resources
  required by the OpenCL implementation on the device.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources
  required by the OpenCL implementation on the host.
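The following non-normative sketch records a host-to-tensor upload into
a command buffer and replays it. The command-buffer management calls
(clCreateCommandBufferKHR, clFinalizeCommandBufferKHR,
clEnqueueCommandBufferKHR) come from cl_khr_command_buffer; `queue`,
`tensor` and `host_src` are assumed to exist as in the earlier
sketches.

[source,c]
----
/* Sketch: record a packed host -> tensor transfer into a command
 * buffer, then enqueue the command buffer for execution. */
cl_int err;
cl_command_buffer_khr cmdbuf =
  clCreateCommandBufferKHR(1, &queue, NULL, &err);

size_t tensor_origin[3] = {0, 0, 0};
size_t mem_origin[3]    = {0, 0, 0};
size_t region[3]        = {64, 100, 50};

/* command_queue must be NULL for the recording variant (see the error
 * list above). */
err = clCommandExportToTensorEXP(cmdbuf, NULL, tensor,
                                 tensor_origin, mem_origin, region,
                                 NULL /* packed host data */,
                                 NULL /* buffer */, host_src,
                                 0, NULL, NULL, NULL);

err = clFinalizeCommandBufferKHR(cmdbuf);

/* The recorded transfer (plus any other recorded commands) can now be
 * replayed as a unit, any number of times. */
err = clEnqueueCommandBufferKHR(0, NULL, cmdbuf, 0, NULL, NULL);
----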
(Add the following to new Section 5.X.Y, Tensor Descriptions)

The following structure describes the properties of a tensor to be
created with clCreateBufferWithProperties() using the
CL_MEM_TENSOR_EXP property:

[source,c]
----
typedef struct cl_tensor_desc_exp {
    cl_uint                   rank;
    cl_tensor_datatype        dtype;
    cl_tensor_properties_exp  properties[CL_TENSOR_DESC_MAX_PROPERTIES_EXP];
    cl_tensor_shape           shape[CL_TENSOR_DESC_MAX_RANK_EXP];
    const void*               layout;
    cl_tensor_layout_type_exp layout_type;
} cl_tensor_desc_exp;
----

* _rank_ defines the tensor's rank - the number of dimensions.

* _dtype_ defines the data type of the elements in the tensor.
  Possible types are listed in the tensor element type table.

* _properties_ is an optional list of properties for the tensor object
  and their corresponding values. The list is terminated with the
  special property 0. If no properties are required, _properties_ may
  be NULL. This extension does not define any optional properties for
  tensors, but future extensions may define properties.

* _shape_ defines the extents of the tensor's dimensions in number of
  elements.

* _layout_ points to an optional structure describing how tensor
  elements are laid out in the buffer memory. The structure must be of
  a type corresponding to the _layout_type_, as listed in the tensor
  layout type table. The pointer is ignored if the _layout_type_ is
  CL_TENSOR_LAYOUT_OPAQUE_EXP.

* _layout_type_ indicates the layout structure type that _layout_
  points to.
.Tensor element types. The API type indicates the corresponding type for copying elements from a host allocation / buffer object to a tensor or vice versa.
[cols="2,2,1",stripes=odd]
|===
| *Tensor element data type* | *Description* | *API type*

| CL_TENSOR_DTYPE_BOOL | Data type representing true or false. | cl_uchar. [1]
| CL_TENSOR_DTYPE_INT4_EXP | 4-bit signed integer. | cl_char.
| CL_TENSOR_DTYPE_INT8_EXP | 8-bit signed integer. | cl_char.
| CL_TENSOR_DTYPE_INT16_EXP | 16-bit signed integer. | cl_short.
| CL_TENSOR_DTYPE_INT32_EXP | 32-bit signed integer. | cl_int.
| CL_TENSOR_DTYPE_INT64_EXP | 64-bit signed integer. | cl_long.
| CL_TENSOR_DTYPE_UINT8_EXP | 8-bit unsigned integer. | cl_uchar.
| CL_TENSOR_DTYPE_UINT16_EXP | 16-bit unsigned integer. | cl_ushort.
| CL_TENSOR_DTYPE_UINT32_EXP | 32-bit unsigned integer. | cl_uint.
| CL_TENSOR_DTYPE_UINT64_EXP | 64-bit unsigned integer. | cl_ulong.
| CL_TENSOR_DTYPE_FP8_EXP | 8-bit floating-point. | cl_char.
| CL_TENSOR_DTYPE_FP16_EXP | Half precision floating-point. | cl_half.
| CL_TENSOR_DTYPE_BFLOAT16_EXP | 16-bit brain floating-point. | cl_ushort
| CL_TENSOR_DTYPE_FP32_EXP | Single precision floating-point. | cl_float.
| CL_TENSOR_DTYPE_FP64_EXP | Double precision floating-point. | cl_double.
| CL_TENSOR_DTYPE_COMPLEX64_EXP | 64-bit complex floating-point with
  32-bit real and imaginary part. | cl_float2
| CL_TENSOR_DTYPE_COMPLEX128_EXP | 128-bit complex floating-point with
  64-bit real and imaginary part. | cl_double2
|===
.Optional tensor memory layout types
[cols="2,2,2",stripes=odd]
|===
| *Layout type* | *Tensor layout structure* | *Description*

| CL_TENSOR_LAYOUT_OPAQUE_EXP | N/A | The tensor does not have an
  application-defined memory layout. The driver controls the tensor's
  layout. To read or write elements of the tensor, use the tensor
  import and export functions.

| CL_TENSOR_LAYOUT_BLAS_EXP | cl_tensor_layout_blas_exp | A type that
  describes a packed memory layout similar to ones used in BLAS APIs.

| CL_TENSOR_LAYOUT_BLAS_PITCHED_EXP | cl_tensor_layout_blas_pitched_exp
| A type that describes a pitched memory layout similar to ones used
  in BLAS APIs.

| CL_TENSOR_LAYOUT_ML_EXP | cl_tensor_layout_ml_exp | A convenience
  layout type over CL_TENSOR_LAYOUT_BLAS_EXP.
|===
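As a small illustration of the convenience layout, the following
non-normative sketch describes a rank-2 tensor using the predefined NC
layout instead of spelling out a BLAS layout by hand; the NC / {1}
equivalence is taken from the ML layout table later in this section.

[source,c]
----
/* Sketch: request the predefined NC layout for a rank-2 tensor.
 * Equivalent to a cl_tensor_layout_blas_exp with leading_dims = {1}. */
cl_tensor_layout_ml_exp nc_layout = { CL_TENSOR_LAYOUT_ML_NC_EXP };

cl_tensor_desc_exp desc = {0};
desc.rank = 2;
desc.dtype = CL_TENSOR_DTYPE_FP32_EXP;
desc.shape[0] = 100;   /* N */
desc.shape[1] = 200;   /* C */
desc.layout = &nc_layout;
desc.layout_type = CL_TENSOR_LAYOUT_ML_EXP;
----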
(Add the following to new Section 5.X.Y.1, BLAS Tensor Layout)

The following structures describe packed / pitched BLAS-like memory
layouts for the tensor:

[source,c]
----
typedef struct cl_tensor_layout_blas_exp {
    cl_tensor_dim_exp    leading_dims[CL_TENSOR_DESC_MAX_RANK_EXP];
} cl_tensor_layout_blas_exp;

typedef struct cl_tensor_layout_blas_pitched_exp {
    cl_tensor_dim_exp    leading_dims[CL_TENSOR_DESC_MAX_RANK_EXP];
    cl_tensor_pitch      leading_pitches[CL_TENSOR_DESC_MAX_RANK_EXP];
} cl_tensor_layout_blas_pitched_exp;

typedef struct cl_tensor_layout_ml_exp {
    cl_tensor_layout_ml_type_exp ml_type;
} cl_tensor_layout_ml_exp;
----

* _leading_dims_ describes the order in which the tensor dimensions
  are laid out in memory. _leading_dims[0]_ points to the dimension
  whose elements are laid out first, followed by elements along the
  dimension given by _leading_dims[1]_, and so on. The first N values,
  where N is the tensor's rank, must be unique and within the range
  [0, tensor_rank).

* _leading_pitches_ describes the distance from an element to the next
  one along the leading dimensions in _leading_dims_. The distance is
  measured in number of elements. The values of the array must be
  non-zero for the first tensor rank minus one elements, and the
  following conditions must hold:

** leading_pitches[0] >= tensor_shape[leading_dims[0]] if the tensor
   rank is greater than one, and

** leading_pitches[i + 1] >= tensor_shape[leading_dims[i]] *
   leading_pitches[i] for i in [0, tensor_rank - 1) if the tensor rank
   is greater than two.

* _ml_type_ defines the memory layout via enumerators which correspond
  to predefined configurations of the cl_tensor_layout_blas_exp
  structure, as listed in the ML tensor layout type table.
cl_int clCommandImportFromTensorKHR(
-  cl_command_buffer_khr command_buffer,
-  cl_command_queue command_queue,
-  cl_tensor tensor,
-  const size_t* tensor_origin,
-  const size_t* mem_origin,
-  const size_t* region,
-  const size_t* mem_pitch,
-  cl_mem buffer,
-  void* host_ptr,
-  cl_uint num_sync_points_in_wait_list,
-  const cl_sync_point_khr* sync_point_wait_list,
-  cl_sync_point_khr* sync_point,
-  cl_mutable_command_khr* mutable_handle);
-
+

The memory layout descriptions map tensor coordinates to buffer’s +memory byte locations respect to buffer’s base address as followed in +pseudo C:

-
cl_int clCommandExportToTensorKHR(
-  cl_command_buffer_khr command_buffer,
-  cl_command_queue command_queue,
-  cl_tensor tensor,
-  const size_t* tensor_origin,
-  const size_t* mem_origin,
-  const size_t* region,
-  const size_t* mem_pitch,
-  cl_mem buffer,
-  const void* host_ptr,
-  cl_uint num_sync_points_in_wait_list,
-  const cl_sync_point_khr* sync_point_wait_list,
-  cl_sync_point_khr* sync_point,
-  cl_mutable_command_khr* mutable_handle);
-
+
size_t index = 0;
+for (unsigned i = 0; i < tensor_rank - 1; i++)
+  index += tensor_coordinates[leading_dims[i]] * pitches[i];
+buffer_offset = index * tensor_element_size;
-
-
    -
  • -

    command_buffer refers to valid command-buffer object.

    -
  • -
  • -

    For command_queue, tensor, tensor_origin, mem_origin, -region, mem_pitch, buffer and host_ptr parameters refer to -clEnqueueImportFromTensor.

    -
  • -
  • -

    For num_sync_points_in_wait_list, sync_point_wait_list, -sync_point, mutable_handle parameters refer to -clCommandCopyBufferKHR.

    -
  • -
-
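For illustration, a minimal sketch of recording an export-to-tensor command followed by a kernel that consumes the tensor. The names cmd_buf, tensor, kernel and host_data are assumed to be created elsewhere and error handling is omitted; this is illustrative only:

[source,c]
----
/* Minimal sketch: record a host-to-tensor transfer and a dependent kernel
 * into a command buffer. cmd_buf, tensor, kernel and host_data are assumed
 * to exist; error handling is omitted for brevity. */
size_t origin[3] = {0, 0, 0};
size_t region[3] = {64, 100, 50};
size_t global_size[3] = {64, 100, 50};
cl_sync_point_khr export_done;

clCommandExportToTensorKHR(
  cmd_buf, NULL /* command_queue must be NULL here */, tensor,
  origin, origin, region,
  NULL /* mem_pitch: tightly packed */, NULL /* buffer */, host_data,
  0, NULL, &export_done, NULL);

clCommandNDRangeKernelKHR(
  cmd_buf, NULL, NULL /* properties */, kernel, 3,
  NULL /* global offset */, global_size, NULL /* local size */,
  1, &export_done, NULL, NULL);
----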

clCommandImportFromTensorKHR and clCommandExportToTensorKHR return CL_SUCCESS if the function is executed successfully. Otherwise, they return one of the following errors:

* CL_INVALID_COMMAND_QUEUE if command_queue is not NULL.

* CL_INVALID_COMMAND_BUFFER_KHR if command_buffer is not a valid command-buffer.

* CL_INVALID_CONTEXT if the context associated with command_queue and command_buffer is not the same.

* CL_INVALID_OPERATION if command_buffer has been finalized.

* CL_INVALID_VALUE if mutable_handle is not NULL.

* CL_INVALID_SYNC_POINT_WAIT_LIST_KHR if sync_point_wait_list is NULL and num_sync_points_in_wait_list is > 0, or sync_point_wait_list is not NULL and num_sync_points_in_wait_list is 0, or if synchronization-point objects in sync_point_wait_list are not valid synchronization-points.

* CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.

The cl_tensor_layout_blas_exp and cl_tensor_layout_blas_pitched_exp structures describe a packed or pitched BLAS-like memory layout for a tensor through the following fields:

* leading_dims describes which elements along the tensor dimensions are laid out in memory. leading_dims[0] points to the dimension whose elements are laid out first, followed by elements along the dimension given by leading_dims[1] and so on. The first N elements must be non-zero, where N is the tensor's rank, and the values must be unique and within the range [0, tensor_rank).

* leading_pitches describes the distance from an element to the next one along the leading dimensions in leading_dims. The distance is measured in number of elements. The first N elements of the array must be non-zero, where N is the tensor's rank minus one, and the following conditions must hold:

** leading_pitches[0] >= tensor_shape[leading_dims[0]] if the tensor rank is greater than one, and

** leading_pitches[i + 1] >= tensor_shape[leading_dims[i]] * leading_pitches[i] for i in [0, tensor_rank - 1) if the tensor rank is greater than two.

* ml_type defines the memory layout via enumerators which correspond to predefined configurations of the cl_tensor_layout_blas_exp structure, as listed in the ML tensor layout type table below.

The memory layout descriptions map tensor coordinates to byte locations in the buffer's memory, relative to the buffer's base address, as in the following pseudo C code:

[source,c]
----
size_t index = 0;
for (unsigned i = 0; i < tensor_rank - 1; i++)
  index += tensor_coordinates[leading_dims[i]] * pitches[i];
buffer_offset = index * tensor_element_size;
----

where pitches[i] equals:

* leading_pitches[i] for cl_tensor_layout_blas_pitched_exp, and

* tensor_shape[leading_dims[i]] * tensor_shape[leading_dims[i-1]] * ... * tensor_shape[leading_dims[0]] for cl_tensor_layout_blas_exp.

Table 3. ML tensor layout types and their corresponding cl_tensor_layout_blas_exp configuration.

|===
| ML layout type | Equivalent leading_dims configuration

| CL_TENSOR_LAYOUT_ML_C_EXP    | {}
| CL_TENSOR_LAYOUT_ML_NC_EXP   | {1}
| CL_TENSOR_LAYOUT_ML_CN_EXP   | {0}
| CL_TENSOR_LAYOUT_ML_HW_EXP   | {1}
| CL_TENSOR_LAYOUT_ML_CHW_EXP  | {2, 1}
| CL_TENSOR_LAYOUT_ML_NCHW_EXP | {3, 2, 1}
| CL_TENSOR_LAYOUT_ML_NHWC_EXP | {1, 3, 2}
|===
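As a concrete illustration of the table above: for a rank-4 NCHW tensor, CL_TENSOR_LAYOUT_ML_NCHW_EXP corresponds to leading_dims = {3, 2, 1}, so the W dimension (index 3) is laid out first and is contiguous. The sketch below shows the resulting packed offset computation under the assumption that the remaining dimension (N, index 0) varies slowest; it is illustrative only and not normative text:

[source,c]
----
/* Packed NCHW offset sketch, assuming leading_dims = {3, 2, 1} as in the
 * CL_TENSOR_LAYOUT_ML_NCHW_EXP row above and that the remaining dimension
 * (N) is the slowest-varying one. */
#include <stddef.h>

static size_t nchw_byte_offset(const size_t shape[4],  /* {N, C, H, W} */
                               const size_t coord[4],  /* {n, c, h, w} */
                               size_t element_size)
{
  size_t W = shape[3], H = shape[2], C = shape[1];
  /* W elements first (stride 1), then H (stride W), then C (stride H*W),
   * then N (stride C*H*W). */
  size_t index = coord[3]
               + coord[2] * W
               + coord[1] * (H * W)
               + coord[0] * (C * H * W);
  return index * element_size;
}
----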

Add New Buffer Property in Section 5.2.1

|===
| Buffer Property | Property Value | Description

| CL_MEM_COMMAND_BUFFER_TEMPORARY | cl_bool a|
This property can be set if the cl_khr_command_buffer extension is supported.

If the value is true, create a "temporary" buffer object that can only be used by commands recorded in command buffers. Non-recording command enqueue functions must return CL_INVALID_OPERATION if the command refers to a temporary buffer object.

The temporary buffer objects are managed by command buffers. When a temporary buffer object is used by multiple command buffers, the object receives disjoint storage for each command buffer.

Storage for the temporary buffer objects may be allocated on an on-demand basis. At times when the buffer is not needed, OpenCL implementations may reuse the storage for other tasks within the command buffer.

Contents of the temporary buffers are not guaranteed to be preserved across command buffer executions.

NOTE: This property temporarily lives here and will be moved to a separate extension proposal.

| CL_MEM_BIND_TO_TENSOR | cl_tensor a|
Use the created buffer as storage for the given valid tensor. For the buffer creation to succeed, the target tensor may not have storage already, and the size argument of clCreateBufferWithProperties() must be zero.

The size of the memory buffer is implementation-defined and can be queried with clGetTensorInfo().

The memory layout of the tensor in the created memory buffer is implementation-defined and opaque to applications, and it may change at unspecified points. The implementation may use non-contiguous allocations to store the tensor data and may store auxiliary data within the allocations. Therefore, reading from or writing to the memory buffer directly through the cl_mem handle leads to undefined behavior.

If the tensor is already bound to a buffer object, the clCreateBufferWithProperties call returns the CL_TENSOR_BOUND_TO_BUFFER error code.
|===
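For illustration, a minimal sketch of binding a buffer object to an existing tensor with CL_MEM_BIND_TO_TENSOR. The names ctx and tensor are assumed to exist, the property-value cast is an assumption about how a cl_tensor handle is passed in a property list, and error handling is omitted:

[source,c]
----
/* Minimal sketch: bind a buffer object to a previously created tensor.
 * ctx and tensor are assumed to exist; error handling is omitted. */
cl_int err;
cl_mem_properties props[] = {
  CL_MEM_BIND_TO_TENSOR, (cl_mem_properties)(uintptr_t)tensor, 0};
cl_mem tensor_storage = clCreateBufferWithProperties(
  ctx, props, CL_MEM_READ_WRITE,
  0 /* size must be zero for CL_MEM_BIND_TO_TENSOR */, NULL, &err);

/* A second attempt to bind the same tensor would fail with
 * CL_TENSOR_BOUND_TO_BUFFER (see the new error codes below). */
----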

Add New Memory Object Query in Section 5.5.5

|===
| cl_mem_info | Return Type | Description

| CL_MEM_COMMAND_BUFFER_TEMPORARY | cl_bool a|
This property can be queried if the cl_khr_command_buffer extension is supported.

Returns true if memobj is a temporary buffer object for command buffers.
|===

Add New Error Codes in Appendix F

|===
| Error Code | Description

| CL_TENSOR_BOUND_TO_BUFFER | Returned when attempting to bind a buffer object to a tensor which has already been bound to the same or another buffer object.

| CL_INVALID_TENSOR | Returned when the specified tensor is not a valid tensor object.
|===
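The new memory object query can be used, for example, to check whether a buffer is managed by command buffers. A minimal sketch, assuming memobj is a valid buffer object and omitting error handling:

[source,c]
----
/* Minimal sketch: query whether memobj is a command-buffer temporary.
 * memobj is assumed to be a valid cl_mem. */
cl_bool is_temporary = CL_FALSE;
clGetMemObjectInfo(memobj, CL_MEM_COMMAND_BUFFER_TEMPORARY,
                   sizeof(is_temporary), &is_temporary, NULL);
if (is_temporary) {
  /* The buffer can only be used by commands recorded in command buffers. */
}
----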

Sample Codes

Helper functions used in the follow-up tensor code samples:

[source,c]
----
cl_kernel create_matmul_kernel(
  cl_context ctx, std::span<cl_device_id> device_span,
  cl_tensor lhs, cl_tensor rhs, cl_tensor out) {
  // A hypothetical matmul kernel signature in pseudo OpenCL C for
  // illustrative purposes:
  //
  //   kernel void matmul(global read_only tensor_t, global read_only tensor_t,
  //                      global write_only tensor_t);

  cl_kernel matmul_kernel = /* Omitted. */;
  clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor), &lhs);
  clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor), &rhs);
  clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor), &out);
  return matmul_kernel;
}

cl_kernel create_add_kernel(
  cl_context ctx, std::span<cl_device_id> device_span,
  cl_tensor lhs, cl_tensor rhs, cl_tensor out) {
  // A hypothetical add kernel signature in pseudo OpenCL C for illustrative
  // purposes:
  //
  //   kernel void add(global read_only tensor_t, global read_only tensor_t,
  //                   global write_only tensor_t);

  cl_kernel add_kernel = /* Omitted. */;
  clSetKernelArg(add_kernel, 0, sizeof(cl_tensor), &lhs);
  clSetKernelArg(add_kernel, 1, sizeof(cl_tensor), &rhs);
  clSetKernelArg(add_kernel, 2, sizeof(cl_tensor), &out);
  return add_kernel;
}
----

An example usage of tensors on a command queue:

[source,c]
----
constexpr size_t b = 64, m = 100, n = 200, k = 50;

cl_int err;
cl_tensor in0 = clCreateTensor(ctx, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, &err);
cl_tensor in1 = clCreateTensor(ctx, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, &err);
cl_tensor in2 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor t0  = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor out = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);

cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0);
cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out);

// Allocate storage for the tensors. The buffer size must be set to
// zero when the buffer is bound to a tensor. The OpenCL implementation
// may determine the optimal data layout and the storage needed for it,
// based on the tensor's uses (the 'matmul' and 'add' kernels in this
// sample) so far.
cl_mem in0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, in0, 0}, CL_MEM_READ_ONLY,
  0 /* must be zero for CL_MEM_BIND_TO_TENSOR. */, nullptr, &err);
cl_mem in1_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, in1, 0}, CL_MEM_READ_ONLY,
  0, nullptr, &err);
cl_mem in2_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, in2, 0}, CL_MEM_READ_ONLY,
  0, nullptr, &err);
cl_mem t0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE,
  0, nullptr, &err);
cl_mem out_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, out, 0}, CL_MEM_WRITE_ONLY,
  0, nullptr, &err);

std::vector<float> in0_data = ...;
std::vector<float> in1_data = ...;
std::vector<float> out_data(b * m * n);

// Copies data into the in0 tensor while possibly rearranging the data to the
// optimal data layout.
clEnqueueExportToTensor(
  cmd_q, in0, false, {0, 0, 0}, {0, 0, 0}, {b, m, k},
  nullptr, nullptr, in0_data.data(), 0, nullptr, nullptr);
clEnqueueExportToTensor(
  cmd_q, in1, false, {0, 0, 0}, {0, 0, 0}, {b, k, n},
  nullptr, nullptr, in1_data.data(), 0, nullptr, nullptr);
clEnqueueNDRangeKernel(
  cmd_q, matmul_kernel, 3, matmul_grid, nullptr, nullptr, 0, nullptr, nullptr);
clEnqueueNDRangeKernel(
  cmd_q, add_kernel, 3, add_grid, nullptr, nullptr, 0, nullptr, nullptr);
clEnqueueImportFromTensor(
  cmd_q, out, false, {0, 0, 0}, {0, 0, 0}, {b, m, n},
  nullptr, nullptr, out_data.data(), 0, nullptr, nullptr);
----

An example use of tensors in a command buffer when the cl_khr_command_buffer extension is supported:

[source,c]
----
constexpr size_t b = 64, m = 100, n = 200, k = 50;

cl_int err;
cl_tensor in0 = clCreateTensor(ctx, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, &err);
cl_tensor in1 = clCreateTensor(ctx, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, &err);
cl_tensor in2 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor t0  = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor out = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);

cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0);
cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out);

// Bind command buffer managed storage to the tensors.
//
// NOTE: the same temporary tensor handle used in multiple command buffers
//       will have separate storage. IOW, command buffers may not exchange
//       data between them via temporary buffers.
cl_mem in0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in0, 0},
  CL_MEM_READ_ONLY, 0 /* must be zero for CL_MEM_BIND_TO_TENSOR. */,
  nullptr, &err);
cl_mem in1_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in1, 0},
  CL_MEM_READ_ONLY, 0, nullptr, &err);
cl_mem in2_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in2, 0},
  CL_MEM_READ_ONLY, 0, nullptr, &err);
cl_mem t0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, t0, 0},
  CL_MEM_READ_WRITE, 0, nullptr, &err);
cl_mem out_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, out, 0},
  CL_MEM_WRITE_ONLY, 0, nullptr, &err);

std::vector<float> in0_data = ...;
std::vector<float> in1_data = ...;
std::vector<float> out_data(b * m * n);

cl_command_buffer_khr cmd_b =
  clCreateCommandBufferKHR(num_queues, queue_list, nullptr, &err);

cl_sync_point_khr in0_syncp, in1_syncp, matmul_syncp, add_syncp;
clCommandExportToTensorKHR(
  cmd_b, cmd_q, in0, {0, 0, 0}, {0, 0, 0}, {b, m, k},
  nullptr, nullptr, in0_data.data(), 0, nullptr, &in0_syncp);
clCommandExportToTensorKHR(
  cmd_b, cmd_q, in1, {0, 0, 0}, {0, 0, 0}, {b, k, n},
  nullptr, nullptr, in1_data.data(), 0, nullptr, &in1_syncp);
clCommandNDRangeKernelKHR(
  cmd_b, cmd_q, nullptr, matmul_kernel, 3, matmul_grid, nullptr, nullptr,
  2, {in0_syncp, in1_syncp}, &matmul_syncp, nullptr);
clCommandNDRangeKernelKHR(
  cmd_b, cmd_q, nullptr, add_kernel, 3, add_grid, nullptr, nullptr,
  1, {matmul_syncp}, &add_syncp, nullptr);
clCommandImportFromTensorKHR(
  cmd_b, cmd_q, out, {0, 0, 0}, {0, 0, 0}, {b, m, n},
  nullptr, nullptr, out_data.data(), 1, {add_syncp}, nullptr);

// Finalize the command buffer. At this point the OpenCL
// implementation may reserve enough storage for all the tensor
// temporaries. Temporary tensors might be eliminated - for example,
// the OpenCL implementation could use the 'out' tensor to store the result
// of matmul_kernel, thus eliminating the need for the 't0' tensor.
clFinalizeCommandBufferKHR(cmd_b);

// Temporary tensors used in a command buffer can't be read or written
// into. A hypothetical reason is that the finalized command buffer
// might not use some of the tensors.
assert(clEnqueueImportFromTensor(..., t0, ...) == CL_INVALID_OPERATION);
----

An example usage of tensors created with tensor descriptors (cl_tensor_desc_exp):

[source,c]
----
constexpr size_t b = 64, m = 100, n = 200, k = 50;

std::vector<float> in0_data = ...;
std::vector<float> in1_data = ...;
std::vector<float> out_data(b * m * n);

// Create a tensor with an opaque layout.
cl_tensor_desc_exp in0_desc;
in0_desc.rank = 3;
in0_desc.properties[0] = 0;
in0_desc.shape[0] = b;
in0_desc.shape[1] = m;
in0_desc.shape[2] = k;
in0_desc.layout = nullptr;
in0_desc.layout_type = CL_TENSOR_LAYOUT_OPAQUE_EXP;

cl_int err;
cl_mem in0_tensor = clCreateBufferWithProperties(
  ctx, {CL_MEM_TENSOR_EXP, in0_desc, 0},
  CL_MEM_READ_ONLY, 0, nullptr, &err);

// Create a tensor from a host allocation using an application-defined
// layout description for mapping elements to the tensor.
cl_tensor_desc_exp in1_desc;
in1_desc.rank = 3;
in1_desc.properties[0] = 0;
in1_desc.shape[0] = b;
in1_desc.shape[1] = k;
in1_desc.shape[2] = n;

cl_tensor_layout_blas_exp col_major;
col_major.leading_dims[0] = 1;
col_major.leading_dims[1] = 2;
in1_desc.layout = &col_major;
in1_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;

cl_mem in1_tensor = clCreateBufferWithProperties(
  ctx, {CL_MEM_TENSOR_EXP, in1_desc, 0},
  CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, in1_data.data(), &err);

// Create another tensor with an application-defined layout.
cl_tensor_desc_exp out_desc;
out_desc.rank = 3;
out_desc.properties[0] = 0;
out_desc.shape[0] = b;
out_desc.shape[1] = m;
out_desc.shape[2] = n;

cl_tensor_layout_blas_exp row_major;
row_major.leading_dims[0] = 2;
row_major.leading_dims[1] = 1;
out_desc.layout = &row_major;
out_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;

cl_mem out_tensor = clCreateBufferWithProperties(
  ctx, {CL_MEM_TENSOR_EXP, out_desc, 0},
  CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, 0, out_data.data(), &err);

// Create a kernel that operates on the tensors and is possibly
// optimized for them via a yet-to-be-realized API extension.
cl_kernel batched_matmul_kernel = create_batched_matmul_kernel(
  ctx, device_span, in0_desc, in1_desc, out_desc);

clSetKernelArg(batched_matmul_kernel, 0, sizeof(cl_mem), &in0_tensor);
clSetKernelArg(batched_matmul_kernel, 1, sizeof(cl_mem), &in1_tensor);
clSetKernelArg(batched_matmul_kernel, 2, sizeof(cl_mem), &out_tensor);

// Required command for transferring data to layout-opaque tensors and
// from them elsewhere.
clEnqueueExportToTensor(
  cmd_q, in0_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, k},
  nullptr, nullptr, in0_data.data(), 0, nullptr, nullptr);

clEnqueueNDRangeKernel(
  cmd_q, batched_matmul_kernel, 3, matmul_grid, nullptr, nullptr,
  0, nullptr, nullptr);

clEnqueueMapBuffer(
  cmd_q, out_tensor, CL_TRUE, CL_MAP_READ, 0, b * m * n * sizeof(float),
  0, nullptr, nullptr, &err);
----

Open Questions

* RESOLVED: OpenCL C support for tensors can be introduced later in a separate extension. Built-in kernels may benefit from this extension as it is.

* What is the use case of cl_tensor_layout_blas_pitch_exp? UNRESOLVED.

* Should image types be extended instead of adding a separate tensor type? UNRESOLVED.

Version History

|====
| Version | Date | Author | Changes

| 0.1.0 | 2023-11-23 | Henry Linjamäki | Initial revision.

| 0.2.0 | 2024-8-14 | Henry Linjamäki a|
* Reworked the document structure to match the cl_khr_extension_template.
* Added clEnqueueCopyTensor.
* Added an API for setting the memory layout of tensors.
|====

1. Zero and non-zero bytes are interpreted as false and true values, respectively.
    From 4586eefc17fcbf38cce4bdc6ef29360ff16a1cda Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 15 Aug 2024 12:21:05 +0300 Subject: [PATCH 21/26] Update extensions/cl_exp_tensor.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Pekka Jääskeläinen --- extensions/cl_exp_tensor.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/extensions/cl_exp_tensor.asciidoc b/extensions/cl_exp_tensor.asciidoc index 002ff366..774c1857 100644 --- a/extensions/cl_exp_tensor.asciidoc +++ b/extensions/cl_exp_tensor.asciidoc @@ -6,7 +6,7 @@ = cl_exp_tensor -This extension provides new buffer abstraction - tensor objects - for +This extension provides a new buffer abstraction, tensor objects, for managing N-dimensional data. == XXX - Not complete yet!!! From 19116009dfa676d948adace46eca4a222c5f74a0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 15 Aug 2024 12:31:26 +0300 Subject: [PATCH 22/26] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Pekka Jääskeläinen --- extensions/cl_exp_tensor.asciidoc | 56 +++++++++++++++---------------- 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/extensions/cl_exp_tensor.asciidoc b/extensions/cl_exp_tensor.asciidoc index 774c1857..5c33f277 100644 --- a/extensions/cl_exp_tensor.asciidoc +++ b/extensions/cl_exp_tensor.asciidoc @@ -22,7 +22,7 @@ TODO == Contributors Henry Linjamäki, Intel. + -Pekka Jääslkeläinen, Intel and Tampere University. + +Pekka Jääskeläinen, Intel. + Ben Ashbaugh, Intel. + == Notice @@ -46,30 +46,30 @@ This extension requires OpenCL 1.2 or later. == Overview -The extension provides new tensor object abstraction. Tensor objects -are similar to image types in regard they represents N-dimensional -data of some application chosen data type and they may be mapped to -dedicated hardware except that +The extension provides a new tensor object abstraction. Tensor objects +are similar to image types in regard that they represent N-dimensional +data of an application chosen data type and they may be mapped to +dedicated hardware, with the following key differences: -* higher than 3-dimensional data can be supported (limited by +* Higher than 3-dimensional data can be supported (limited by devices' capabilities). -* applications may choose how the data elements of the tensors are +* Applications may choose how the data elements of the tensors are laid out in the buffers using the tensor layout descriptions provided in this extension. -Applications may also choose the memory layouts of the tensors be +Applications may also choose the memory layouts of the tensors to be implementation-specified, letting the driver to optimize the tensor data layout for better performance or to lay out the data as required by -hardware functions (e.g. exposed via builtin kernels). +hardware accelerated functions (e.g. exposed via builtin kernels). -The scope of this extension to provide host APIs for creating tensor +The scope of this extension is to provide host APIs for creating tensor objects and transfer data between tensors, host and other memory objects. A separate extension implemented on top of this extension, -cl_exp_defined_builtin_kernels which provides "defined built-in -kernels" (DKBs) which operates on tensors. 
It also provides mechanism +cl_exp_defined_builtin_kernels provides "defined built-in +kernels" (DKBs) which can operate on tensors. It also provides mechanism for drivers to create DBKs that are optimized for the tensor arguments they operate on. @@ -184,7 +184,7 @@ typedef struct cl_tensor_layout_ml_exp { == New API Enums -Accepted value for _properties_ parameter to +Accepted value for the _properties_ parameter to *clCreateBufferWithProperties* for creating a tensor object: [source,c] @@ -812,13 +812,13 @@ and true values, respectively.] |=== | *layout type* | *tensor layout type* | *Description* -| CL_TENSOR_LAYOUT_OPAQUE_EXP | N/A | The tensor don't have application +| CL_TENSOR_LAYOUT_OPAQUE_EXP | N/A | The tensor doesn't have application defined memory layout. Driver controls the tensors layout. To read or write elements of the tensor | CL_TENSOR_LAYOUT_BLAS_EXP |<> -| A type that describe packed memory layout similar ones used in BLAS APIs. +| A type that describes a packed memory layout similar ones used in BLAS APIs. | CL_TENSOR_LAYOUT_BLAS_EXP |<> @@ -837,7 +837,7 @@ A convenience layout type over `CL_TENSOR_LAYOUT_BLAS_EXP`. (Add the following to new Section 5.X.Y.1, *BLAS Tensor Layout*) :: + -- -The following structures describes packed / pitched BLAS-like memory +The following structures describe packed / pitched BLAS-like memory layout for the tensor: [source,c] @@ -857,13 +857,13 @@ typedef struct cl_tensor_layout_ml_exp { ---- * _leading_dims_ describes which elements along the tensor dimension - are laid out in the memory. `leading_dims[0]` point to dimension + are laid out in the memory. `leading_dims[0]` points to the dimension whose elements are laid out first, followed by elements along - dimension by `leading_dims[1]` and so on. The first N elements must - be non-zero where N is tensor's rank and the values must be unique + the dimension by `leading_dims[1]` and so on. The first N elements must + be non-zero where N is a tensor's rank and the values must be unique and within range `[0, tensor_rank)`. -* _leading_pitches_ describes distance between from an element to the +* _leading_pitches_ describes the distance between an element to the next one for the leading dimensions in _leading_dims_. The distance is measured in number of elements. The first N elements must be non-zero where the N is tensor's rank minus one. The values of the @@ -880,7 +880,7 @@ typedef struct cl_tensor_layout_ml_exp { // ^ This condition is meant to ensure that the tensor elements at different // coordinates don't alias. -* _ml_type_ defines memory layout via enumerators which corresponds to +* _ml_type_ defines the memory layout via enumerators which corresponds to predefined configurations of `cl_tensor_layout_blas_exp` structure as listed in <> table. @@ -933,7 +933,7 @@ std::vector in0_data = ...; std::vector in1_data = ...; std::vector out_data(b * m * n); -// Create tensor with opaque layout. +// Create a tensor with an opaque layout. cl_tensor_desc_exp in0_desc; in0_desc.rank = 3; in0_desc.properties[0] = 0; @@ -948,7 +948,7 @@ cl_mem in0_tensor = clCreateBufferWithProperties( ctx, {CL_MEM_TENSOR_EXP, in0_desc, 0}, CL_MEM_READ_ONLY, 0, nullptr, &err); -// Create tensor from a host allocation using an application defined +// Create tensor from a host allocation using an application-defined // layout description for mapping elements to the tensor. 
cl_tensor_desc_exp in1_desc; in1_desc.rank = 3; @@ -967,7 +967,7 @@ cl_mem in1_tensor = clCreateBufferWithProperties( ctx, {CL_MEM_TENSOR_EXP, in1_desc, 0}, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, in1_data.data(), &err); -// Create another tensor with application defined layout. +// Create another tensor with an application-defined layout. cl_tensor_desc_exp out_desc; out_desc.rank = 3; out_desc.properties[0] = 0; @@ -995,7 +995,7 @@ clSetKernelArg(batched_matmul_kernel, 1, sizeof(cl_mem), &in1_tensor); clSetKernelArg(batched_matmul_kernel, 2, sizeof(cl_mem), &out_tensor); // Required command for transferring data to layout-opaque tensors and -// from it to elsewhere. +// from it elsewhere. clEnqueueExportToTensor( cmd_q, in0_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, k}, nullptr, nullptr, in0_data.data(), 0, nullptr, nullptr); @@ -1008,9 +1008,9 @@ clEnqueueMapBuffer( ---- -== Issues +== Issues and Open Questions -. Should we have support for tensors with undefined shape and tensors +. Should we support tensors with undefined shape and tensors with unknown / symbolic dimension sizes like in ONNX? + -- @@ -1053,6 +1053,6 @@ clEnqueueMapBuffer( * Added clEnqueueCopyTensor. -* Added API for setting memory layout for tensors. +* Added an API for setting the memory layout for tensors. |==== From af6d58c249f2cd7729f75e9c63f19b355b570c11 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 15 Aug 2024 13:47:03 +0300 Subject: [PATCH 23/26] Address some feedback, fix formatting --- extensions/cl_exp_tensor.asciidoc | 44 ++++++++++++++++++++++--------- 1 file changed, 32 insertions(+), 12 deletions(-) diff --git a/extensions/cl_exp_tensor.asciidoc b/extensions/cl_exp_tensor.asciidoc index 5c33f277..f7a2b9c0 100644 --- a/extensions/cl_exp_tensor.asciidoc +++ b/extensions/cl_exp_tensor.asciidoc @@ -220,7 +220,8 @@ CL_TENSOR_DTYPE_UINT16_EXP 0x???? CL_TENSOR_DTYPE_UINT32_EXP 0x???? CL_TENSOR_DTYPE_UINT64_EXP 0x???? -CL_TENSOR_DTYPE_FP8_EXP 0x???? +CL_TENSOR_DTYPE_FP8E4M3_EXP 0x???? +CL_TENSOR_DTYPE_FP8E5M2_EXP 0x???? CL_TENSOR_DTYPE_FP16_EXP 0x???? CL_TENSOR_DTYPE_FP32_EXP 0x???? CL_TENSOR_DTYPE_FP64_EXP 0x???? @@ -795,7 +796,17 @@ and true values, respectively.] | CL_TENSOR_DTYPE_UINT16_EXP | 16-bit unsigned integer. | cl_ushort. | CL_TENSOR_DTYPE_UINT32_EXP | 32-bit unsigned integer. | cl_uint. | CL_TENSOR_DTYPE_UINT64_EXP | 64-bit unsigned integer. | cl_ulong. -| CL_TENSOR_DTYPE_FP8_EXP | Half precision floating-point. | cl_char. + +| CL_TENSOR_DTYPE_FP8E4M3_EXP | 8-bit floating point with a sign bit, + 4 exponent bits, 3 mantissa bits and a exponent bias of 7. +| cl_char. + +| CL_TENSOR_DTYPE_FP8E5M2_EXP | 8-bit floating point with a sign bit, + 5 exponent bits, 2 mantissa bits and a exponent bias of 15. +| cl_char. + +// Reference: https://arxiv.org/pdf/2209.05433 + | CL_TENSOR_DTYPE_FP16_EXP | Half precision floating-point. | cl_half. | CL_TENSOR_DTYPE_BFLOAT16_EXP | 16-bit brain floating-point. | cl_ushort | CL_TENSOR_DTYPE_FP32_EXP | Single precision floating-point. | cl_float. @@ -812,20 +823,29 @@ and true values, respectively.] |=== | *layout type* | *tensor layout type* | *Description* -| CL_TENSOR_LAYOUT_OPAQUE_EXP | N/A | The tensor doesn't have application - defined memory layout. Driver controls the tensors layout. To read - or write elements of the tensor +| CL_TENSOR_LAYOUT_OPAQUE_EXP | N/A a| The tensor doesn't have + application defined memory layout. Driver controls the tensors + layout. 
To read or write elements of the tensor, the application + must: + +* use *clEnqueueExportToTensor* and *clEnqueueImportFromTensor* (or their + command buffer variants) or +* use *clEnqueueCopyTensor* to copy elements to / from another tensor + object with an application-defined memory layout. | CL_TENSOR_LAYOUT_BLAS_EXP |<> | A type that describes a packed memory layout similar ones used in BLAS APIs. -| CL_TENSOR_LAYOUT_BLAS_EXP +| CL_TENSOR_LAYOUT_BLAS_PITCHED_EXP |<> | A type that describe memory layout similar ones used in BLAS APIs. -| CL_TENSOR_LAYOUT_ML_EXP | <> | -A convenience layout type over `CL_TENSOR_LAYOUT_BLAS_EXP`. +| CL_TENSOR_LAYOUT_ML_EXP | <> | + +The tensor layout is specified with an enumerator. Each enumerator +corresponds to a predefined configuration of +*cl_tensor_layout_blas_exp* structure. |=== @@ -878,15 +898,15 @@ typedef struct cl_tensor_layout_ml_exp { rank is greater than two. // ^ This condition is meant to ensure that the tensor elements at different -// coordinates don't alias. +// coordinates don't alias in memory. * _ml_type_ defines the memory layout via enumerators which corresponds to predefined configurations of `cl_tensor_layout_blas_exp` structure as listed in <> table. The memory layout descriptions map tensor coordinates to buffer's -memory byte locations respect to buffer's base address as followed in -pseudo C: +memory byte locations respect to buffer's base address as in the +followed in pseudo C code example: [source,c] ---- @@ -1047,7 +1067,7 @@ clEnqueueMapBuffer( | Version | Date | Author | Changes | 0.1.0 | 2023-11-23 | Henry Linjamäki | *Initial revision* -| 0.2.0 | 2024-8-14 | Henry Linjamäki | +| 0.2.0 | 2024-8-14 | Henry Linjamäki a| * Rework document structure match to the cl_khr_extension_template. From 294b1a19a357893da91a28330546e0500927a452 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 15 Aug 2024 14:09:49 +0300 Subject: [PATCH 24/26] Add people who gave feedback in the version history --- extensions/cl_exp_tensor.asciidoc | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/extensions/cl_exp_tensor.asciidoc b/extensions/cl_exp_tensor.asciidoc index f7a2b9c0..2cb4ad26 100644 --- a/extensions/cl_exp_tensor.asciidoc +++ b/extensions/cl_exp_tensor.asciidoc @@ -1060,14 +1060,19 @@ clEnqueueMapBuffer( == Version History -[cols="5,15,15,70"] +[cols="5,10,15,40"] [grid="rows"] [options="header"] |==== -| Version | Date | Author | Changes -| 0.1.0 | 2023-11-23 | Henry Linjamäki | *Initial revision* - -| 0.2.0 | 2024-8-14 | Henry Linjamäki a| +| Version | Date | Author | Changes +| 0.1.0 | 2023-11-23 | Henry Linjamäki | *Initial revision* + +| 0.2.0 | 2024-8-14 | +Henry Linjamäki + +Pekka Jääskeläinen + +Michal Babej + +Freddie Witherden +a| * Rework document structure match to the cl_khr_extension_template. From 2293467fd922a732cf97c91b214eab47811ab598 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 15 Aug 2024 14:17:24 +0300 Subject: [PATCH 25/26] Update html render (temporary) --- extensions/cl_exp_tensor.html | 132 ++++++++++++++++++++++------------ 1 file changed, 85 insertions(+), 47 deletions(-) diff --git a/extensions/cl_exp_tensor.html b/extensions/cl_exp_tensor.html index 29822a4d..db1045c9 100644 --- a/extensions/cl_exp_tensor.html +++ b/extensions/cl_exp_tensor.html @@ -535,7 +535,7 @@

[The remaining hunks of this patch update the generated cl_exp_tensor.html render. The textual changes mirror the cl_exp_tensor.asciidoc edits of patches 21-24 above; in addition, the "Built On" date in the rendered header moves from 2024-08-14 to 2024-08-15.]

        From 01a415857be1e13195ba4933bea4b92cfa9a2460 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Fri, 16 Aug 2024 09:28:52 +0300 Subject: [PATCH 26/26] Update extensions/cl_exp_tensor.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Pekka Jääskeläinen --- extensions/cl_exp_tensor.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/extensions/cl_exp_tensor.asciidoc b/extensions/cl_exp_tensor.asciidoc index 2cb4ad26..619cca5c 100644 --- a/extensions/cl_exp_tensor.asciidoc +++ b/extensions/cl_exp_tensor.asciidoc @@ -906,7 +906,7 @@ typedef struct cl_tensor_layout_ml_exp { The memory layout descriptions map tensor coordinates to buffer's memory byte locations respect to buffer's base address as in the -followed in pseudo C code example: +following pseudo C code example: [source,c] ----