From 7d5e350f1a196c193dac7b383ffb000b50caa95d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Wed, 25 Oct 2023 08:34:30 +0300 Subject: [PATCH 01/26] Sketch something for cl_khr_tensor --- ext/cl_khr_tensor.asciidoc | 547 ++++++++++++++++ ext/cl_khr_tensor.html | 1228 ++++++++++++++++++++++++++++++++++++ 2 files changed, 1775 insertions(+) create mode 100644 ext/cl_khr_tensor.asciidoc create mode 100644 ext/cl_khr_tensor.html diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc new file mode 100644 index 00000000..cd17a42b --- /dev/null +++ b/ext/cl_khr_tensor.asciidoc @@ -0,0 +1,547 @@ +// Copyright 2023 The Khronos Group. This work is licensed under a +// Creative Commons Attribution 4.0 International License; see +// http://creativecommons.org/licenses/by/4.0/ += cl_khr_tensor + +:source-highlighter: coreray + +[[cl_khr_tensor]] +== Tensor Data Type + +Purpose of this extension is to provide ... + +=== General information + +==== Name Strings + +`cl_khr_tensor` + +==== Version history + +[cols="1,1,3",options="header",] +|==== +| *Date* | *Version* | *Description* +| 2023-10-XX | 0.1.0 | First assigned version. +|==== + +==== Dependencies + +This extension is written against the OpenCL Specification version 3.0.14. + +This extension requires OpenCL 1.2 or later. + +This extension requires cl_khr_command_buffer. + +==== Contributors + +Henry Linjamäki, Intel. + + +=== Overview + + +=== Modifications to OpenCL + +==== New OpenCL Functions + +To create a tensor use: + +[source,c] +---- +cl_tensor clCreateTensor( + cl_context context, + const cl_tensor_peoperties *properties, + size_t rank, + size_t shape, + cl_tensor_type dtype, + cl_int *errcode_ret); +---- + +* _context_ is a valid OpenCL context used to create the tensor object. + +* _properties_ is an optional list of properties for the tensor object + and their corresponding values. The list is terminated with the + special property 0. If no properties are required, properties may be + NULL. + +* _rank_ is the number of dimensions. Zero value creates a "scalar" + tensor which has no dimensions but has storage for one element. + +* _shape_ is a list of sizes of the dimensions. The length of the list + must be _rank_ elements. _shape_ can be NULL if _rank_ value is + zero. All the first _rank_ values in the list must be non-zero. + +* _dtype_ is the element type of _tensor_. Refer to the + <> table for the types. + +* _errcode_ret_ may return an appropriate error code. If errcode_ret + is NULL, no error code is returned. + +clCreateTensor function creates a `rank`-dimensional tensor with +`shape[0] * shape[1] * ... * shape[rank-1]` elements of _dtype_ +type. At the creation time of the tensor, it does not have +storage. The storage is assigned to the tensor either by: + +* calling clCreateBufferWithProperties() with CL_MEM_BIND_TO_TENSOR or + +* automatically by command buffers - possibly on-demand basis - if the + tensor is created with CL_TENSOR_COMMAND_BUFFER_TEMPORARY property + set on. + +A command that refers to a tensor must be bound to a valid buffer +object before enqueuing the command into a command queue unless the +command is recorded in a command buffer and +CL_TENSOR_COMMAND_BUFFER_TEMPORARY is set to true. + +*clCreateTensor* returns a valid non-zero tensor object and errcode_ret +is set to CL_SUCCESS if the tensor object is created +successfully. 
Otherwise, they return a NULL value with one of the +following error values returned in errcode_ret: + +* CL_INVALID_CONTEXT if context is not a valid context. + +* CL_INVALID_PROPERTY if a property name in properties is not a + supported property name, if the value specified for a supported + property name is not valid, or if the same property name is + specified more than once. + +* CL_INVALID_VALUE if a value specified in dtype is invalid. + +* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources + required by the OpenCL implementation on the host. + +.Tensor element types +[cols="1,2",stripes=odd] +[#TensorDtypes] +|=== +| *Tensor element data type* | *Description* + +| CL_TENSOR_BOOL | 1-bit signedless integer. +| CL_TENSOR_INT8 | 8-bit signed integer. +| CL_TENSOR_INT16 | 16-bit signed integer. +| CL_TENSOR_INT32 | 32-bit signed integer. +| CL_TENSOR_INT64 | 64-bit signed integer. +| CL_TENSOR_UINT8 | 8-bit signed integer. +| CL_TENSOR_UINT16 | 16-bit signed integer. +| CL_TENSOR_UINT32 | 32-bit signed integer. +| CL_TENSOR_UINT64 | 64-bit signed integer. +| CL_TENSOR_HALF | Half precision floating-point value. +| CL_TENSOR_BFLOAT16 | 16-bit brain floating-point value. +| CL_TENSOR_FLOAT | Single precision floating-point value. +| CL_TENSOR_DOUBLE | Double precision floating-point value. +| CL_TENSOR_COMPLEX64 | 64-bit complex floating point value with + 32-bit real and imaginary part. +| CL_TENSOR_COMPLEX128 | 128-bit complex floating point value with + 64-bit real and imaginary part. +|=== + +.Tensor properties +[cols="2,1,2",stripes=odd] +|=== +| *Tensor Property* | *Property Value* | *Description* + +| CL_TENSOR_COMMAND_BUFFER_TEMPORARY | cl_bool + +a| If the value is true, create a "temporary" tensor that only can be +used on commands recorded in command buffers. The storage of the +temporary tensors are managed by command buffers. When a temporary +tensor is used by multiple command buffer, the tensor receive separate +storage for each command buffer. + +// IOW, Data may not be exchanged between command buffers through +// temporary tensors. + +Temporary tensors may not be bound to buffer objects. + +Data stored in temporary tensors are not preserved across command +buffer executions. +|=== + +To retain a tensor object, call the function + +[source,c] +---- +cl_int clRetainTensorObject( + cl_tensor tensor); +---- + +* _tensor_ is the tensor object to be retained. + +The _tensor_ reference count is incremented. + +*clRetainTensor* returns CL_SUCCESS if the function is executed +successfully. Otherwise, it returns one of the following errors: + +* CL_INVALID_TENSOR if tensor is not a valid tensor object. + +To release a tensor object, call the function + +[source,c] +---- +cl_int clReleaseTensorObject( + cl_tensor tensor); +---- + +* _tensor_ is the tensor object to be released. + +The _tensor_ reference count is decremented. + +The tensor object is deleted once the number of instances that are +retained to tensor become zero and the tensor object is no longer +needed by any enqueued or recorded commands that use _tensor_. Using +this function to release a reference that was not obtained by creating +the object or by calling *clRetainTensor* causes undefined behavior. + +*clReleaseTensor* returns CL_SUCCESS if the function is executed +successfully. Otherwise, it returns one of the following errors: + +* CL_INVALID_TENSOR if tensor is not a valid tensor object. + +// TODO: add clSetTensorObjectDestructorCallback? 
+ +To return information about a tensor object, call the function + +[source,c] +---- +cl_int clGetTensorInfo( + cl_tensor tensor, + cl_tensor_info param_name, + size_t param_value_size, + void* param_value, + size_t* param_value_size_ret); +---- + +* _tensor_ specifies the tensor object being queried. + +* _param_name_ specifies the information to query. The list of + supported param_name types and the information returned in + _param_value_ by clGetTensorInfo is described in the <> table. + +* _param_value_ is a pointer to memory where the appropriate result + being queried is returned. If _param_value_ is NULL, it is ignored. + +* _param_value_size_ is used to specify the size in bytes of memory + pointed to by _param_value_. This size must be ≥ size of return type + as described in the <> table. + +* _param_value_size_ret_ returns the actual size in bytes of data + being queried by _param_name_. If _param_value_size_ret_ is NULL, it is + ignored. + +*clGetTensorInfo* returns CL_SUCCESS if the function is executed + succesfully. Otherwise, it returns one of the following errors: + +* CL_INVALID_TENSOR if _tensor_ is not a valid tensor object. + +[#Tensor Object Quaries] +.List of supported param_names by clGetTensorInfo +[cols="2,1,2",stripes=odd] +|=== +| CL_TENSOR_RANK | size_t | Return the tensor rank. +| CL_TENSOR_SHAPE | size_t[] | Return the tensor shape. +| CL_TENSOR_DTYPE | cl_tensor_type | Return the tensor data type. + +| CL_TENSOR_COMMAND_BUFFER_TEMPORARY | cl_bool | Return true if the +tensor is temporary tensor for command buffers. + +| CL_TENSOR_BOUND_TO_BUFFER | cl_bool | Return true if the tensor is +bound to a buffer. If CL_TENSOR_COMMAND_BUFFER_TEMPORARY is true, then +CL_TENSOR_BOUND_TO_BUFFER must return false. + +| CL_TENSOR_BUFFER | cl_mem a| If CL_TENSOR_BOUND_TO_BUFFER is true, +return the buffer object the tensor is bound to. Otherwise, +clGetTensorInfo call returns: + +* CL_INVALID_MEM_OBJECT if the tensor is not bound to a buffer object. + +* CL_INVALID_PROPERTY otherwise. + +| CL_TENSOR_CONTEXT | cl_context | Return the context specified when + the tensor object is created. + +| CL_TENSOR_REFERENCE_COUNT | cl_uint | Return the tensor reference +count. +|=== + +To read from a tensor to host memory / buffer object or to write to a +tensor object from host memory / buffer object call one of the functions. + +[source,c] +---- +cl_int clEnqueueReadTensor( + cl_command_queue command_queue, + cl_tensor tensor, + cl_bool blocking_command, + cl_mem buffer, + void* host_ptr, + cl_uint num_events_in_wait_list, + const cl_event* event_wait_list, + cl_event* event); +---- + +[source,c] +---- +cl_int clEnqueueWriteTensor( + cl_command_queue command_queue, + cl_tensor tensor, + cl_bool blocking_command, + cl_mem buffer, + void* host_ptr, + cl_uint num_events_in_wait_list, + const cl_event* event_wait_list, + cl_event* event); +---- + +* _command_queue_ is a valid host command-queue in which the read / + write command will be queued. _command_queue_ and _tensor_ must be + created with the same OpenCL context. + +* _tensor_ refers to a valid tensor object which is bound to a buffer. + +* _blocking_command_ indicate if the read and write operations are + blocking or non-blocking (see below). + +* _buffer_ refers to a valid buffer object where data is to be + read into or to be written from when the value of _host_ptr_ is + NULL. If _host_ptr_ is non-NULL then value of _buffer_ is ignored. 
+ +* _host_ptr_ is the pointer to buffer in host memory where data is to + be read into or to be written from when the value is non-NULL. + +* _event_wait_list_ and _num_events_in_wait_list_ specify events that + need to complete before this particular command can be executed. If + _event_wait_list_ is NULL, then this particular command does not + wait on any event to complete. If _event_wait_list_ is NULL, + _num_events_in_wait_list_ must be 0. If _event_wait_list_ is not + NULL, the list of events pointed to by _event_wait_list_ must be + valid and _num_events_in_wait_list_ must be greater than 0. The + events specified in _event_wait_list_ act as synchronization + points. The context associated with events in _event_wait_list_ and + _command_queue_ must be the same. The memory associated with + _event_wait_list_ can be reused or freed after the function returns. + +* _event_ returns an event object that identifies this read / write + command and can be used to query or queue a wait for this command to + complete. If _event_ is NULL or the enqueue is unsuccessful, no + event will be created and therefore it will not be possible to query + the status of this command or to wait for this command to + complete. If _event_wait_list_ and _event_ are not NULL, _event_ + must not refer to an element of the _event_wait_list_ array. + +For a read and write operation, the elements of N-dimensional tensor are +related to host memory / buffer object as followed: + +---- +tensor.element(i0, i1, ..., i, i)) == (tensor.dtype)buffer_or_host_ptr[ + i0 * tensor.shape[1] * tensor.shape[2] * ... * tensor.shape[N-1] + + i1 * tensor.shape[2] * tensor.shape[3] * ... * tensor.shape[N-1] + + ... + + i * tensor.shape[i(N-1)] + + i] +---- + +Where `iX` is a tensor coordinate index with inclusive range of `0..`. + +// TODO: add clEnqueueCopyTensor + +// TODO: add clEnqueueFillTensor? + +// TODO: add command buffer variants for clEnqueue{copy,read,write}Tensor. + + +==== Add New Buffer Property in Section 5.2.1 + +[cols="2,1,2",stripes=odd] +|=== +| CL_MEM_BIND_TO_TENSOR | cl_tensor a| Use the created buffer as +storage for the given valid tensor. To succeed creating the buffer, +the target tensor may not have storage already, must not have +CL_TENSOR_COMMAND_BUFFER_TEMPORARY property set on and _size_ argument +of the clCreateBufferWithProperties() must be zero. + +Size of the memory buffer is implementation-defined and it can be +queried with clGetTensorInfo(). + +Memory layout of the tensor in the created memory buffer is +implementation-defined and opaque to the applications and it may +change at unspecified points. Implementation may store auxiliary data +in the memory buffer for the tensor. Therefore, writing data into the +memory buffer directly using the cl_mem handle leads to undefined +behavior. + +If the tensor is already bound to a buffer object, +clCreateBufferWithProperties call returns CL_TENSOR_BOUND_TO_BUFFER +error code. +|=== + +=== Sample Codes + +Helper functions used in the follow up tensor code samples: + +[source,c] +---- +cl_kernel create_matmul_kernel( + cl_context ctx, std::span device_span, + cl_tensor lhs, cl_tensor rhs, cl_tensor out) { + // A hypothetical matmul kernel signature in pseudo OpenCL C for + // illustrative purposes: + // + // kernel void matmul( + // global read_only tensor_t, + // global read_only tensor_t, + // global write_only tensor_t); + + cl_kernel matmul_kernel = /* Omitted. 
*/; + clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor), &lhs); + clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor), &rhs); + clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor), &out); + return matmul_kernel; +} + +cl_kernel create_matmul_kernel( + cl_context ctx, std::span device_span, + cl_tensor lhs, cl_tensor rhs, cl_tensor out) { + // A hypothetical add kernel signature in pseudo OpenCL C for illustrative + // purposes: + // + // kernel void add( + // global read_only tensor_t, + // global read_only tensor_t, + // global write_only tensor_t); + + cl_tensor add_kernel = /* Omitted. */; + clSetKernelArg(add_kernel, 0, sizeof(cl_tensor), &lhs); + clSetKernelArg(add_kernel, 1, sizeof(cl_tensor), &rhs); + clSetKernelArg(add_kernel, 2, sizeof(cl_tensor), &out); + return add_kernel; +} +---- +An example usage of tensors on a command queue: + +[source,c] +---- +constexpr size_t b = 64, m = 100, n = 200, k = 50; + +cl_tensor in0 = clCreateTensor(ctx, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, err); +cl_tensor in1 = clCreateTensor(ctx, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, err); +cl_tensor in2 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); +cl_tensor t0 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); +cl_tensor out = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); + +cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0); +cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out); + +// Allocate storage for the tensors. The buffer size must be set to zero +// when the buffer is bound to a tensor. OpenCL implementation may +// determine optimal data layout and the storage needed for it, based +// on the tensor's uses (matmul kernel in this sample) so far. +cl_int err; +cl_mem in0_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_BIND_TO_TENSOR, in0, 0}, CL_MEM_READ_ONLY, + 0 /* must be zero for CL_MEM_BIND_TO_TENSOR. */, nullptr, &err); +cl_mem in1_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_BIND_TO_TENSOR, in1, 0}, CL_MEM_READ_ONLY, + 0, nullptr, &err); +cl_mem in2_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_BIND_TO_TENSOR, in2, 0}, CL_MEM_READ_ONLY, + 0, nullptr, &err); +cl_mem t0_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE, + 0, nullptr, &err); +cl_mem out_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_BIND_TO_TENSOR, out, 0}, CL_MEM_WRITE_ONLY, + 0, nullptr, &err); + +std::vector in0_data = ...; +std::vector in1_data = ...; +std::vector out_data(b * m * n); + +// Copies data into in0 tensor while possibly rearranging the data to the +// optimal data layout. +clEnqueueWriteTensor( + cmd_q, in0, false, nullptr, nullptr, {b, m, k}, nullptr, in0_data.data(), + 0, nullptr, nullptr); + +clEnqueueWriteTensor( + cmd_q, in1, false, nullptr, nullptr, {b, k, n}, nullptr, in1_data.data(), + 0, nullptr, nullptr); +clEnqueueNDRangeKernel( + cmd_q, matmul_kernel, 0, nullptr, nullptr, nullptr, 0, nullptr, nullptr); +clEnqueueNDRangeKernel( + cmd_q, add_kernel, 0, nullptr, nullptr, nullptr, 0, nullptr, nullptr); +clEnqueueReadTensor( + cmd_q, out, false, nullptr, nullptr, {b, m, n}, nullptr, out_data.data(), + 0, nullptr, nullptr); +---- + +An example use of tensors in a command buffer when cl_khr_command_buffer +extension is supported: + +[source,c] +---- +constexpr size_t b = 64, m = 100, n = 200, k = 50; + +cl_int err; +// Create tensors which are used as temporaries in a command buffer. +// Command buffers allocate space for them as needed. 
+//
+// NOTE: same temporary tensor handle used in multiple command buffers
+//       will have separate storage. IOW, command buffers may not exchange
+//       data via temporary buffers between them.
+cl_tensor in0 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0},
+  3, {b, m, k}, CL_TENSOR_FLOAT, err);
+cl_tensor in1 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0},
+  3, {b, k, n}, CL_TENSOR_FLOAT, err);
+cl_tensor in2 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0},
+  3, {b, m, n}, CL_TENSOR_FLOAT, err);
+cl_tensor t0  = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0},
+  3, {b, m, n}, CL_TENSOR_FLOAT, err);
+cl_tensor out = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0},
+  3, {b, m, n}, CL_TENSOR_FLOAT, err);
+
+cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0);
+cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out);
+
+// Binding a buffer to temporary tensor is not allowed.
+auto ignored = clCreateBufferWithProperties(
+  ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE, 0, nullptr, &err);
+assert(err == CL_TENSOR_IS_TEMPORARY)
+
+std::vector<float> in0_data = ...;
+std::vector<float> in1_data = ...;
+std::vector<float> out_data(b * m * n);
+
+cl_command_buffer_khr cmd_b =
+  clCreateCommandBufferKHR(num_queues, queue_list, nullptr, &err);
+
+cl_sync_point_khr in0_syncp, in1_syncp, matmul_syncp, add_syncp;
+clCommandWriteTensorKHR(
+  cmd_b, cmd_q, in0, false, nullptr, nullptr, {b, m, k}, nullptr,
+  in0_data.data(), 0, nullptr, &in0_syncp);
+clCommandWriteTensorKHR(
+  cmd_b, cmd_q, in1, false, nullptr, nullptr, {b, k, n}, nullptr,
+  in1_data.data(), 0, nullptr, &in1_syncp);
+clCommandNDRangeKernelKHR(
+  cmd_b, cmd_q, nullptr, matmul_kernel, 0, nullptr, nullptr, nullptr,
+  2, {in0_syncp, in1_syncp}, &matmul_syncp, nullptr);
+clCommandNDRangeKernelKHR(
+  cmd_b, cmd_q, nullptr, add_kernel, 0, nullptr, nullptr, nullptr,
+  1, {matmul_syncp}, &add_syncp, nullptr);
+clCommandReadTensorKHR(
+  cmd_b, cmd_q, out, false, nullptr, nullptr, {b, m, n}, nullptr,
+  out_data.data(), 1, {add_syncp}, nullptr);
+
+// Finalize the command buffer. At this point the OpenCL
+// implementation may reserve enough storage for all the tensor
+// temporaries. Temporary tensors might be eliminated - for example,
+// the OpenCL implementation could use the 'out' tensor to store the
+// result of matmul_kernel, thus eliminating the need for the 't0' tensor.
+clFinalizeCommandBufferKHR(cmd_b);
+
+// Temporary tensors used in a command buffer can't be read or written
+// outside of it. One reason is that the finalized command buffer
+// might not use some of the tensors at all.
+assert(clEnqueueReadTensor(..., t0, ...) == CL_INVALID_OPERATION);
+----
+
+=== Open Questions ===
diff --git a/ext/cl_khr_tensor.html b/ext/cl_khr_tensor.html
new file mode 100644
index 00000000..87892548
--- /dev/null
+++ b/ext/cl_khr_tensor.html
@@ -0,0 +1,1228 @@
+ + + \ No newline at end of file From 1f0be1eb7b6ac4a0f6131569708940e1b6b87544 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 14:16:43 +0200 Subject: [PATCH 02/26] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Ben Ashbaugh Co-authored-by: Pekka Jääskeläinen --- ext/cl_khr_tensor.asciidoc | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index cd17a42b..1df37e9e 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -51,7 +51,7 @@ cl_tensor clCreateTensor( cl_context context, const cl_tensor_peoperties *properties, size_t rank, - size_t shape, + const size_t* shape, cl_tensor_type dtype, cl_int *errcode_ret); ---- @@ -88,7 +88,7 @@ storage. The storage is assigned to the tensor either by: set on. A command that refers to a tensor must be bound to a valid buffer -object before enqueuing the command into a command queue unless the +object before enqueuing the command that refers to the tensor into a command queue unless the command is recorded in a command buffer and CL_TENSOR_COMMAND_BUFFER_TEMPORARY is set to true. @@ -124,7 +124,7 @@ following error values returned in errcode_ret: | CL_TENSOR_UINT16 | 16-bit signed integer. | CL_TENSOR_UINT32 | 32-bit signed integer. | CL_TENSOR_UINT64 | 64-bit signed integer. -| CL_TENSOR_HALF | Half precision floating-point value. +| CL_TENSOR_HALF | Half precision floating-point. | CL_TENSOR_BFLOAT16 | 16-bit brain floating-point value. | CL_TENSOR_FLOAT | Single precision floating-point value. | CL_TENSOR_DOUBLE | Double precision floating-point value. @@ -144,7 +144,7 @@ following error values returned in errcode_ret: a| If the value is true, create a "temporary" tensor that only can be used on commands recorded in command buffers. The storage of the temporary tensors are managed by command buffers. When a temporary -tensor is used by multiple command buffer, the tensor receive separate +tensor is used by multiple command buffers, the tensor receives separate storage for each command buffer. // IOW, Data may not be exchanged between command buffers through @@ -171,7 +171,7 @@ The _tensor_ reference count is incremented. *clRetainTensor* returns CL_SUCCESS if the function is executed successfully. Otherwise, it returns one of the following errors: -* CL_INVALID_TENSOR if tensor is not a valid tensor object. +* CL_INVALID_TENSOR if the tensor is not a valid tensor object. To release a tensor object, call the function @@ -242,7 +242,7 @@ cl_int clGetTensorInfo( | CL_TENSOR_DTYPE | cl_tensor_type | Return the tensor data type. | CL_TENSOR_COMMAND_BUFFER_TEMPORARY | cl_bool | Return true if the -tensor is temporary tensor for command buffers. +tensor is a temporary tensor for command buffers. | CL_TENSOR_BOUND_TO_BUFFER | cl_bool | Return true if the tensor is bound to a buffer. If CL_TENSOR_COMMAND_BUFFER_TEMPORARY is true, then @@ -263,8 +263,8 @@ clGetTensorInfo call returns: count. |=== -To read from a tensor to host memory / buffer object or to write to a -tensor object from host memory / buffer object call one of the functions. +The following functions are for reading from a tensor to host memory / buffer object or to write to a +tensor object from host memory / buffer object. 
[source,c] ---- @@ -286,7 +286,7 @@ cl_int clEnqueueWriteTensor( cl_tensor tensor, cl_bool blocking_command, cl_mem buffer, - void* host_ptr, + const void* host_ptr, cl_uint num_events_in_wait_list, const cl_event* event_wait_list, cl_event* event); @@ -329,10 +329,10 @@ cl_int clEnqueueWriteTensor( must not refer to an element of the _event_wait_list_ array. For a read and write operation, the elements of N-dimensional tensor are -related to host memory / buffer object as followed: +related to host memory / buffer object as follows: ---- -tensor.element(i0, i1, ..., i, i)) == (tensor.dtype)buffer_or_host_ptr[ +tensor.element(i0, i1, ..., i, i) == (tensor.dtype)buffer_or_host_ptr[ i0 * tensor.shape[1] * tensor.shape[2] * ... * tensor.shape[N-1] + i1 * tensor.shape[2] * tensor.shape[3] * ... * tensor.shape[N-1] + ... + @@ -505,7 +505,7 @@ cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out); // Binding a buffer to temporary tensor is not allowed. auto ignored = clCreateBufferWithProperties( ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE, 0, nullptr, &err); -assert(err == CL_TENSOR_IS_TEMPORARY) +assert(err == CL_TENSOR_IS_TEMPORARY); std::vector in0_data = ...; std::vector in1_data = ...; From a801aaf4fcec40d31203799de9a3390bf427f957 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 09:15:20 +0200 Subject: [PATCH 03/26] * Add brief introduction. * cl_khr_tensor -> cl_exp_tensor. * Remove cl_khr_command_buffer requirement. --- ext/cl_khr_tensor.asciidoc | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 1df37e9e..05c7ad52 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -1,20 +1,25 @@ // Copyright 2023 The Khronos Group. This work is licensed under a // Creative Commons Attribution 4.0 International License; see // http://creativecommons.org/licenses/by/4.0/ -= cl_khr_tensor += cl_exp_tensor :source-highlighter: coreray -[[cl_khr_tensor]] +[[cl_exp_tensor]] == Tensor Data Type -Purpose of this extension is to provide ... +This extension provides a new opaque OpenCL datatype called +`cl_tensor`. It is used for storing N-dimensional tensor data in +implementation-defined memory layout which may be optimized based on +tensor's use cases. The datatype is designed to be efficiently used +within the `cl_khr_command_buffers` extension to capture task graphs +which can utilize tensors as input, output and temporary storage. === General information ==== Name Strings -`cl_khr_tensor` +`cl_exp_tensor` ==== Version history @@ -30,8 +35,6 @@ This extension is written against the OpenCL Specification version 3.0.14. This extension requires OpenCL 1.2 or later. -This extension requires cl_khr_command_buffer. - ==== Contributors Henry Linjamäki, Intel. + From b890c30db0c532169d5133bd5599e1af793214ca Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 09:18:38 +0200 Subject: [PATCH 04/26] Add contributors --- ext/cl_khr_tensor.asciidoc | 2 ++ 1 file changed, 2 insertions(+) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 05c7ad52..5cba054c 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -38,6 +38,8 @@ This extension requires OpenCL 1.2 or later. ==== Contributors Henry Linjamäki, Intel. + +Pekka Jääslkeläinen, Intel and Tampere University. + +Ben Ashbaugh, Intel. 
+ === Overview From fafb30b0dc9ec4381956009a901cd0a57644c9ce Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 09:19:16 +0200 Subject: [PATCH 05/26] * Fix name for add kernel creator --- ext/cl_khr_tensor.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 5cba054c..0115b054 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -403,7 +403,7 @@ cl_kernel create_matmul_kernel( return matmul_kernel; } -cl_kernel create_matmul_kernel( +cl_kernel create_add_kernel( cl_context ctx, std::span device_span, cl_tensor lhs, cl_tensor rhs, cl_tensor out) { // A hypothetical add kernel signature in pseudo OpenCL C for illustrative From 740f3f22d8d043f5acf409cffc8d207932a81558 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 09:21:51 +0200 Subject: [PATCH 06/26] * cl_tensor_type -> cl_tensor _datatype. * Fix signed -> unsigned. * Single line cl{Retain,Release}TensorObject declaration. --- ext/cl_khr_tensor.asciidoc | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 0115b054..bed45d97 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -56,8 +56,8 @@ cl_tensor clCreateTensor( cl_context context, const cl_tensor_peoperties *properties, size_t rank, - const size_t* shape, - cl_tensor_type dtype, + const size_t shape, + cl_tensor_datatype dtype, cl_int *errcode_ret); ---- @@ -125,17 +125,17 @@ following error values returned in errcode_ret: | CL_TENSOR_INT16 | 16-bit signed integer. | CL_TENSOR_INT32 | 32-bit signed integer. | CL_TENSOR_INT64 | 64-bit signed integer. -| CL_TENSOR_UINT8 | 8-bit signed integer. -| CL_TENSOR_UINT16 | 16-bit signed integer. -| CL_TENSOR_UINT32 | 32-bit signed integer. -| CL_TENSOR_UINT64 | 64-bit signed integer. +| CL_TENSOR_UINT8 | 8-bit unsigned integer. +| CL_TENSOR_UINT16 | 16-bit unsigned integer. +| CL_TENSOR_UINT32 | 32-bit unsigned integer. +| CL_TENSOR_UINT64 | 64-bit unsigned integer. | CL_TENSOR_HALF | Half precision floating-point. -| CL_TENSOR_BFLOAT16 | 16-bit brain floating-point value. -| CL_TENSOR_FLOAT | Single precision floating-point value. -| CL_TENSOR_DOUBLE | Double precision floating-point value. -| CL_TENSOR_COMPLEX64 | 64-bit complex floating point value with +| CL_TENSOR_BFLOAT16 | 16-bit brain floating-point. +| CL_TENSOR_FLOAT | Single precision floating-point. +| CL_TENSOR_DOUBLE | Double precision floating-point. +| CL_TENSOR_COMPLEX64 | 64-bit complex floating point with 32-bit real and imaginary part. -| CL_TENSOR_COMPLEX128 | 128-bit complex floating point value with +| CL_TENSOR_COMPLEX128 | 128-bit complex floating point with 64-bit real and imaginary part. |=== @@ -165,8 +165,7 @@ To retain a tensor object, call the function [source,c] ---- -cl_int clRetainTensorObject( - cl_tensor tensor); +cl_int clRetainTensorObject(cl_tensor tensor); ---- * _tensor_ is the tensor object to be retained. @@ -182,8 +181,7 @@ To release a tensor object, call the function [source,c] ---- -cl_int clReleaseTensorObject( - cl_tensor tensor); +cl_int clReleaseTensorObject(cl_tensor tensor); ---- * _tensor_ is the tensor object to be released. 
From 701daa3dc3f65a8d89f4608d75a7adf6f489f26f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 09:59:18 +0200 Subject: [PATCH 07/26] * clEnqueue(Read,Write)Tensor -> clEnqueue(TranslateFrom,TranslateTo)Tensor. * Clarify in clEnqueue{TranslateFrom,TranslateTo}Tensor that data read from / written to the tensor in opaque manner. --- ext/cl_khr_tensor.asciidoc | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index bed45d97..99e65370 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -271,7 +271,7 @@ tensor object from host memory / buffer object. [source,c] ---- -cl_int clEnqueueReadTensor( +cl_int clEnqueueTranslateFromTensor( cl_command_queue command_queue, cl_tensor tensor, cl_bool blocking_command, @@ -284,7 +284,7 @@ cl_int clEnqueueReadTensor( [source,c] ---- -cl_int clEnqueueWriteTensor( +cl_int clEnqueueTranslateToTensor( cl_command_queue command_queue, cl_tensor tensor, cl_bool blocking_command, @@ -331,8 +331,14 @@ cl_int clEnqueueWriteTensor( complete. If _event_wait_list_ and _event_ are not NULL, _event_ must not refer to an element of the _event_wait_list_ array. -For a read and write operation, the elements of N-dimensional tensor are -related to host memory / buffer object as follows: +The *clEnqueueTranslateToTensor* function copies contents of the buffer +object / host allocation to tensor's storage in +implementation-defined, opaque memory layout. The +*clEnqueueTranslateFromTensor* function copies data from tensor's +storage to buffer object / host allocation. + +The elements of buffer object / host allocation are mapped to tensor +coordinates as follows: ---- tensor.element(i0, i1, ..., i, i) == (tensor.dtype)buffer_or_host_ptr[ @@ -343,7 +349,11 @@ tensor.element(i0, i1, ..., i, i) == (tensor.dtype)buffer_or_host_ptr[ i] ---- -Where `iX` is a tensor coordinate index with inclusive range of `0..`. +Where `iX` is a tensor coordinate index with inclusive range of +`0..`. The `tensor.element()` represents an abstract +function that accesses a tensor element in its storage at given +coordinate. The method how the coordinates translate to tensor storage +addresses is unspecified. // TODO: add clEnqueueCopyTensor From aa9ead742ab84740e11c4ff371d51e0ebf94a538 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 13:38:29 +0200 Subject: [PATCH 08/26] Refactor command buffer temporary property out of tensor --- ext/cl_khr_tensor.asciidoc | 139 +++++++++++++++++++++---------------- 1 file changed, 78 insertions(+), 61 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 99e65370..0de088c7 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -66,7 +66,8 @@ cl_tensor clCreateTensor( * _properties_ is an optional list of properties for the tensor object and their corresponding values. The list is terminated with the special property 0. If no properties are required, properties may be - NULL. + NULL. This extension does not define any optional properties for + tensors. * _rank_ is the number of dimensions. Zero value creates a "scalar" tensor which has no dimensions but has storage for one element. @@ -84,18 +85,11 @@ cl_tensor clCreateTensor( clCreateTensor function creates a `rank`-dimensional tensor with `shape[0] * shape[1] * ... * shape[rank-1]` elements of _dtype_ type. At the creation time of the tensor, it does not have -storage. 
The storage is assigned to the tensor either by: - -* calling clCreateBufferWithProperties() with CL_MEM_BIND_TO_TENSOR or - -* automatically by command buffers - possibly on-demand basis - if the - tensor is created with CL_TENSOR_COMMAND_BUFFER_TEMPORARY property - set on. +storage. The storage is assigned to the tensor by calling +clCreateBufferWithProperties() with CL_MEM_BIND_TO_TENSOR. A command that refers to a tensor must be bound to a valid buffer -object before enqueuing the command that refers to the tensor into a command queue unless the -command is recorded in a command buffer and -CL_TENSOR_COMMAND_BUFFER_TEMPORARY is set to true. +object before enqueuing or recording the command. *clCreateTensor* returns a valid non-zero tensor object and errcode_ret is set to CL_SUCCESS if the tensor object is created @@ -139,28 +133,6 @@ following error values returned in errcode_ret: 64-bit real and imaginary part. |=== -.Tensor properties -[cols="2,1,2",stripes=odd] -|=== -| *Tensor Property* | *Property Value* | *Description* - -| CL_TENSOR_COMMAND_BUFFER_TEMPORARY | cl_bool - -a| If the value is true, create a "temporary" tensor that only can be -used on commands recorded in command buffers. The storage of the -temporary tensors are managed by command buffers. When a temporary -tensor is used by multiple command buffers, the tensor receives separate -storage for each command buffer. - -// IOW, Data may not be exchanged between command buffers through -// temporary tensors. - -Temporary tensors may not be bound to buffer objects. - -Data stored in temporary tensors are not preserved across command -buffer executions. -|=== - To retain a tensor object, call the function [source,c] @@ -244,12 +216,8 @@ cl_int clGetTensorInfo( | CL_TENSOR_SHAPE | size_t[] | Return the tensor shape. | CL_TENSOR_DTYPE | cl_tensor_type | Return the tensor data type. -| CL_TENSOR_COMMAND_BUFFER_TEMPORARY | cl_bool | Return true if the -tensor is a temporary tensor for command buffers. - | CL_TENSOR_BOUND_TO_BUFFER | cl_bool | Return true if the tensor is -bound to a buffer. If CL_TENSOR_COMMAND_BUFFER_TEMPORARY is true, then -CL_TENSOR_BOUND_TO_BUFFER must return false. +bound to a buffer. | CL_TENSOR_BUFFER | cl_mem a| If CL_TENSOR_BOUND_TO_BUFFER is true, return the buffer object the tensor is bound to. Otherwise, @@ -366,11 +334,34 @@ addresses is unspecified. [cols="2,1,2",stripes=odd] |=== +| CL_MEM_COMMAND_BUFFER_TEMPORARY | cl_bool + +a| This property can be set if *cl_khr_command_buffer* extension is +supported. + +If the value is true, create a "temporary" buffer object that only can +be used on commands recorded in command buffers. Non-recording +command enqueue functions must return CL_INVALID_OPERATION if the +command refers to a temporary buffer object. + +The temporary buffer objects are managed by command buffers. When a +temporary buffer object is used by multiple command buffer, the object +receives disjoint storage for each command buffer. + +// Consequently, Data may not be exchanged between command buffers through +// temporary buffers. + +Storage of the temporary buffer objects may be allocated on-demand +basis. At the times the buffer is not needed, OpenCL implementations +may reuse storage for other tasks within the command buffer. + +Contents of the temporary buffers are not guaranteed to be preserved +across command buffer executions. + | CL_MEM_BIND_TO_TENSOR | cl_tensor a| Use the created buffer as storage for the given valid tensor. 
To succeed creating the buffer, -the target tensor may not have storage already, must not have -CL_TENSOR_COMMAND_BUFFER_TEMPORARY property set on and _size_ argument -of the clCreateBufferWithProperties() must be zero. +the target tensor may not have storage already and _size_ +argument of the clCreateBufferWithProperties() must be zero. Size of the memory buffer is implementation-defined and it can be queried with clGetTensorInfo(). @@ -387,6 +378,26 @@ clCreateBufferWithProperties call returns CL_TENSOR_BOUND_TO_BUFFER error code. |=== +==== Add New Memory Object Query in Section 5.5.5 + +[cols="2,1,2",stripes=odd] +|=== +| CL_MEM_COMMAND_BUFFER_TEMPORARY | cl_bool | This property can be +queried if *cl_khr_command_buffer* extension is supported. + +Return true if the _memobj_ is temporary buffer object for command +buffers. +|=== + +==== Add New Error Codes in Appendix F + +[cols="2,3", stripes=odd] +|=== +| CL_TENSOR_BOUND_TO_BUFFER | Returned when attempting to bind a + buffer object to a tensor which already has been bound to the same + or another. +|=== + === Sample Codes Helper functions used in the follow up tensor code samples: @@ -495,30 +506,36 @@ extension is supported: constexpr size_t b = 64, m = 100, n = 200, k = 50; cl_int err; -// Create tensors which are used as temporaries in a command buffer. -// Command buffers allocate space for them as needed. -// -// NOTE: same temporary tensor handle used in multiple command buffers -// will have separate storage. IOW, command buffers may not exchange -// data via temporary buffers between them. -cl_tensor in0 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0}, - 3, {b, m, k}, CL_TENSOR_FLOAT, err); -cl_tensor in1 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0}, - 3, {b, k, n}, CL_TENSOR_FLOAT, err); -cl_tensor in2 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0}, - 3, {b, m, n}, CL_TENSOR_FLOAT, err); -cl_tensor t0 = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0}, - 3, {b, m, n}, CL_TENSOR_FLOAT, err); -cl_tensor out = clCreateTensor(ctx, {CL_TENSOR_COMMAND_BUFFER_TEMPORARY, true, 0}, - 3, {b, m, n}, CL_TENSOR_FLOAT, err); +cl_tensor in0 = clCreateTensor(ctx, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, err); +cl_tensor in1 = clCreateTensor(ctx, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, err); +cl_tensor in2 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); +cl_tensor t0 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); +cl_tensor out = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0); cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out); -// Binding a buffer to temporary tensor is not allowed. -auto ignored = clCreateBufferWithProperties( - ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE, 0, nullptr, &err); -assert(err == CL_TENSOR_IS_TEMPORARY); +// Bind command buffer managed storage to tensors. +// +// NOTE: same temporary tensor handle used in multiple command buffers +// will have separate storage. IOW, command buffers may not exchange +// data via temporary buffers between them. +cl_mem in0_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in0, 0}, + CL_MEM_READ_ONLY, 0 /* must be zero for CL_MEM_BIND_TO_TENSOR. 
*/, + nullptr, &err); +cl_mem in1_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in1, 0}, + CL_MEM_READ_ONLY, 0, nullptr, &err); +cl_mem in2_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in2, 0}, + CL_MEM_READ_ONLY, 0, nullptr, &err); +cl_mem t0_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, t0, 0}, + CL_MEM_READ_WRITE, 0, nullptr, &err); +cl_mem out_mem = clCreateBufferWithProperties( + ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, out, 0}, + CL_MEM_WRITE_ONLY, 0, nullptr, &err); std::vector in0_data = ...; std::vector in1_data = ...; From 88a0a84709923d042b4b7dcbf19a36576e0b421f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 13:41:39 +0200 Subject: [PATCH 09/26] Fix cl_tensor_type -> cl_tensor_datatype --- ext/cl_khr_tensor.asciidoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 0de088c7..22a6cd00 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -56,7 +56,7 @@ cl_tensor clCreateTensor( cl_context context, const cl_tensor_peoperties *properties, size_t rank, - const size_t shape, + const size_t* shape, cl_tensor_datatype dtype, cl_int *errcode_ret); ---- @@ -212,9 +212,9 @@ cl_int clGetTensorInfo( .List of supported param_names by clGetTensorInfo [cols="2,1,2",stripes=odd] |=== -| CL_TENSOR_RANK | size_t | Return the tensor rank. -| CL_TENSOR_SHAPE | size_t[] | Return the tensor shape. -| CL_TENSOR_DTYPE | cl_tensor_type | Return the tensor data type. +| CL_TENSOR_RANK | size_t | Return the tensor rank. +| CL_TENSOR_SHAPE | size_t[] | Return the tensor shape. +| CL_TENSOR_DTYPE | cl_tensor_datatype | Return the tensor data type. | CL_TENSOR_BOUND_TO_BUFFER | cl_bool | Return true if the tensor is bound to a buffer. From 37fe00630d7932a5b871eeb56f1288ba207dc583 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 13:42:16 +0200 Subject: [PATCH 10/26] Add an open question --- ext/cl_khr_tensor.asciidoc | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 22a6cd00..e91f81df 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -575,3 +575,10 @@ assert(clEnqueueReadTensor(..., t0, ...) == CL_INVALID_OPERATION); ---- === Open Questions === + +. Should we have support for tensors with undefined shape and tensors + with unknown / symbolic dimension sizes like in ONNX? + +// https://onnx.ai/onnx/repo-docs/ShapeInference.html + +*UNRESOLVED* From 0a43252c1cbc5f5fec745fd83051a38cff822908 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 13:46:58 +0200 Subject: [PATCH 11/26] Add CL_INVALID_TENSOR error code --- ext/cl_khr_tensor.asciidoc | 2 ++ 1 file changed, 2 insertions(+) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index e91f81df..1b2a9686 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -396,6 +396,8 @@ buffers. | CL_TENSOR_BOUND_TO_BUFFER | Returned when attempting to bind a buffer object to a tensor which already has been bound to the same or another. +| CL_INVALID_TENSOR | Returned then the specified tensor is not a + valid tensor object. 
|=== === Sample Codes From d10d149267045549ffb1e16286d04445e3cadc68 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 13:59:25 +0200 Subject: [PATCH 12/26] Require either buffer or host_ptr to be non-NULL --- ext/cl_khr_tensor.asciidoc | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/ext/cl_khr_tensor.asciidoc b/ext/cl_khr_tensor.asciidoc index 1b2a9686..f1437dd3 100644 --- a/ext/cl_khr_tensor.asciidoc +++ b/ext/cl_khr_tensor.asciidoc @@ -272,12 +272,10 @@ cl_int clEnqueueTranslateToTensor( * _blocking_command_ indicate if the read and write operations are blocking or non-blocking (see below). -* _buffer_ refers to a valid buffer object where data is to be - read into or to be written from when the value of _host_ptr_ is - NULL. If _host_ptr_ is non-NULL then value of _buffer_ is ignored. - -* _host_ptr_ is the pointer to buffer in host memory where data is to - be read into or to be written from when the value is non-NULL. +* _buffer_ and _host_ptr_ refer to a valid buffer object / host + allocation where data is to be read into or to be written from. + Either the _buffer_ or _host_ptr_ can be non-NULL in which case the + non-NULL argument is used as the operand for the operation. * _event_wait_list_ and _num_events_in_wait_list_ specify events that need to complete before this particular command can be executed. If From f40eedaa57a2252d4e00ca66cf3a49969b700fb3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 2 Nov 2023 14:27:14 +0200 Subject: [PATCH 13/26] Regenerate html for cl_exp_tensor --- ext/cl_khr_tensor.html | 298 +++++++++++++++++++++++------------------ 1 file changed, 168 insertions(+), 130 deletions(-) diff --git a/ext/cl_khr_tensor.html b/ext/cl_khr_tensor.html index 87892548..c232ddea 100644 --- a/ext/cl_khr_tensor.html +++ b/ext/cl_khr_tensor.html @@ -5,7 +5,7 @@ -cl_khr_tensor +cl_exp_tensor + +
= cl_exp_tensor

== Tensor Data Type

This extension provides a new buffer abstraction - tensor objects - for
managing N-dimensional data.

=== General information

XXX - Not complete yet!!!

==== Name Strings

`cl_exp_tensor`
==== Contact

TODO

==== Contributors

Henry Linjamäki, Intel. +
Pekka Jääskeläinen, Intel and Tampere University. +
Ben Ashbaugh, Intel.

==== Notice

TODO

==== Status

Draft spec, NOT APPROVED!!

==== Version

Built On: 2024-08-14 +
Version: 0.2.0

==== Dependencies

This extension is written against the OpenCL Specification version 3.0.14.

This extension requires OpenCL 1.2 or later.

=== Overview

The extension provides a new tensor object abstraction. Tensor objects
are similar to image types in that they represent N-dimensional data of
an application-chosen data type and may be mapped to dedicated
hardware, except that:

* higher than 3-dimensional data can be supported (limited by the
  devices' capabilities).

* applications may choose how the data elements of the tensors are
  laid out in the buffers using the tensor layout descriptions
  provided in this extension.

Applications may also leave the memory layouts of the tensors
implementation-specified, letting the driver optimize the tensor data
layout for better performance or lay out the data as required by
hardware functions (e.g. exposed via built-in kernels).

The scope of this extension is to provide host APIs for creating
tensor objects and transferring data between tensors, the host and
other memory objects.

A separate extension implemented on top of this extension,
cl_exp_defined_builtin_kernels, provides "defined built-in kernels"
(DBKs) which operate on tensors. It also provides a mechanism for
drivers to create DBKs that are optimized for the tensor arguments
they operate on.
=== New API Functions

[source,c]
----
cl_int clEnqueueImportFromTensorEXP(
  cl_command_queue command_queue,
  cl_tensor tensor,
  cl_bool blocking_command,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  void* host_ptr,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);

cl_int clEnqueueExportToTensorEXP(
  cl_command_queue command_queue,
  cl_tensor tensor,
  cl_bool blocking_command,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  const void* host_ptr,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);

cl_int clEnqueueCopyTensorEXP(
  cl_command_queue command_queue,
  cl_tensor src_tensor,
  cl_tensor dst_tensor,
  const cl_tensor_shape* src_origin,
  const cl_tensor_shape* dst_origin,
  const cl_tensor_shape* region,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);

cl_int clCommandImportFromTensorEXP(
  cl_command_buffer_khr command_buffer,
  cl_command_queue command_queue,
  cl_tensor tensor,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  void* host_ptr,
  cl_uint num_sync_points_in_wait_list,
  const cl_sync_point_khr* sync_point_wait_list,
  cl_sync_point_khr* sync_point,
  cl_mutable_command_khr* mutable_handle);

cl_int clCommandExportToTensorEXP(
  cl_command_buffer_khr command_buffer,
  cl_command_queue command_queue,
  cl_tensor tensor,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  const void* host_ptr,
  cl_uint num_sync_points_in_wait_list,
  const cl_sync_point_khr* sync_point_wait_list,
  cl_sync_point_khr* sync_point,
  cl_mutable_command_khr* mutable_handle);
----
=== New API Types

[source,c]
----
typedef cl_uint cl_tensor_layout_type_exp;
typedef cl_uint cl_tensor_dim_exp;
typedef cl_uint cl_tensor_layout_ml_type_exp;
typedef cl_properties cl_tensor_properties_exp;

#define CL_TENSOR_DESC_MAX_RANK_EXP       20u
#define CL_TENSOR_DESC_MAX_PROPERTIES_EXP 16u

typedef struct cl_tensor_desc_exp {
    cl_uint                   rank;
    cl_tensor_datatype        dtype;
    cl_tensor_properties_exp  properties[CL_TENSOR_DESC_MAX_PROPERTIES_EXP];
    cl_tensor_shape           shape[CL_TENSOR_DESC_MAX_RANK_EXP];
    const void*               layout;
    cl_tensor_layout_type_exp layout_type;
} cl_tensor_desc_exp;

typedef struct cl_tensor_layout_blas_exp {
    cl_tensor_dim_exp    leading_dims[CL_TENSOR_DESC_MAX_RANK_EXP];
} cl_tensor_layout_blas_exp;

typedef struct cl_tensor_layout_blas_pitched_exp {
    cl_tensor_dim_exp    leading_dims[CL_TENSOR_DESC_MAX_RANK_EXP];
    cl_tensor_stride     leading_strides[CL_TENSOR_DESC_MAX_RANK_EXP];
} cl_tensor_layout_blas_pitched_exp;

typedef struct cl_tensor_layout_ml_exp {
    cl_tensor_layout_ml_type_exp ml_type;
} cl_tensor_layout_ml_exp;
----

=== New API Enums

Accepted value for the _properties_ parameter to
*clCreateBufferWithProperties* for creating a tensor object:
[source,c]
----
CL_MEM_TENSOR_EXP               0x????
----

Accepted values for the _param_name_ parameter to *clGetDeviceInfo*:

[source,c]
----
CL_DEVICE_MAX_TENSOR_ARGS_EXP     0x????
CL_DEVICE_MAX_TENSOR_RANK_EXP     0x????
CL_DEVICE_MAX_TENSOR_ELEMENTS_EXP 0x????
CL_DEVICE_MAX_TENSOR_STRIDE_EXP   0x????
----

Accepted values for the cl_tensor_datatype type:

[source,c]
----
CL_TENSOR_DTYPE_BOOL_EXP        0x????

CL_TENSOR_DTYPE_INT4_EXP        0x????
CL_TENSOR_DTYPE_INT8_EXP        0x????
CL_TENSOR_DTYPE_INT16_EXP       0x????
CL_TENSOR_DTYPE_INT32_EXP       0x????
CL_TENSOR_DTYPE_INT64_EXP       0x????

CL_TENSOR_DTYPE_UINT4_EXP       0x????
CL_TENSOR_DTYPE_UINT8_EXP       0x????
CL_TENSOR_DTYPE_UINT16_EXP      0x????
CL_TENSOR_DTYPE_UINT32_EXP      0x????
CL_TENSOR_DTYPE_UINT64_EXP      0x????

CL_TENSOR_DTYPE_FP8_EXP         0x????
CL_TENSOR_DTYPE_FP16_EXP        0x????
CL_TENSOR_DTYPE_FP32_EXP        0x????
CL_TENSOR_DTYPE_FP64_EXP        0x????

CL_TENSOR_DTYPE_BFLOAT16_EXP    0x????

CL_TENSOR_DTYPE_COMPLEX64_EXP   0x????
CL_TENSOR_DTYPE_COMPLEX128_EXP  0x????
----

Accepted values for cl_tensor_layout_type_exp:

[source,c]
----
CL_TENSOR_LAYOUT_OPAQUE_EXP       0x????
CL_TENSOR_LAYOUT_BLAS_EXP         0x????
CL_TENSOR_LAYOUT_BLAS_PITCHED_EXP 0x????
CL_TENSOR_LAYOUT_ML_EXP           0x????
----

Accepted values for cl_tensor_layout_ml_type_exp:

[source,c]
----
CL_TENSOR_LAYOUT_ML_C_EXP       0x????
CL_TENSOR_LAYOUT_ML_NC_EXP      0x????
CL_TENSOR_LAYOUT_ML_CN_EXP      0x????
CL_TENSOR_LAYOUT_ML_HW_EXP      0x????
CL_TENSOR_LAYOUT_ML_CHW_EXP     0x????
CL_TENSOR_LAYOUT_ML_NCHW_EXP    0x????
CL_TENSOR_LAYOUT_ML_NHWC_EXP    0x????
----

New error codes:
[source,c]
----
CL_INVALID_TENSOR_RANK_EXP   0x????
CL_INVALID_TENSOR_DTYPE_EXP  0x????
CL_INVALID_TENSOR_SHAPE_EXP  0x????
CL_INVALID_TENSOR_LAYOUT_EXP 0x????
----

=== Modifications to The OpenCL API Specification

(Modify Section 4.2, Querying Devices)

(Add the following to Table 5., List of supported param_names by clGetDeviceInfo)

[cols="2,1,2",stripes=odd]
|===
| *Device Info* | *Return Type* | *Description*

| CL_DEVICE_MAX_TENSOR_ARGS_EXP | cl_uint
| Maximum number of tensor object arguments that can be specified as
  arguments to a kernel.

| CL_DEVICE_MAX_TENSOR_RANK_EXP | cl_uint
| Maximum tensor rank. The minimum value is 4.

| CL_DEVICE_MAX_TENSOR_ELEMENTS_EXP | size_t
| Maximum number of tensor elements in total. The minimum value is
  65536.

| CL_DEVICE_MAX_TENSOR_PITCH_EXP | size_t
| Maximum pitch value for all pitch components for the
  CL_TENSOR_LAYOUT_BLAS_PITCHED_EXP memory layout. The minimum value is
  65536.
|===
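As an informal illustration of the queries above, the sketch below
reads the new device limits with *clGetDeviceInfo* and checks them
against an intended tensor shape. It assumes `device` is a valid
`cl_device_id` for an implementation that defines the
CL_DEVICE_MAX_TENSOR_*_EXP enumerators (their values are still
placeholders in this draft).

[source,c]
----
/* Sketch: query the tensor-related device limits added by this
 * extension and check a candidate 4-dimensional shape against them. */
cl_uint max_tensor_rank = 0;
size_t max_tensor_elements = 0;
cl_int err;

err = clGetDeviceInfo(device, CL_DEVICE_MAX_TENSOR_RANK_EXP,
                      sizeof(max_tensor_rank), &max_tensor_rank, NULL);
if (err == CL_SUCCESS)
  err = clGetDeviceInfo(device, CL_DEVICE_MAX_TENSOR_ELEMENTS_EXP,
                        sizeof(max_tensor_elements),
                        &max_tensor_elements, NULL);

/* A rank-4, 8 x 3 x 224 x 224 tensor is representable if
 * 4 <= max_tensor_rank and 8*3*224*224 <= max_tensor_elements. */
int shape_ok = (err == CL_SUCCESS) &&
               (4 <= max_tensor_rank) &&
               ((size_t)8 * 3 * 224 * 224 <= max_tensor_elements);
(void)shape_ok;
----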
(Modify Section 5.2.1, Creating Buffer Objects)

(Add the following to Table 18., Buffer creation properties)

[cols="2,1,2",stripes=odd]
|===
| *cl_mem_properties* | *Property Value* | *Description*

| CL_MEM_TENSOR_EXP | cl_tensor_desc_exp
a| Creates a tensor object with properties set in the
cl_tensor_desc_exp tensor description structure.

The _size_ parameter of clCreateBufferWithProperties() is ignored and
may be set to zero. The required storage space is inferred from the
tensor description. The storage size can be queried with
clGetMemObjectInfo(). The storage size may change during runtime
unless constrained by the given tensor description.
|===

(Add to the list of error codes for clCreateBufferWithProperties())

* CL_INVALID_VALUE if the CL_MEM_TENSOR_EXP property is specified and
  the rank member of the cl_tensor_desc_exp structure has an invalid or
  unsupported value.

* CL_INVALID_TENSOR_SHAPE_EXP if the CL_MEM_TENSOR_EXP property is
  specified and the shape member of the cl_tensor_desc_exp structure
  has an invalid or unsupported description.

* CL_INVALID_TENSOR_LAYOUT_TYPE_EXP if the CL_MEM_TENSOR_EXP property
  is specified and the layout_type member of the cl_tensor_desc_exp
  structure has an invalid enumeration constant.

* CL_INVALID_TENSOR_LAYOUT_EXP if the CL_MEM_TENSOR_EXP property is
  specified and the layout member of the cl_tensor_desc_exp has an
  invalid description.
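For illustration, the following non-normative sketch creates an opaque,
driver-laid-out 3-dimensional FP32 tensor through
clCreateBufferWithProperties. It assumes `ctx` is an existing
cl_context for a device that supports this extension. Note that this
draft does not yet pin down how the cl_tensor_desc_exp value is encoded
in the properties list; the sketch assumes a pointer to the descriptor
is passed as the property value, which is an assumption rather than
normative behavior.

[source,c]
----
#include <stdint.h>   /* uintptr_t, for the pointer-as-property cast */

/* Sketch: describe a 64 x 100 x 50 FP32 tensor with an opaque layout. */
cl_int err;
cl_tensor_desc_exp desc = {0};
desc.rank = 3;
desc.dtype = CL_TENSOR_DTYPE_FP32_EXP;
desc.properties[0] = 0;          /* no optional properties            */
desc.shape[0] = 64;
desc.shape[1] = 100;
desc.shape[2] = 50;
desc.layout = NULL;              /* ignored for the opaque layout      */
desc.layout_type = CL_TENSOR_LAYOUT_OPAQUE_EXP;

/* Assumption: the descriptor is passed by pointer as the value of the
 * CL_MEM_TENSOR_EXP property. */
cl_mem_properties props[] = {
  CL_MEM_TENSOR_EXP, (cl_mem_properties)(uintptr_t)&desc,
  0
};

/* size must be zero; the storage size is inferred from the description
 * and can be queried afterwards. */
cl_mem tensor_mem = clCreateBufferWithProperties(
  ctx, props, CL_MEM_READ_WRITE, 0, NULL, &err);

size_t storage_size = 0;
clGetMemObjectInfo(tensor_mem, CL_MEM_SIZE, sizeof(storage_size),
                   &storage_size, NULL);
----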
(Add the following to Section 5.2.2, Reading, Writing and Copying Buffer Objects)

The following functions read data from a tensor into host memory or a
buffer object, or write data to a tensor object from host memory or a
buffer object.

[source,c]
----
cl_int clEnqueueImportFromTensorEXP(
  cl_command_queue command_queue,
  cl_tensor tensor,
  cl_bool blocking_command,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  void* host_ptr,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);

cl_int clEnqueueExportToTensorEXP(
  cl_command_queue command_queue,
  cl_tensor tensor,
  cl_bool blocking_command,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  const void* host_ptr,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);
----

* _command_queue_ is a valid host command-queue in which the read /
  write command will be queued. _command_queue_ and _tensor_ must be
  created with the same OpenCL context.

* _tensor_ refers to a valid tensor object which is bound to a buffer.

* _blocking_command_ indicates if the read and write operations are
  blocking or non-blocking (see below).

* _tensor_origin_ defines the offset coordinates in the tensor for the
  start of the region to read / write tensor data. The length of the
  array must be at least the rank of the tensor.

* _mem_origin_ defines the offset coordinates in the memory region
  pointed to by _buffer_ or _host_ptr_, expressed in elements of the
  tensor data type. The length of the array must be at least the rank
  of the tensor.

* _region_ defines the region being read or written, expressed in
  elements of the tensor data type. The length of the array must be at
  least the rank of the tensor. If _region_ is NULL, the tensor's shape
  is used as the region.

* _mem_pitch_ defines the length of each dimension, in elements, to be
  used for the memory region of _buffer_ or _host_ptr_. The length of
  the array must be at least the rank of the tensor minus one. If
  _mem_pitch_ is NULL or _mem_pitch[i]_ is zero, _mem_pitch[i]_ is
  computed as _region[i + 1]_.

* _buffer_ and _host_ptr_ refer to a valid buffer object / host
  allocation where data is to be read into or to be written from.
  Either _buffer_ or _host_ptr_ can be non-NULL, in which case the
  non-NULL argument is used as the operand for the operation.

* _event_wait_list_ and _num_events_in_wait_list_ specify events that
  need to complete before this particular command can be executed. If
  _event_wait_list_ is NULL, then this particular command does not wait
  on any event to complete. If _event_wait_list_ is NULL,
  _num_events_in_wait_list_ must be 0. If _event_wait_list_ is not
  NULL, the list of events pointed to by _event_wait_list_ must be
  valid and _num_events_in_wait_list_ must be greater than 0. The
  events specified in _event_wait_list_ act as synchronization points.
  The context associated with events in _event_wait_list_ and
  _command_queue_ must be the same. The memory associated with
  _event_wait_list_ can be reused or freed after the function returns.

* _event_ returns an event object that identifies this read / write
  command and can be used to query or queue a wait for this command to
  complete. If _event_ is NULL or the enqueue is unsuccessful, no event
  will be created and therefore it will not be possible to query the
  status of this command or to wait for this command to complete. If
  _event_wait_list_ and _event_ are not NULL, _event_ must not refer to
  an element of the _event_wait_list_ array.

The *clEnqueueExportToTensorEXP* function copies the contents of the
buffer object / host allocation to the tensor's storage in an
implementation-defined, opaque memory layout. The
*clEnqueueImportFromTensorEXP* function copies data from the tensor's
storage to the buffer object / host allocation.

The elements of the buffer object / host allocation are mapped to
tensor coordinates, and vice versa, as follows in pseudo C code:
[source,c]
----
tensor_element(
  tensor,
  tensor_origin[0] + i[0],
  tensor_origin[1] + i[1],
  ...,
  tensor_origin[N-2] + i[N-2],
  tensor_origin[N-1] + i[N-1]) ==
((TENSOR_DATATYPE *)buffer_or_host_ptr)[
  (mem_origin[0] + i[0]) * pitch(0) +
  (mem_origin[1] + i[1]) * pitch(1) +
  ... +
  (mem_origin[N-2] + i[N-2]) * pitch(N-2) +
  (mem_origin[N-1] + i[N-1])];
----

Where N is the tensor rank, i[X] is a tensor coordinate with the
inclusive range 0..<region[X]-1> and the pitch is computed as follows
in pseudo C code:

[source,c]
----
size_t pitch(size_t dim) {
  size_t pitch = 1;
  for (size_t i = dim; i < tensor_rank - 1; i++)
    pitch *=
      (mem_pitch != NULL && mem_pitch[i] != 0) ? mem_pitch[i] : region[i + 1];
  return pitch;
}
----

For dim in 0..(tensor_rank()-1). The tensor_element() represents an
abstract function that accesses a tensor element in its storage at the
given coordinate. The method by which the coordinates translate to
tensor storage addresses is unspecified.

*clEnqueueImportFromTensorEXP* and *clEnqueueExportToTensorEXP* return
CL_SUCCESS if the function is executed successfully. Otherwise, they
return one of the following errors:

* CL_INVALID_COMMAND_QUEUE if _command_queue_ is not a valid host
  command-queue.

* CL_INVALID_CONTEXT if the context associated with _command_queue_
  and _buffer_ are not the same or if the context associated with
  _command_queue_ and events in _event_wait_list_ are not the same.

* CL_INVALID_MEM_OBJECT if _buffer_ is not a valid buffer object.

* CL_INVALID_VALUE if _tensor_origin_ or _mem_origin_ is NULL.

* CL_INVALID_VALUE if the region being read or written specified by
  (_mem_origin_, _region_, _mem_pitch_) is out of bounds.

* CL_INVALID_VALUE if any _region_ array element is 0.

* CL_INVALID_VALUE if _mem_pitch_ is not NULL and _mem_pitch[i]_ is
  not 0 and _mem_pitch[i]_ is less than _region[i]_.

* CL_INVALID_VALUE if _buffer_ and _host_ptr_ are both NULL or both
  non-NULL.

* CL_INVALID_EVENT_WAIT_LIST if _event_wait_list_ is NULL and
  _num_events_in_wait_list_ > 0, or _event_wait_list_ is not NULL and
  _num_events_in_wait_list_ is 0, or if event objects in
  _event_wait_list_ are not valid events.

* CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST if the read and write
  operations are blocking and the execution status of any of the
  events in _event_wait_list_ is a negative integer value.

* CL_OUT_OF_RESOURCES if there is a failure to allocate resources
  required by the OpenCL implementation on the device.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources
  required by the OpenCL implementation on the host.
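The sketch below shows a host round trip with these two functions:
writing a whole tensor from a packed host array and reading it back.
It is non-normative and assumes `queue` is a host command-queue and
`tensor` is the handle of a 64 x 100 x 50 FP32 tensor created as shown
earlier; how the cl_tensor handle relates to the buffer object created
with CL_MEM_TENSOR_EXP is still an open point in this draft, so the
handle is simply assumed to be available.

[source,c]
----
/* Sketch: upload a full 64 x 100 x 50 FP32 tensor from host memory and
 * download it again. host_in and host_out are float[64*100*50]. */
size_t tensor_origin[3] = {0, 0, 0};
size_t mem_origin[3]    = {0, 0, 0};
size_t region[3]        = {64, 100, 50};
cl_int err;

/* Host -> tensor. mem_pitch == NULL means the host region is tightly
 * packed (pitches derived from region). buffer is NULL, so host_ptr is
 * the operand of the operation. */
err = clEnqueueExportToTensorEXP(queue, tensor, CL_TRUE,
                                 tensor_origin, mem_origin, region,
                                 NULL /* mem_pitch */,
                                 NULL /* buffer */, host_in,
                                 0, NULL, NULL);

/* Tensor -> host, e.g. after kernels or DBKs have updated the tensor. */
err = clEnqueueImportFromTensorEXP(queue, tensor, CL_TRUE,
                                   tensor_origin, mem_origin, region,
                                   NULL, NULL, host_out,
                                   0, NULL, NULL);
----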
To copy elements from one tensor to another use:

[source,c]
----
cl_int clEnqueueCopyTensorEXP(
  cl_command_queue command_queue,
  cl_tensor src_tensor,
  cl_tensor dst_tensor,
  const cl_tensor_shape* src_origin,
  const cl_tensor_shape* dst_origin,
  const cl_tensor_shape* region,
  cl_uint num_events_in_wait_list,
  const cl_event* event_wait_list,
  cl_event* event);
----

* _command_queue_ is a valid host command-queue in which the copy
  command will be queued. _command_queue_ and the tensors must be
  created with the same OpenCL context.

* _src_tensor_ and _dst_tensor_ refer to valid buffer objects created
  with CL_MEM_TENSOR_EXP. Tensor elements are copied from _src_tensor_
  to _dst_tensor_. The ranks of _src_tensor_ and _dst_tensor_ must
  match.

* _src_origin_ and _dst_origin_ define the origins of the copy region.
  The length of the arrays must be at least the tensors' rank.

* _region_ defines the extents of the slice being copied. The length
  of the arrays must be at least the tensors' rank.

* _event_wait_list_ and _num_events_in_wait_list_ specify events that
  need to complete before this particular command can be executed, as
  described for *clEnqueueImportFromTensorEXP*.

* _event_ returns an event object that identifies this copy command,
  as described for *clEnqueueImportFromTensorEXP*.

Elements are copied from the source tensor to the destination tensor
so that after completion the following condition holds, expressed in
pseudo C:
[source,c]
----
// 'so' and 'do' are aliases for src_origin and dst_origin, respectively.
tensor_element(dst_tensor, do[0] + i[0], do[1] + i[1], ..., do[N-1] + i[N-1])
==
tensor_element(src_tensor, so[0] + i[0], so[1] + i[1], ..., so[N-1] + i[N-1]);
----

Where N is the tensor rank and i[X] is a tensor coordinate with the
inclusive range 0..<region[X]-1>.

*clEnqueueCopyTensorEXP* returns CL_SUCCESS if the function is
executed successfully. Otherwise, it returns one of the following
errors:

* CL_INVALID_COMMAND_QUEUE if _command_queue_ is not a valid host
  command-queue.

* CL_INVALID_CONTEXT if the context associated with _command_queue_
  and the tensors are not the same or if the context associated with
  _command_queue_ and events in _event_wait_list_ are not the same.

* CL_INVALID_MEM_OBJECT if _src_tensor_ or _dst_tensor_ is not a valid
  buffer object created with CL_MEM_TENSOR_EXP.

* CL_INVALID_VALUE if _src_origin_, _dst_origin_ or _region_ is NULL.

* CL_INVALID_VALUE if _region[i]_ is zero for any i in [0, tensor_rank).

* CL_INVALID_VALUE if origin[i] + region[i] > tensor_shape[i] at any
  dimension i in range [0, tensor_rank).

* CL_INVALID_EVENT_WAIT_LIST if _event_wait_list_ is NULL and
  _num_events_in_wait_list_ > 0, or _event_wait_list_ is not NULL and
  _num_events_in_wait_list_ is 0, or if event objects in
  _event_wait_list_ are not valid events.

* CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST if the read and write
  operations are blocking and the execution status of any of the
  events in _event_wait_list_ is a negative integer value.

* CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate
  memory for the data store associated with the memory object the
  tensor is bound to.

* CL_OUT_OF_RESOURCES if there is a failure to allocate resources
  required by the OpenCL implementation on the device.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources
  required by the OpenCL implementation on the host.
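A short non-normative usage sketch of the copy command follows. It
assumes `src_tensor` and `dst_tensor` are handles to two tensors of
identical shape (64 x 100 x 50) and data type, and copies the first
batch slice between them.

[source,c]
----
/* Sketch: copy the 1 x 100 x 50 slice at batch index 0 from src_tensor
 * to dst_tensor. */
cl_tensor_shape src_origin[3] = {0, 0, 0};
cl_tensor_shape dst_origin[3] = {0, 0, 0};
cl_tensor_shape region[3]     = {1, 100, 50};

cl_int err = clEnqueueCopyTensorEXP(queue,
                                    src_tensor, dst_tensor,
                                    src_origin, dst_origin, region,
                                    0, NULL, NULL);
----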
(Add the following to Section 5.17.5, Recording Commands to a Command-Buffer)

If cl_khr_command_buffer is supported, then the following command
buffer counterparts of the *clEnqueueImportFromTensorEXP* and
*clEnqueueExportToTensorEXP* commands are available.

[source,c]
----
cl_int clCommandImportFromTensorEXP(
  cl_command_buffer_khr command_buffer,
  cl_command_queue command_queue,
  cl_tensor tensor,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  void* host_ptr,
  cl_uint num_sync_points_in_wait_list,
  const cl_sync_point_khr* sync_point_wait_list,
  cl_sync_point_khr* sync_point,
  cl_mutable_command_khr* mutable_handle);

cl_int clCommandExportToTensorEXP(
  cl_command_buffer_khr command_buffer,
  cl_command_queue command_queue,
  cl_tensor tensor,
  const size_t* tensor_origin,
  const size_t* mem_origin,
  const size_t* region,
  const size_t* mem_pitch,
  cl_mem buffer,
  const void* host_ptr,
  cl_uint num_sync_points_in_wait_list,
  const cl_sync_point_khr* sync_point_wait_list,
  cl_sync_point_khr* sync_point,
  cl_mutable_command_khr* mutable_handle);
----

* _command_buffer_ refers to a valid command-buffer object.

* For the _command_queue_, _tensor_, _tensor_origin_, _mem_origin_,
  _region_, _mem_pitch_, _buffer_ and _host_ptr_ parameters, refer to
  *clEnqueueImportFromTensorEXP*.

* For the _num_sync_points_in_wait_list_, _sync_point_wait_list_,
  _sync_point_ and _mutable_handle_ parameters, refer to
  *clCommandCopyBufferKHR*.

*clCommandImportFromTensorEXP* and *clCommandExportToTensorEXP* return
CL_SUCCESS if the function is executed successfully. Otherwise, they
return one of the following errors:

* CL_INVALID_COMMAND_QUEUE if _command_queue_ is not NULL.

* CL_INVALID_COMMAND_BUFFER_KHR if _command_buffer_ is not a valid
  command-buffer.

* CL_INVALID_CONTEXT if the context associated with _command_queue_
  and _command_buffer_ is not the same.

* CL_INVALID_OPERATION if _command_buffer_ has been finalized.

* CL_INVALID_VALUE if _mutable_handle_ is not NULL.

* CL_INVALID_SYNC_POINT_WAIT_LIST_KHR if _sync_point_wait_list_ is
  NULL and _num_sync_points_in_wait_list_ is > 0, or
  _sync_point_wait_list_ is not NULL and
  _num_sync_points_in_wait_list_ is 0, or if synchronization-point
  objects in _sync_point_wait_list_ are not valid
  synchronization-points.

* CL_OUT_OF_RESOURCES if there is a failure to allocate resources
  required by the OpenCL implementation on the device.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources
  required by the OpenCL implementation on the host.
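The following non-normative sketch records a host-to-tensor upload into
a command buffer and replays it. The command-buffer management calls
(clCreateCommandBufferKHR, clFinalizeCommandBufferKHR,
clEnqueueCommandBufferKHR) come from cl_khr_command_buffer; `queue`,
`tensor` and `host_src` are assumed to exist as in the earlier
sketches.

[source,c]
----
/* Sketch: record a packed host -> tensor transfer into a command
 * buffer, then enqueue the command buffer for execution. */
cl_int err;
cl_command_buffer_khr cmdbuf =
  clCreateCommandBufferKHR(1, &queue, NULL, &err);

size_t tensor_origin[3] = {0, 0, 0};
size_t mem_origin[3]    = {0, 0, 0};
size_t region[3]        = {64, 100, 50};

/* command_queue must be NULL for the recording variant (see the error
 * list above). */
err = clCommandExportToTensorEXP(cmdbuf, NULL, tensor,
                                 tensor_origin, mem_origin, region,
                                 NULL /* packed host data */,
                                 NULL /* buffer */, host_src,
                                 0, NULL, NULL, NULL);

err = clFinalizeCommandBufferKHR(cmdbuf);

/* The recorded transfer (plus any other recorded commands) can now be
 * replayed as a unit, any number of times. */
err = clEnqueueCommandBufferKHR(0, NULL, cmdbuf, 0, NULL, NULL);
----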
(Add the following to new Section 5.X.Y, Tensor Descriptions)

The following structure describes the properties of a tensor to be
created with clCreateBufferWithProperties() using the
CL_MEM_TENSOR_EXP property:

[source,c]
----
typedef struct cl_tensor_desc_exp {
    cl_uint                   rank;
    cl_tensor_datatype        dtype;
    cl_tensor_properties_exp  properties[CL_TENSOR_DESC_MAX_PROPERTIES_EXP];
    cl_tensor_shape           shape[CL_TENSOR_DESC_MAX_RANK_EXP];
    const void*               layout;
    cl_tensor_layout_type_exp layout_type;
} cl_tensor_desc_exp;
----

* _rank_ defines the tensor's rank - the number of dimensions.

* _dtype_ defines the data type of the elements in the tensor.
  Possible types are listed in the tensor element type table.

* _properties_ is an optional list of properties for the tensor object
  and their corresponding values. The list is terminated with the
  special property 0. If no properties are required, _properties_ may
  be NULL. This extension does not define any optional properties for
  tensors, but future extensions may define properties.

* _shape_ defines the extents of the tensor's dimensions in number of
  elements.

* _layout_ points to an optional structure describing how tensor
  elements are laid out in the buffer memory. The structure must be of
  a type corresponding to the _layout_type_, as listed in the tensor
  layout type table. The pointer is ignored if the _layout_type_ is
  CL_TENSOR_LAYOUT_OPAQUE_EXP.

* _layout_type_ indicates the layout structure type that _layout_
  points to.
.Tensor element types. The API type indicates the corresponding type for copying elements from a host allocation / buffer object to a tensor or vice versa.
[cols="2,2,1",stripes=odd]
|===
| *Tensor element data type* | *Description* | *API type*

| CL_TENSOR_DTYPE_BOOL | Data type representing true or false. | cl_uchar. [1]
| CL_TENSOR_DTYPE_INT4_EXP | 4-bit signed integer. | cl_char.
| CL_TENSOR_DTYPE_INT8_EXP | 8-bit signed integer. | cl_char.
| CL_TENSOR_DTYPE_INT16_EXP | 16-bit signed integer. | cl_short.
| CL_TENSOR_DTYPE_INT32_EXP | 32-bit signed integer. | cl_int.
| CL_TENSOR_DTYPE_INT64_EXP | 64-bit signed integer. | cl_long.
| CL_TENSOR_DTYPE_UINT8_EXP | 8-bit unsigned integer. | cl_uchar.
| CL_TENSOR_DTYPE_UINT16_EXP | 16-bit unsigned integer. | cl_ushort.
| CL_TENSOR_DTYPE_UINT32_EXP | 32-bit unsigned integer. | cl_uint.
| CL_TENSOR_DTYPE_UINT64_EXP | 64-bit unsigned integer. | cl_ulong.
| CL_TENSOR_DTYPE_FP8_EXP | 8-bit floating-point. | cl_char.
| CL_TENSOR_DTYPE_FP16_EXP | Half precision floating-point. | cl_half.
| CL_TENSOR_DTYPE_BFLOAT16_EXP | 16-bit brain floating-point. | cl_ushort
| CL_TENSOR_DTYPE_FP32_EXP | Single precision floating-point. | cl_float.
| CL_TENSOR_DTYPE_FP64_EXP | Double precision floating-point. | cl_double.
| CL_TENSOR_DTYPE_COMPLEX64_EXP | 64-bit complex floating-point with
  32-bit real and imaginary part. | cl_float2
| CL_TENSOR_DTYPE_COMPLEX128_EXP | 128-bit complex floating-point with
  64-bit real and imaginary part. | cl_double2
|===
.Optional tensor memory layout types
[cols="2,2,2",stripes=odd]
|===
| *Layout type* | *Tensor layout structure* | *Description*

| CL_TENSOR_LAYOUT_OPAQUE_EXP | N/A | The tensor does not have an
  application-defined memory layout. The driver controls the tensor's
  layout. To read or write elements of the tensor, use the tensor
  import and export functions.

| CL_TENSOR_LAYOUT_BLAS_EXP | cl_tensor_layout_blas_exp | A type that
  describes a packed memory layout similar to ones used in BLAS APIs.

| CL_TENSOR_LAYOUT_BLAS_PITCHED_EXP | cl_tensor_layout_blas_pitched_exp
| A type that describes a pitched memory layout similar to ones used
  in BLAS APIs.

| CL_TENSOR_LAYOUT_ML_EXP | cl_tensor_layout_ml_exp | A convenience
  layout type over CL_TENSOR_LAYOUT_BLAS_EXP.
|===
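As a small illustration of the convenience layout, the following
non-normative sketch describes a rank-2 tensor using the predefined NC
layout instead of spelling out a BLAS layout by hand; the NC / {1}
equivalence is taken from the ML layout table later in this section.

[source,c]
----
/* Sketch: request the predefined NC layout for a rank-2 tensor.
 * Equivalent to a cl_tensor_layout_blas_exp with leading_dims = {1}. */
cl_tensor_layout_ml_exp nc_layout = { CL_TENSOR_LAYOUT_ML_NC_EXP };

cl_tensor_desc_exp desc = {0};
desc.rank = 2;
desc.dtype = CL_TENSOR_DTYPE_FP32_EXP;
desc.shape[0] = 100;   /* N */
desc.shape[1] = 200;   /* C */
desc.layout = &nc_layout;
desc.layout_type = CL_TENSOR_LAYOUT_ML_EXP;
----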
(Add the following to new Section 5.X.Y.1, BLAS Tensor Layout)

The following structures describe packed / pitched BLAS-like memory
layouts for the tensor:

[source,c]
----
typedef struct cl_tensor_layout_blas_exp {
    cl_tensor_dim_exp    leading_dims[CL_TENSOR_DESC_MAX_RANK_EXP];
} cl_tensor_layout_blas_exp;

typedef struct cl_tensor_layout_blas_pitched_exp {
    cl_tensor_dim_exp    leading_dims[CL_TENSOR_DESC_MAX_RANK_EXP];
    cl_tensor_pitch      leading_pitches[CL_TENSOR_DESC_MAX_RANK_EXP];
} cl_tensor_layout_blas_pitched_exp;

typedef struct cl_tensor_layout_ml_exp {
    cl_tensor_layout_ml_type_exp ml_type;
} cl_tensor_layout_ml_exp;
----

* _leading_dims_ describes the order in which the tensor dimensions
  are laid out in memory. _leading_dims[0]_ points to the dimension
  whose elements are laid out first, followed by elements along the
  dimension given by _leading_dims[1]_, and so on. The first N values,
  where N is the tensor's rank, must be unique and within the range
  [0, tensor_rank).

* _leading_pitches_ describes the distance from an element to the next
  one along the leading dimensions in _leading_dims_. The distance is
  measured in number of elements. The values of the array must be
  non-zero for the first tensor rank minus one elements, and the
  following conditions must hold:

** leading_pitches[0] >= tensor_shape[leading_dims[0]] if the tensor
   rank is greater than one, and

** leading_pitches[i + 1] >= tensor_shape[leading_dims[i]] *
   leading_pitches[i] for i in [0, tensor_rank - 1) if the tensor rank
   is greater than two.

* _ml_type_ defines the memory layout via enumerators which correspond
  to predefined configurations of the cl_tensor_layout_blas_exp
  structure, as listed in the ML tensor layout type table.
cl_int clCommandImportFromTensorKHR(
-  cl_command_buffer_khr command_buffer,
-  cl_command_queue command_queue,
-  cl_tensor tensor,
-  const size_t* tensor_origin,
-  const size_t* mem_origin,
-  const size_t* region,
-  const size_t* mem_pitch,
-  cl_mem buffer,
-  void* host_ptr,
-  cl_uint num_sync_points_in_wait_list,
-  const cl_sync_point_khr* sync_point_wait_list,
-  cl_sync_point_khr* sync_point,
-  cl_mutable_command_khr* mutable_handle);
-
+

The memory layout descriptions map tensor coordinates to buffer’s +memory byte locations respect to buffer’s base address as followed in +pseudo C:

-
cl_int clCommandExportToTensorKHR(
-  cl_command_buffer_khr command_buffer,
-  cl_command_queue command_queue,
-  cl_tensor tensor,
-  const size_t* tensor_origin,
-  const size_t* mem_origin,
-  const size_t* region,
-  const size_t* mem_pitch,
-  cl_mem buffer,
-  const void* host_ptr,
-  cl_uint num_sync_points_in_wait_list,
-  const cl_sync_point_khr* sync_point_wait_list,
-  cl_sync_point_khr* sync_point,
-  cl_mutable_command_khr* mutable_handle);
-
+
size_t index = 0;
+for (unsigned i = 0; i < tensor_rank - 1; i++)
+  index += tensor_coordinates[leading_dims[i]] * pitches[i];
+buffer_offset = index * tensor_element_size;
-
-
    -
  • -

    command_buffer refers to valid command-buffer object.

    -
  • -
  • -

    For command_queue, tensor, tensor_origin, mem_origin, -region, mem_pitch, buffer and host_ptr parameters refer to -clEnqueueImportFromTensor.

    -
  • -
  • -

    For num_sync_points_in_wait_list, sync_point_wait_list, -sync_point, mutable_handle parameters refer to -clCommandCopyBufferKHR.

    -
  • -
-
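For illustration, a minimal sketch of recording an export-to-tensor command followed by a kernel that consumes the tensor. The names cmd_buf, tensor, kernel and host_data are assumed to be created elsewhere and error handling is omitted; this is illustrative only:

[source,c]
----
/* Minimal sketch: record a host-to-tensor transfer and a dependent kernel
 * into a command buffer. cmd_buf, tensor, kernel and host_data are assumed
 * to exist; error handling is omitted for brevity. */
size_t origin[3] = {0, 0, 0};
size_t region[3] = {64, 100, 50};
size_t global_size[3] = {64, 100, 50};
cl_sync_point_khr export_done;

clCommandExportToTensorKHR(
  cmd_buf, NULL /* command_queue must be NULL here */, tensor,
  origin, origin, region,
  NULL /* mem_pitch: tightly packed */, NULL /* buffer */, host_data,
  0, NULL, &export_done, NULL);

clCommandNDRangeKernelKHR(
  cmd_buf, NULL, NULL /* properties */, kernel, 3,
  NULL /* global offset */, global_size, NULL /* local size */,
  1, &export_done, NULL, NULL);
----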

clCommandImportFromTensorKHR and clCommandExportToTensorKHR return CL_SUCCESS if the function is executed successfully. Otherwise, they return one of the following errors:

* CL_INVALID_COMMAND_QUEUE if command_queue is not NULL.

* CL_INVALID_COMMAND_BUFFER_KHR if command_buffer is not a valid command-buffer.

* CL_INVALID_CONTEXT if the context associated with command_queue and command_buffer is not the same.

* CL_INVALID_OPERATION if command_buffer has been finalized.

* CL_INVALID_VALUE if mutable_handle is not NULL.

* CL_INVALID_SYNC_POINT_WAIT_LIST_KHR if sync_point_wait_list is NULL and num_sync_points_in_wait_list is > 0, or sync_point_wait_list is not NULL and num_sync_points_in_wait_list is 0, or if synchronization-point objects in sync_point_wait_list are not valid synchronization-points.

* CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.

The cl_tensor_layout_blas_exp and cl_tensor_layout_blas_pitched_exp structures describe a packed or pitched BLAS-like memory layout for a tensor through the following fields:

* leading_dims describes which elements along the tensor dimensions are laid out in memory. leading_dims[0] points to the dimension whose elements are laid out first, followed by elements along the dimension given by leading_dims[1] and so on. The first N elements must be non-zero, where N is the tensor's rank, and the values must be unique and within the range [0, tensor_rank).

* leading_pitches describes the distance from an element to the next one along the leading dimensions in leading_dims. The distance is measured in number of elements. The first N elements of the array must be non-zero, where N is the tensor's rank minus one, and the following conditions must hold:

** leading_pitches[0] >= tensor_shape[leading_dims[0]] if the tensor rank is greater than one, and

** leading_pitches[i + 1] >= tensor_shape[leading_dims[i]] * leading_pitches[i] for i in [0, tensor_rank - 1) if the tensor rank is greater than two.

* ml_type defines the memory layout via enumerators which correspond to predefined configurations of the cl_tensor_layout_blas_exp structure, as listed in the ML tensor layout type table below.

The memory layout descriptions map tensor coordinates to byte locations in the buffer's memory, relative to the buffer's base address, as in the following pseudo C code:

[source,c]
----
size_t index = 0;
for (unsigned i = 0; i < tensor_rank - 1; i++)
  index += tensor_coordinates[leading_dims[i]] * pitches[i];
buffer_offset = index * tensor_element_size;
----

where pitches[i] equals:

* leading_pitches[i] for cl_tensor_layout_blas_pitched_exp, and

* tensor_shape[leading_dims[i]] * tensor_shape[leading_dims[i-1]] * ... * tensor_shape[leading_dims[0]] for cl_tensor_layout_blas_exp.

Table 3. ML tensor layout types and their corresponding cl_tensor_layout_blas_exp configuration.

|===
| ML layout type | Equivalent leading_dims configuration

| CL_TENSOR_LAYOUT_ML_C_EXP    | {}
| CL_TENSOR_LAYOUT_ML_NC_EXP   | {1}
| CL_TENSOR_LAYOUT_ML_CN_EXP   | {0}
| CL_TENSOR_LAYOUT_ML_HW_EXP   | {1}
| CL_TENSOR_LAYOUT_ML_CHW_EXP  | {2, 1}
| CL_TENSOR_LAYOUT_ML_NCHW_EXP | {3, 2, 1}
| CL_TENSOR_LAYOUT_ML_NHWC_EXP | {1, 3, 2}
|===
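As a concrete illustration of the table above: for a rank-4 NCHW tensor, CL_TENSOR_LAYOUT_ML_NCHW_EXP corresponds to leading_dims = {3, 2, 1}, so the W dimension (index 3) is laid out first and is contiguous. The sketch below shows the resulting packed offset computation under the assumption that the remaining dimension (N, index 0) varies slowest; it is illustrative only and not normative text:

[source,c]
----
/* Packed NCHW offset sketch, assuming leading_dims = {3, 2, 1} as in the
 * CL_TENSOR_LAYOUT_ML_NCHW_EXP row above and that the remaining dimension
 * (N) is the slowest-varying one. */
#include <stddef.h>

static size_t nchw_byte_offset(const size_t shape[4],  /* {N, C, H, W} */
                               const size_t coord[4],  /* {n, c, h, w} */
                               size_t element_size)
{
  size_t W = shape[3], H = shape[2], C = shape[1];
  /* W elements first (stride 1), then H (stride W), then C (stride H*W),
   * then N (stride C*H*W). */
  size_t index = coord[3]
               + coord[2] * W
               + coord[1] * (H * W)
               + coord[0] * (C * H * W);
  return index * element_size;
}
----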

Add New Buffer Property in Section 5.2.1

|===
| Buffer Property | Property Value | Description

| CL_MEM_COMMAND_BUFFER_TEMPORARY | cl_bool a|
This property can be set if the cl_khr_command_buffer extension is supported.

If the value is true, create a "temporary" buffer object that can only be used by commands recorded in command buffers. Non-recording command enqueue functions must return CL_INVALID_OPERATION if the command refers to a temporary buffer object.

The temporary buffer objects are managed by command buffers. When a temporary buffer object is used by multiple command buffers, the object receives disjoint storage for each command buffer.

Storage for the temporary buffer objects may be allocated on an on-demand basis. At times when the buffer is not needed, OpenCL implementations may reuse the storage for other tasks within the command buffer.

Contents of the temporary buffers are not guaranteed to be preserved across command buffer executions.

NOTE: This property temporarily lives here and will be moved to a separate extension proposal.

| CL_MEM_BIND_TO_TENSOR | cl_tensor a|
Use the created buffer as storage for the given valid tensor. For the buffer creation to succeed, the target tensor may not have storage already, and the size argument of clCreateBufferWithProperties() must be zero.

The size of the memory buffer is implementation-defined and can be queried with clGetTensorInfo().

The memory layout of the tensor in the created memory buffer is implementation-defined and opaque to applications, and it may change at unspecified points. The implementation may use non-contiguous allocations to store the tensor data and may store auxiliary data within the allocations. Therefore, reading from or writing to the memory buffer directly through the cl_mem handle leads to undefined behavior.

If the tensor is already bound to a buffer object, the clCreateBufferWithProperties call returns the CL_TENSOR_BOUND_TO_BUFFER error code.
|===
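For illustration, a minimal sketch of binding a buffer object to an existing tensor with CL_MEM_BIND_TO_TENSOR. The names ctx and tensor are assumed to exist, the property-value cast is an assumption about how a cl_tensor handle is passed in a property list, and error handling is omitted:

[source,c]
----
/* Minimal sketch: bind a buffer object to a previously created tensor.
 * ctx and tensor are assumed to exist; error handling is omitted. */
cl_int err;
cl_mem_properties props[] = {
  CL_MEM_BIND_TO_TENSOR, (cl_mem_properties)(uintptr_t)tensor, 0};
cl_mem tensor_storage = clCreateBufferWithProperties(
  ctx, props, CL_MEM_READ_WRITE,
  0 /* size must be zero for CL_MEM_BIND_TO_TENSOR */, NULL, &err);

/* A second attempt to bind the same tensor would fail with
 * CL_TENSOR_BOUND_TO_BUFFER (see the new error codes below). */
----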

Add New Memory Object Query in Section 5.5.5

|===
| cl_mem_info | Return Type | Description

| CL_MEM_COMMAND_BUFFER_TEMPORARY | cl_bool a|
This property can be queried if the cl_khr_command_buffer extension is supported.

Returns true if memobj is a temporary buffer object for command buffers.
|===

Add New Error Codes in Appendix F

|===
| Error Code | Description

| CL_TENSOR_BOUND_TO_BUFFER | Returned when attempting to bind a buffer object to a tensor which has already been bound to the same or another buffer object.

| CL_INVALID_TENSOR | Returned when the specified tensor is not a valid tensor object.
|===
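The new memory object query can be used, for example, to check whether a buffer is managed by command buffers. A minimal sketch, assuming memobj is a valid buffer object and omitting error handling:

[source,c]
----
/* Minimal sketch: query whether memobj is a command-buffer temporary.
 * memobj is assumed to be a valid cl_mem. */
cl_bool is_temporary = CL_FALSE;
clGetMemObjectInfo(memobj, CL_MEM_COMMAND_BUFFER_TEMPORARY,
                   sizeof(is_temporary), &is_temporary, NULL);
if (is_temporary) {
  /* The buffer can only be used by commands recorded in command buffers. */
}
----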

Sample Codes

Helper functions used in the follow-up tensor code samples:

[source,c]
----
cl_kernel create_matmul_kernel(
  cl_context ctx, std::span<cl_device_id> device_span,
  cl_tensor lhs, cl_tensor rhs, cl_tensor out) {
  // A hypothetical matmul kernel signature in pseudo OpenCL C for
  // illustrative purposes:
  //
  //   kernel void matmul(global read_only tensor_t, global read_only tensor_t,
  //                      global write_only tensor_t);

  cl_kernel matmul_kernel = /* Omitted. */;
  clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor), &lhs);
  clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor), &rhs);
  clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor), &out);
  return matmul_kernel;
}

cl_kernel create_add_kernel(
  cl_context ctx, std::span<cl_device_id> device_span,
  cl_tensor lhs, cl_tensor rhs, cl_tensor out) {
  // A hypothetical add kernel signature in pseudo OpenCL C for illustrative
  // purposes:
  //
  //   kernel void add(global read_only tensor_t, global read_only tensor_t,
  //                   global write_only tensor_t);

  cl_kernel add_kernel = /* Omitted. */;
  clSetKernelArg(add_kernel, 0, sizeof(cl_tensor), &lhs);
  clSetKernelArg(add_kernel, 1, sizeof(cl_tensor), &rhs);
  clSetKernelArg(add_kernel, 2, sizeof(cl_tensor), &out);
  return add_kernel;
}
----

An example usage of tensors on a command queue:

[source,c]
----
constexpr size_t b = 64, m = 100, n = 200, k = 50;

cl_int err;
cl_tensor in0 = clCreateTensor(ctx, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, &err);
cl_tensor in1 = clCreateTensor(ctx, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, &err);
cl_tensor in2 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor t0  = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor out = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);

cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0);
cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out);

// Allocate storage for the tensors. The buffer size must be set to
// zero when the buffer is bound to a tensor. The OpenCL implementation
// may determine the optimal data layout and the storage needed for it,
// based on the tensor's uses (the 'matmul' and 'add' kernels in this
// sample) so far.
cl_mem in0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, in0, 0}, CL_MEM_READ_ONLY,
  0 /* must be zero for CL_MEM_BIND_TO_TENSOR. */, nullptr, &err);
cl_mem in1_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, in1, 0}, CL_MEM_READ_ONLY,
  0, nullptr, &err);
cl_mem in2_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, in2, 0}, CL_MEM_READ_ONLY,
  0, nullptr, &err);
cl_mem t0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE,
  0, nullptr, &err);
cl_mem out_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, out, 0}, CL_MEM_WRITE_ONLY,
  0, nullptr, &err);

std::vector<float> in0_data = ...;
std::vector<float> in1_data = ...;
std::vector<float> out_data(b * m * n);

// Copies data into the in0 tensor while possibly rearranging the data to the
// optimal data layout.
clEnqueueExportToTensor(
  cmd_q, in0, false, {0, 0, 0}, {0, 0, 0}, {b, m, k},
  nullptr, nullptr, in0_data.data(), 0, nullptr, nullptr);
clEnqueueExportToTensor(
  cmd_q, in1, false, {0, 0, 0}, {0, 0, 0}, {b, k, n},
  nullptr, nullptr, in1_data.data(), 0, nullptr, nullptr);
clEnqueueNDRangeKernel(
  cmd_q, matmul_kernel, 3, matmul_grid, nullptr, nullptr, 0, nullptr, nullptr);
clEnqueueNDRangeKernel(
  cmd_q, add_kernel, 3, add_grid, nullptr, nullptr, 0, nullptr, nullptr);
clEnqueueImportFromTensor(
  cmd_q, out, false, {0, 0, 0}, {0, 0, 0}, {b, m, n},
  nullptr, nullptr, out_data.data(), 0, nullptr, nullptr);
----

An example use of tensors in a command buffer when the cl_khr_command_buffer extension is supported:

[source,c]
----
constexpr size_t b = 64, m = 100, n = 200, k = 50;

cl_int err;
cl_tensor in0 = clCreateTensor(ctx, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, &err);
cl_tensor in1 = clCreateTensor(ctx, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, &err);
cl_tensor in2 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor t0  = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor out = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);

cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0);
cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out);

// Bind command buffer managed storage to the tensors.
//
// NOTE: the same temporary tensor handle used in multiple command buffers
//       will have separate storage. IOW, command buffers may not exchange
//       data between them via temporary buffers.
cl_mem in0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in0, 0},
  CL_MEM_READ_ONLY, 0 /* must be zero for CL_MEM_BIND_TO_TENSOR. */,
  nullptr, &err);
cl_mem in1_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in1, 0},
  CL_MEM_READ_ONLY, 0, nullptr, &err);
cl_mem in2_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in2, 0},
  CL_MEM_READ_ONLY, 0, nullptr, &err);
cl_mem t0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, t0, 0},
  CL_MEM_READ_WRITE, 0, nullptr, &err);
cl_mem out_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, out, 0},
  CL_MEM_WRITE_ONLY, 0, nullptr, &err);

std::vector<float> in0_data = ...;
std::vector<float> in1_data = ...;
std::vector<float> out_data(b * m * n);

cl_command_buffer_khr cmd_b =
  clCreateCommandBufferKHR(num_queues, queue_list, nullptr, &err);

cl_sync_point_khr in0_syncp, in1_syncp, matmul_syncp, add_syncp;
clCommandExportToTensorKHR(
  cmd_b, cmd_q, in0, {0, 0, 0}, {0, 0, 0}, {b, m, k},
  nullptr, nullptr, in0_data.data(), 0, nullptr, &in0_syncp);
clCommandExportToTensorKHR(
  cmd_b, cmd_q, in1, {0, 0, 0}, {0, 0, 0}, {b, k, n},
  nullptr, nullptr, in1_data.data(), 0, nullptr, &in1_syncp);
clCommandNDRangeKernelKHR(
  cmd_b, cmd_q, nullptr, matmul_kernel, 3, matmul_grid, nullptr, nullptr,
  2, {in0_syncp, in1_syncp}, &matmul_syncp, nullptr);
clCommandNDRangeKernelKHR(
  cmd_b, cmd_q, nullptr, add_kernel, 3, add_grid, nullptr, nullptr,
  1, {matmul_syncp}, &add_syncp, nullptr);
clCommandImportFromTensorKHR(
  cmd_b, cmd_q, out, {0, 0, 0}, {0, 0, 0}, {b, m, n},
  nullptr, nullptr, out_data.data(), 1, {add_syncp}, nullptr);

// Finalize the command buffer. At this point the OpenCL
// implementation may reserve enough storage for all the tensor
// temporaries. Temporary tensors might be eliminated - for example,
// the OpenCL implementation could use the 'out' tensor to store the result
// of matmul_kernel, thus eliminating the need for the 't0' tensor.
clFinalizeCommandBufferKHR(cmd_b);

// Temporary tensors used in a command buffer can't be read or written
// into. A hypothetical reason is that the finalized command buffer
// might not use some of the tensors.
assert(clEnqueueImportFromTensor(..., t0, ...) == CL_INVALID_OPERATION);
----

An example usage of tensors created with tensor descriptors (cl_tensor_desc_exp):

[source,c]
----
constexpr size_t b = 64, m = 100, n = 200, k = 50;

std::vector<float> in0_data = ...;
std::vector<float> in1_data = ...;
std::vector<float> out_data(b * m * n);

// Create a tensor with an opaque layout.
cl_tensor_desc_exp in0_desc;
in0_desc.rank = 3;
in0_desc.properties[0] = 0;
in0_desc.shape[0] = b;
in0_desc.shape[1] = m;
in0_desc.shape[2] = k;
in0_desc.layout = nullptr;
in0_desc.layout_type = CL_TENSOR_LAYOUT_OPAQUE_EXP;

cl_int err;
cl_mem in0_tensor = clCreateBufferWithProperties(
  ctx, {CL_MEM_TENSOR_EXP, in0_desc, 0},
  CL_MEM_READ_ONLY, 0, nullptr, &err);

// Create a tensor from a host allocation using an application-defined
// layout description for mapping elements to the tensor.
cl_tensor_desc_exp in1_desc;
in1_desc.rank = 3;
in1_desc.properties[0] = 0;
in1_desc.shape[0] = b;
in1_desc.shape[1] = k;
in1_desc.shape[2] = n;

cl_tensor_layout_blas_exp col_major;
col_major.leading_dims[0] = 1;
col_major.leading_dims[1] = 2;
in1_desc.layout = &col_major;
in1_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;

cl_mem in1_tensor = clCreateBufferWithProperties(
  ctx, {CL_MEM_TENSOR_EXP, in1_desc, 0},
  CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, in1_data.data(), &err);

// Create another tensor with an application-defined layout.
cl_tensor_desc_exp out_desc;
out_desc.rank = 3;
out_desc.properties[0] = 0;
out_desc.shape[0] = b;
out_desc.shape[1] = m;
out_desc.shape[2] = n;

cl_tensor_layout_blas_exp row_major;
row_major.leading_dims[0] = 2;
row_major.leading_dims[1] = 1;
out_desc.layout = &row_major;
out_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;

cl_mem out_tensor = clCreateBufferWithProperties(
  ctx, {CL_MEM_TENSOR_EXP, out_desc, 0},
  CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, 0, out_data.data(), &err);

// Create a kernel that operates on the tensors and is possibly
// optimized for them via a yet-to-be-realized API extension.
cl_kernel batched_matmul_kernel = create_batched_matmul_kernel(
  ctx, device_span, in0_desc, in1_desc, out_desc);

clSetKernelArg(batched_matmul_kernel, 0, sizeof(cl_mem), &in0_tensor);
clSetKernelArg(batched_matmul_kernel, 1, sizeof(cl_mem), &in1_tensor);
clSetKernelArg(batched_matmul_kernel, 2, sizeof(cl_mem), &out_tensor);

// Required command for transferring data to layout-opaque tensors and
// from them elsewhere.
clEnqueueExportToTensor(
  cmd_q, in0_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, k},
  nullptr, nullptr, in0_data.data(), 0, nullptr, nullptr);

clEnqueueNDRangeKernel(
  cmd_q, batched_matmul_kernel, 3, matmul_grid, nullptr, nullptr,
  0, nullptr, nullptr);

clEnqueueMapBuffer(
  cmd_q, out_tensor, CL_TRUE, CL_MAP_READ, 0, b * m * n * sizeof(float),
  0, nullptr, nullptr, &err);
----

Open Questions

* RESOLVED: OpenCL C support for tensors can be introduced later in a separate extension. Built-in kernels may benefit from this extension as it is.

* What is the use case of cl_tensor_layout_blas_pitch_exp? UNRESOLVED.

* Should image types be extended instead of adding a separate tensor type? UNRESOLVED.

Version History

|====
| Version | Date | Author | Changes

| 0.1.0 | 2023-11-23 | Henry Linjamäki | Initial revision.

| 0.2.0 | 2024-8-14 | Henry Linjamäki a|
* Reworked the document structure to match the cl_khr_extension_template.
* Added clEnqueueCopyTensor.
* Added an API for setting the memory layout of tensors.
|====

1. Zero and non-zero bytes are interpreted as false and true values, respectively.
    From 4586eefc17fcbf38cce4bdc6ef29360ff16a1cda Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 15 Aug 2024 12:21:05 +0300 Subject: [PATCH 21/26] Update extensions/cl_exp_tensor.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Pekka Jääskeläinen --- extensions/cl_exp_tensor.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/extensions/cl_exp_tensor.asciidoc b/extensions/cl_exp_tensor.asciidoc index 002ff366..774c1857 100644 --- a/extensions/cl_exp_tensor.asciidoc +++ b/extensions/cl_exp_tensor.asciidoc @@ -6,7 +6,7 @@ = cl_exp_tensor -This extension provides new buffer abstraction - tensor objects - for +This extension provides a new buffer abstraction, tensor objects, for managing N-dimensional data. == XXX - Not complete yet!!! From 19116009dfa676d948adace46eca4a222c5f74a0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 15 Aug 2024 12:31:26 +0300 Subject: [PATCH 22/26] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Pekka Jääskeläinen --- extensions/cl_exp_tensor.asciidoc | 56 +++++++++++++++---------------- 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/extensions/cl_exp_tensor.asciidoc b/extensions/cl_exp_tensor.asciidoc index 774c1857..5c33f277 100644 --- a/extensions/cl_exp_tensor.asciidoc +++ b/extensions/cl_exp_tensor.asciidoc @@ -22,7 +22,7 @@ TODO == Contributors Henry Linjamäki, Intel. + -Pekka Jääslkeläinen, Intel and Tampere University. + +Pekka Jääskeläinen, Intel. + Ben Ashbaugh, Intel. + == Notice @@ -46,30 +46,30 @@ This extension requires OpenCL 1.2 or later. == Overview -The extension provides new tensor object abstraction. Tensor objects -are similar to image types in regard they represents N-dimensional -data of some application chosen data type and they may be mapped to -dedicated hardware except that +The extension provides a new tensor object abstraction. Tensor objects +are similar to image types in regard that they represent N-dimensional +data of an application chosen data type and they may be mapped to +dedicated hardware, with the following key differences: -* higher than 3-dimensional data can be supported (limited by +* Higher than 3-dimensional data can be supported (limited by devices' capabilities). -* applications may choose how the data elements of the tensors are +* Applications may choose how the data elements of the tensors are laid out in the buffers using the tensor layout descriptions provided in this extension. -Applications may also choose the memory layouts of the tensors be +Applications may also choose the memory layouts of the tensors to be implementation-specified, letting the driver to optimize the tensor data layout for better performance or to lay out the data as required by -hardware functions (e.g. exposed via builtin kernels). +hardware accelerated functions (e.g. exposed via builtin kernels). -The scope of this extension to provide host APIs for creating tensor +The scope of this extension is to provide host APIs for creating tensor objects and transfer data between tensors, host and other memory objects. A separate extension implemented on top of this extension, -cl_exp_defined_builtin_kernels which provides "defined built-in -kernels" (DKBs) which operates on tensors. 
It also provides mechanism +cl_exp_defined_builtin_kernels provides "defined built-in +kernels" (DKBs) which can operate on tensors. It also provides mechanism for drivers to create DBKs that are optimized for the tensor arguments they operate on. @@ -184,7 +184,7 @@ typedef struct cl_tensor_layout_ml_exp { == New API Enums -Accepted value for _properties_ parameter to +Accepted value for the _properties_ parameter to *clCreateBufferWithProperties* for creating a tensor object: [source,c] @@ -812,13 +812,13 @@ and true values, respectively.] |=== | *layout type* | *tensor layout type* | *Description* -| CL_TENSOR_LAYOUT_OPAQUE_EXP | N/A | The tensor don't have application +| CL_TENSOR_LAYOUT_OPAQUE_EXP | N/A | The tensor doesn't have application defined memory layout. Driver controls the tensors layout. To read or write elements of the tensor | CL_TENSOR_LAYOUT_BLAS_EXP |<> -| A type that describe packed memory layout similar ones used in BLAS APIs. +| A type that describes a packed memory layout similar ones used in BLAS APIs. | CL_TENSOR_LAYOUT_BLAS_EXP |<> @@ -837,7 +837,7 @@ A convenience layout type over `CL_TENSOR_LAYOUT_BLAS_EXP`. (Add the following to new Section 5.X.Y.1, *BLAS Tensor Layout*) :: + -- -The following structures describes packed / pitched BLAS-like memory +The following structures describe packed / pitched BLAS-like memory layout for the tensor: [source,c] @@ -857,13 +857,13 @@ typedef struct cl_tensor_layout_ml_exp { ---- * _leading_dims_ describes which elements along the tensor dimension - are laid out in the memory. `leading_dims[0]` point to dimension + are laid out in the memory. `leading_dims[0]` points to the dimension whose elements are laid out first, followed by elements along - dimension by `leading_dims[1]` and so on. The first N elements must - be non-zero where N is tensor's rank and the values must be unique + the dimension by `leading_dims[1]` and so on. The first N elements must + be non-zero where N is a tensor's rank and the values must be unique and within range `[0, tensor_rank)`. -* _leading_pitches_ describes distance between from an element to the +* _leading_pitches_ describes the distance between an element to the next one for the leading dimensions in _leading_dims_. The distance is measured in number of elements. The first N elements must be non-zero where the N is tensor's rank minus one. The values of the @@ -880,7 +880,7 @@ typedef struct cl_tensor_layout_ml_exp { // ^ This condition is meant to ensure that the tensor elements at different // coordinates don't alias. -* _ml_type_ defines memory layout via enumerators which corresponds to +* _ml_type_ defines the memory layout via enumerators which corresponds to predefined configurations of `cl_tensor_layout_blas_exp` structure as listed in <> table. @@ -933,7 +933,7 @@ std::vector in0_data = ...; std::vector in1_data = ...; std::vector out_data(b * m * n); -// Create tensor with opaque layout. +// Create a tensor with an opaque layout. cl_tensor_desc_exp in0_desc; in0_desc.rank = 3; in0_desc.properties[0] = 0; @@ -948,7 +948,7 @@ cl_mem in0_tensor = clCreateBufferWithProperties( ctx, {CL_MEM_TENSOR_EXP, in0_desc, 0}, CL_MEM_READ_ONLY, 0, nullptr, &err); -// Create tensor from a host allocation using an application defined +// Create tensor from a host allocation using an application-defined // layout description for mapping elements to the tensor. 
cl_tensor_desc_exp in1_desc; in1_desc.rank = 3; @@ -967,7 +967,7 @@ cl_mem in1_tensor = clCreateBufferWithProperties( ctx, {CL_MEM_TENSOR_EXP, in1_desc, 0}, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, in1_data.data(), &err); -// Create another tensor with application defined layout. +// Create another tensor with an application-defined layout. cl_tensor_desc_exp out_desc; out_desc.rank = 3; out_desc.properties[0] = 0; @@ -995,7 +995,7 @@ clSetKernelArg(batched_matmul_kernel, 1, sizeof(cl_mem), &in1_tensor); clSetKernelArg(batched_matmul_kernel, 2, sizeof(cl_mem), &out_tensor); // Required command for transferring data to layout-opaque tensors and -// from it to elsewhere. +// from it elsewhere. clEnqueueExportToTensor( cmd_q, in0_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, k}, nullptr, nullptr, in0_data.data(), 0, nullptr, nullptr); @@ -1008,9 +1008,9 @@ clEnqueueMapBuffer( ---- -== Issues +== Issues and Open Questions -. Should we have support for tensors with undefined shape and tensors +. Should we support tensors with undefined shape and tensors with unknown / symbolic dimension sizes like in ONNX? + -- @@ -1053,6 +1053,6 @@ clEnqueueMapBuffer( * Added clEnqueueCopyTensor. -* Added API for setting memory layout for tensors. +* Added an API for setting the memory layout for tensors. |==== From af6d58c249f2cd7729f75e9c63f19b355b570c11 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 15 Aug 2024 13:47:03 +0300 Subject: [PATCH 23/26] Address some feedback, fix formatting --- extensions/cl_exp_tensor.asciidoc | 44 ++++++++++++++++++++++--------- 1 file changed, 32 insertions(+), 12 deletions(-) diff --git a/extensions/cl_exp_tensor.asciidoc b/extensions/cl_exp_tensor.asciidoc index 5c33f277..f7a2b9c0 100644 --- a/extensions/cl_exp_tensor.asciidoc +++ b/extensions/cl_exp_tensor.asciidoc @@ -220,7 +220,8 @@ CL_TENSOR_DTYPE_UINT16_EXP 0x???? CL_TENSOR_DTYPE_UINT32_EXP 0x???? CL_TENSOR_DTYPE_UINT64_EXP 0x???? -CL_TENSOR_DTYPE_FP8_EXP 0x???? +CL_TENSOR_DTYPE_FP8E4M3_EXP 0x???? +CL_TENSOR_DTYPE_FP8E5M2_EXP 0x???? CL_TENSOR_DTYPE_FP16_EXP 0x???? CL_TENSOR_DTYPE_FP32_EXP 0x???? CL_TENSOR_DTYPE_FP64_EXP 0x???? @@ -795,7 +796,17 @@ and true values, respectively.] | CL_TENSOR_DTYPE_UINT16_EXP | 16-bit unsigned integer. | cl_ushort. | CL_TENSOR_DTYPE_UINT32_EXP | 32-bit unsigned integer. | cl_uint. | CL_TENSOR_DTYPE_UINT64_EXP | 64-bit unsigned integer. | cl_ulong. -| CL_TENSOR_DTYPE_FP8_EXP | Half precision floating-point. | cl_char. + +| CL_TENSOR_DTYPE_FP8E4M3_EXP | 8-bit floating point with a sign bit, + 4 exponent bits, 3 mantissa bits and a exponent bias of 7. +| cl_char. + +| CL_TENSOR_DTYPE_FP8E5M2_EXP | 8-bit floating point with a sign bit, + 5 exponent bits, 2 mantissa bits and a exponent bias of 15. +| cl_char. + +// Reference: https://arxiv.org/pdf/2209.05433 + | CL_TENSOR_DTYPE_FP16_EXP | Half precision floating-point. | cl_half. | CL_TENSOR_DTYPE_BFLOAT16_EXP | 16-bit brain floating-point. | cl_ushort | CL_TENSOR_DTYPE_FP32_EXP | Single precision floating-point. | cl_float. @@ -812,20 +823,29 @@ and true values, respectively.] |=== | *layout type* | *tensor layout type* | *Description* -| CL_TENSOR_LAYOUT_OPAQUE_EXP | N/A | The tensor doesn't have application - defined memory layout. Driver controls the tensors layout. To read - or write elements of the tensor +| CL_TENSOR_LAYOUT_OPAQUE_EXP | N/A a| The tensor doesn't have + application defined memory layout. Driver controls the tensors + layout. 
To read or write elements of the tensor, the application + must: + +* use *clEnqueueExportToTensor* and *clEnqueueImportFromTensor* (or their + command buffer variants) or +* use *clEnqueueCopyTensor* to copy elements to / from another tensor + object with an application-defined memory layout. | CL_TENSOR_LAYOUT_BLAS_EXP |<> | A type that describes a packed memory layout similar ones used in BLAS APIs. -| CL_TENSOR_LAYOUT_BLAS_EXP +| CL_TENSOR_LAYOUT_BLAS_PITCHED_EXP |<> | A type that describe memory layout similar ones used in BLAS APIs. -| CL_TENSOR_LAYOUT_ML_EXP | <> | -A convenience layout type over `CL_TENSOR_LAYOUT_BLAS_EXP`. +| CL_TENSOR_LAYOUT_ML_EXP | <> | + +The tensor layout is specified with an enumerator. Each enumerator +corresponds to a predefined configuration of +*cl_tensor_layout_blas_exp* structure. |=== @@ -878,15 +898,15 @@ typedef struct cl_tensor_layout_ml_exp { rank is greater than two. // ^ This condition is meant to ensure that the tensor elements at different -// coordinates don't alias. +// coordinates don't alias in memory. * _ml_type_ defines the memory layout via enumerators which corresponds to predefined configurations of `cl_tensor_layout_blas_exp` structure as listed in <> table. The memory layout descriptions map tensor coordinates to buffer's -memory byte locations respect to buffer's base address as followed in -pseudo C: +memory byte locations respect to buffer's base address as in the +followed in pseudo C code example: [source,c] ---- @@ -1047,7 +1067,7 @@ clEnqueueMapBuffer( | Version | Date | Author | Changes | 0.1.0 | 2023-11-23 | Henry Linjamäki | *Initial revision* -| 0.2.0 | 2024-8-14 | Henry Linjamäki | +| 0.2.0 | 2024-8-14 | Henry Linjamäki a| * Rework document structure match to the cl_khr_extension_template. From 294b1a19a357893da91a28330546e0500927a452 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 15 Aug 2024 14:09:49 +0300 Subject: [PATCH 24/26] Add people who gave feedback in the version history --- extensions/cl_exp_tensor.asciidoc | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/extensions/cl_exp_tensor.asciidoc b/extensions/cl_exp_tensor.asciidoc index f7a2b9c0..2cb4ad26 100644 --- a/extensions/cl_exp_tensor.asciidoc +++ b/extensions/cl_exp_tensor.asciidoc @@ -1060,14 +1060,19 @@ clEnqueueMapBuffer( == Version History -[cols="5,15,15,70"] +[cols="5,10,15,40"] [grid="rows"] [options="header"] |==== -| Version | Date | Author | Changes -| 0.1.0 | 2023-11-23 | Henry Linjamäki | *Initial revision* - -| 0.2.0 | 2024-8-14 | Henry Linjamäki a| +| Version | Date | Author | Changes +| 0.1.0 | 2023-11-23 | Henry Linjamäki | *Initial revision* + +| 0.2.0 | 2024-8-14 | +Henry Linjamäki + +Pekka Jääskeläinen + +Michal Babej + +Freddie Witherden +a| * Rework document structure match to the cl_khr_extension_template. From 2293467fd922a732cf97c91b214eab47811ab598 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 15 Aug 2024 14:17:24 +0300 Subject: [PATCH 25/26] Update html render (temporary) --- extensions/cl_exp_tensor.html | 132 ++++++++++++++++++++++------------ 1 file changed, 85 insertions(+), 47 deletions(-) diff --git a/extensions/cl_exp_tensor.html b/extensions/cl_exp_tensor.html index 29822a4d..db1045c9 100644 --- a/extensions/cl_exp_tensor.html +++ b/extensions/cl_exp_tensor.html @@ -535,7 +535,7 @@

[The remaining hunks of this patch update the generated cl_exp_tensor.html render. The textual changes mirror the cl_exp_tensor.asciidoc edits of patches 21-24 above; in addition, the "Built On" date in the rendered header moves from 2024-08-14 to 2024-08-15.]

        From 01a415857be1e13195ba4933bea4b92cfa9a2460 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Fri, 16 Aug 2024 09:28:52 +0300 Subject: [PATCH 26/26] Update extensions/cl_exp_tensor.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Pekka Jääskeläinen --- extensions/cl_exp_tensor.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/extensions/cl_exp_tensor.asciidoc b/extensions/cl_exp_tensor.asciidoc index 2cb4ad26..619cca5c 100644 --- a/extensions/cl_exp_tensor.asciidoc +++ b/extensions/cl_exp_tensor.asciidoc @@ -906,7 +906,7 @@ typedef struct cl_tensor_layout_ml_exp { The memory layout descriptions map tensor coordinates to buffer's memory byte locations respect to buffer's base address as in the -followed in pseudo C code example: +following pseudo C code example: [source,c] ----