API Overhead Benchmark is a set of tests aimed at measuring the CPU-side execution duration of compute API calls.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
AppendLaunchKernel | Measures time spent in zeCommandListAppendLaunchKernel on CPU. | | ✔️ | ❌ |
AppendWaitOnEventsImmediate | Measures time spent in zeCommandListAppendWaitOnEvents using an immediate command list. | | ✔️ | ❌ |
CreateBuffer | Measures time spent in clCreateBuffer on CPU. | | ❌ | ✔️ |
CreateCommandList | Measures time spent in zeCommandListCreate on CPU. | | ✔️ | ❌ |
CreateCommandListImmediate | Measures time spent in zeCommandListCreateImmediate on CPU. | | ✔️ | ❌ |
DestroyCommandList | Measures time spent in zeCommandListDestroy on CPU. | | ✔️ | ❌ |
DestroyCommandListImmediate | Measures time spent in zeCommandListDestroy on CPU for an immediate command list. | | ✔️ | ❌ |
DriverGet | Measures time spent in the zeDriverGet call on CPU. | | ✔️ | ❌ |
DriverGetApiVersion | Measures time spent in the zeDriverGetApiVersion call on CPU. | | ✔️ | ❌ |
DriverGetProperties | Measures time spent in the zeDriverGetProperties call on CPU. | | ✔️ | ❌ |
EnqueueNdrNullLws | Measures time spent in clEnqueueNDRangeKernel on CPU. A null LWS is provided, which causes the driver to calculate it. | | ❌ | ✔️ |
EnqueueNdrTime | Measures time spent in clEnqueueNDRangeKernel on CPU. | | ❌ | ✔️ |
EventCreation | Measures time spent to create an event. | | ✔️ | ❌ |
EventQueryStatus | Measures time spent to query an event's status. | | ✔️ | ❌ |
ExecuteCommandList | Measures time spent in zeCommandQueueExecuteCommandLists on CPU. | | ✔️ | ❌ |
ExecuteCommandListForCopyEngine | Measures CPU time spent in zeCommandQueueExecuteCommandLists for a copy-only path. | | ✔️ | ❌ |
ExecuteCommandListImmediate | Measures time spent appending a kernel launch to an immediate command list on CPU. | | ✔️ | ❌ |
ExecuteCommandListImmediateCopyQueue | Measures time spent appending a memory copy to an immediate command list on CPU, using a copy queue. | | ✔️ | ❌ |
ExecuteCommandListImmediateMultiKernel | Measures time spent executing multiple instances of two different kernels with an immediate command list on CPU. | | ✔️ | ❌ |
ExecuteCommandListWithFenceCreate | Measures time spent in zeFenceCreate on CPU when fences are used. | | ✔️ | ❌ |
ExecuteCommandListWithFenceDestroy | Measures time spent in zeFenceDestroy on CPU when fences are used. | | ✔️ | ❌ |
ExecuteCommandListWithFenceUsage | Measures time spent in zeCommandQueueExecuteCommandLists and zeFenceSynchronize on CPU when fences are used. | | ✔️ | ❌ |
ExecuteCommandListWithIndirectAccess | Measures time spent in zeCommandQueueExecuteCommandLists on CPU when indirect allocations are accessed. | | ✔️ | ❌ |
ExecuteCommandListWithIndirectArguments | Measures time spent in zeCommandQueueExecuteCommandLists on CPU when indirect allocations are used. | | ✔️ | ❌ |
FlushTime | Measures time spent in clEnqueueNDRangeKernel on CPU. | | ❌ | ✔️ |
KernelSetArgumentValueImmediate | Measures time spent in zeKernelSetArgumentValue for immediate arguments on CPU. | | ✔️ | ❌ |
LifecycleCommandList | Measures time spent in zeCommandListCreate + Close + Execute on CPU. | | ✔️ | ❌ |
ResetCommandList | Measures time spent in zeCommandListReset on CPU. | | ✔️ | ❌ |
SetKernelArgSvmPointer | Measures time spent in clSetKernelArgSVMPointer on CPU. | | ✔️ | ✔️ |
SetKernelGroupSize | Measures time spent in zeKernelSetGroupSize on CPU. | | ✔️ | ❌ |
UsmMemoryAllocation | Measures time spent in USM memory allocation APIs. | | ✔️ | ❌ |
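
All of these tests share one measurement pattern: wall-clock a single driver entry point on the CPU. A minimal sketch of that pattern for AppendLaunchKernel, assuming the command list, kernel, and group count were created beforehand (the function name here is hypothetical, not taken from the benchmark sources):

```cpp
#include <chrono>
#include <level_zero/ze_api.h>

// Returns the CPU-side duration of a single zeCommandListAppendLaunchKernel
// call in microseconds. Handles are assumed valid; errors are ignored for brevity.
double measureAppendLaunchKernel(ze_command_list_handle_t cmdList,
                                 ze_kernel_handle_t kernel) {
    ze_group_count_t groupCount{1u, 1u, 1u};
    const auto start = std::chrono::steady_clock::now();
    zeCommandListAppendLaunchKernel(cmdList, kernel, &groupCount,
                                    nullptr, 0, nullptr); // the timed call
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count();
}
```

In practice such a call is repeated many times and the results aggregated; the single-call sketch only shows what is being timed.
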
Atomic Benchmark is a set of tests aimed at measuring performance of atomic operations inside kernels.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
OneAtomic | Enqueues a kernel performing an atomic operation on a single address. | | ❌ | ✔️ |
OneAtomicExplicit | Enqueues a kernel performing an atomic operation on a single address using OpenCL 2.0 atomics with explicit memory order and scope. | | ❌ | ✔️ |
OneLocalAtomic | Enqueues a kernel performing an atomic operation on a single location placed in SLM. | | ❌ | ✔️ |
OneLocalAtomicExplicit | Enqueues a kernel performing an atomic operation on a single location placed in SLM using OpenCL 2.0 atomics with explicit memory order and scope. | | ❌ | ✔️ |
SeparateAtomics | Enqueues a kernel performing an atomic operation on different addresses. | | ❌ | ✔️ |
SeparateAtomicsExplicit | Enqueues a kernel performing an atomic operation on different addresses using OpenCL 2.0 atomics with explicit memory order and scope. | | ❌ | ✔️ |
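
The difference between the "One" and "Separate" variants is contention. Illustrative OpenCL C sources, held as C++ string literals and not taken from the benchmark's actual kernels, could look like this:

```cpp
// Maximum contention: every work-item targets the same address.
const char *oneAtomicSrc = R"CLC(
__kernel void one_atomic(__global int *counter) {
    atomic_add(counter, 1);
}
)CLC";

// No contention: each work-item targets its own address.
const char *separateAtomicsSrc = R"CLC(
__kernel void separate_atomics(__global int *counters) {
    atomic_add(&counters[get_global_id(0)], 1);
}
)CLC";
```
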
EU Benchmark is a set of tests aimed at measuring performance of calculations performed in kernels.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
DoMathOperation | Enqueues a kernel performing a math operation. | | ❌ | ✔️ |
ReadAfterAtomicWrite | Enqueues a kernel which writes to global memory atomically and then reads the location non-atomically. | | ❌ | ✔️ |
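
As a sketch of what a DoMathOperation-style kernel might look like (a hypothetical kernel, with a dependent-FMA chain standing in for whichever operation the test selects):

```cpp
const char *doMathSrc = R"CLC(
__kernel void do_math(__global float *data, int iterations) {
    float v = data[get_global_id(0)];
    // A chain of dependent FMAs keeps the EU busy with arithmetic
    // rather than memory traffic.
    for (int i = 0; i < iterations; ++i)
        v = fma(v, 0.5f, 1.0f);
    data[get_global_id(0)] = v;
}
)CLC";
```
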
GPU Commands Benchmark is a set of tests aimed at measuring GPU-side execution duration of various commands.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
BarrierBetweenKernels | Measures time required to run a barrier command between two kernels, including potential cache-flush commands. | | ✔️ | ❌ |
CopyWithEvent | Measures time required to run a copy kernel with various event configurations. | | ✔️ | ❌ |
EmptyKernel | Measures time required to run an empty kernel on GPU. | | ✔️ | ❌ |
EventCtxtSwitchLatency | Measures the context-switch latency required to switch between various engine types. | | ✔️ | ❌ |
KernelWithEvent | Measures time required to run an empty kernel with various event configurations. | | ✔️ | ❌ |
KernelWithWork | Measures time required to run a GPU kernel which assigns values to elements of a buffer. | | ✔️ | ❌ |
WaitOnEventCold | Measures time required to service a signalled semaphore that has never been waited for. | | ✔️ | ❌ |
WaitOnEventFromWalker | Measures time required to service a signalled semaphore coming from a Walker command. | | ✔️ | ❌ |
WaitOnEventHot | Measures time required to service a signalled semaphore that was previously used. | | ✔️ | ❌ |
WriteTimestamp | Measures time required to write a timestamp on GPU. | | ✔️ | ❌ |
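
For example, the GPU-side timestamp write that WriteTimestamp times can be appended like this (a minimal sketch; the destination must be GPU-visible memory and all handles are assumed to exist):

```cpp
#include <level_zero/ze_api.h>

// Appends a command that writes the device's global timestamp into
// dstBuffer when the command list executes on the GPU.
void appendTimestampWrite(ze_command_list_handle_t cmdList, uint64_t *dstBuffer) {
    zeCommandListAppendWriteGlobalTimestamp(cmdList, dstBuffer,
                                            nullptr, 0, nullptr);
}
```
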
Memory Benchmark is a set of tests aimed at measuring bandwidth of memory transfers.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
CopyBuffer | Allocates two OpenCL buffers and measures copy bandwidth between them. Buffers will be placed in device memory, if it's available. | | ❌ | ✔️ |
CopyBufferRect | Allocates two OpenCL buffers and measures rectangle copy bandwidth between them. Buffers will be placed in device memory, if it's available. | | ❌ | ✔️ |
CopyEntireImage | Allocates two image objects and measures copy bandwidth between them. Images will be placed in device memory, if it's available. | | ✔️ | ✔️ |
FillBuffer | Allocates an OpenCL buffer and measures fill bandwidth. The buffer will be placed in device memory, if it's available. | | ❌ | ✔️ |
MapBuffer | Allocates an OpenCL buffer and measures map bandwidth. The mapping operation means a memory transfer from GPU to CPU or a no-op, depending on map flags. | | ❌ | ✔️ |
ReadBuffer | Allocates an OpenCL buffer and measures read bandwidth. The read operation means a transfer from GPU to CPU. | | ❌ | ✔️ |
ReadBufferMisaligned | Allocates an OpenCL buffer and measures read bandwidth. The read operation means a transfer from GPU to CPU. The destination pointer passed by the application will be misaligned by the specified number of bytes. | | ❌ | ✔️ |
ReadBufferRect | Allocates an OpenCL buffer and measures rectangle read bandwidth. The rectangle read operation means a transfer from GPU to CPU. | | ❌ | ✔️ |
ReadDeviceMemBuffer | Allocates two OpenCL buffers and measures source-buffer read bandwidth. The source buffer resides in device memory. | | ❌ | ✔️ |
RemoteAccessMemory | Uses a stream memory triad to measure bandwidth with different percentages of remote memory access. | | ❌ | ✔️ |
SLM_DataAccessLatency | Generates SLM local-memory transactions inside a thread group to measure latency between reads (uses the Intel-only intel_get_cycle_counter() built-in). | | ❌ | ✔️ |
SlmSwitchLatency | Enqueues two kernels with different SLM sizes and measures the switch time between these kernels. | | ✔️ | ❌ |
StreamAfterTransfer | Measures how stream kernels perform right after a host-to-device transfer populating the data. The test cleans caches, then issues the transfers, then follows with a stream kernel and measures its GPU execution time. | | ❌ | ✔️ |
StreamMemory | Streams memory inside a kernel in a fashion described by 'type'. Copy means one memory location is read from and a second one is written to. Triad means two buffers are read and one is written to. In read and write, memory is only read or only written to. | | ✔️ | ✔️ |
StreamMemoryImmediate | Streams memory inside a kernel in a fashion described by 'type', using an immediate command list. Copy means one memory location is read from and a second one is written to. Triad means two buffers are read and one is written to. In read and write, memory is only read or only written to. | | ✔️ | ❌ |
UnmapBuffer | Allocates an OpenCL buffer and measures unmap bandwidth. The unmapping operation means a memory transfer from CPU to GPU or a no-op, depending on map flags. | | ❌ | ✔️ |
UsmCopy | Allocates two unified shared memory buffers and measures copy bandwidth between them. | | ✔️ | ✔️ |
UsmCopyImmediate | Allocates two unified shared memory buffers and measures copy bandwidth between them using an immediate command list. | | ✔️ | ❌ |
UsmCopyMultipleBlits | Allocates two unified shared memory buffers, divides them into chunks, copies each chunk using a different copy engine, and measures bandwidth. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --bliters argument), it gets half of the buffer and the rest is divided between the remaining copy engines; otherwise the division is even. | | ✔️ | ✔️ |
UsmCopyStagingBuffers | Measures copy time from device/host to host/device. Host memory is a non-USM allocation. The copy is done through staging USM buffers; the non-USM host pointer is never passed to the L0 API, only through the staging buffers. | | ✔️ | ❌ |
UsmFill | Allocates a unified memory buffer and measures fill bandwidth. | | ✔️ | ✔️ |
UsmFillImmediate | Allocates a unified memory buffer and measures fill bandwidth using an immediate command list. | | ✔️ | ❌ |
UsmFillMultipleBlits | Allocates a unified shared memory buffer, divides it into chunks, fills each chunk using a different copy engine, and measures bandwidth. Refer to UsmCopyMultipleBlits for more details. | | ✔️ | ✔️ |
UsmFillSpecificPattern | Allocates a unified memory buffer and measures fill bandwidth. Allows specifying an arbitrary pattern. | | ✔️ | ✔️ |
UsmImmediateCopyMultipleBlits | Allocates two unified shared memory buffers, divides them into chunks, copies each chunk using a different copy engine with an immediate command list, and measures bandwidth. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --bliters argument), it gets half of the buffer and the rest is divided between the remaining copy engines; otherwise the division is even. | | ✔️ | ❌ |
UsmMemset | Allocates a unified memory buffer and measures memset bandwidth. | | ❌ | ✔️ |
UsmSharedMigrateCpu | Allocates a unified shared memory buffer and measures bandwidth for a kernel that must migrate the resource from GPU to CPU. | | ✔️ | ✔️ |
UsmSharedMigrateGpu | Allocates a unified shared memory buffer and measures bandwidth for a kernel that must migrate the resource from CPU to GPU. | | ✔️ | ✔️ |
UsmSharedMigrateGpuForFill | Allocates a unified shared memory buffer and measures bandwidth for a memory fill operation that must migrate the resource from CPU to GPU. | | ✔️ | ✔️ |
WriteBuffer | Allocates an OpenCL buffer and measures write bandwidth. The write operation means a transfer from CPU to GPU. | | ❌ | ✔️ |
WriteBufferRect | Allocates an OpenCL buffer and measures rectangle write bandwidth. The rectangle write operation means a transfer from CPU to GPU. | | ❌ | ✔️ |
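
To make the 'type' semantics of StreamMemory concrete, here is a sketch of the triad flavour as an OpenCL C kernel held in a C++ string literal (illustrative only, not the benchmark's actual kernel): two buffers are read and one is written, so the reported bandwidth accounts for three memory streams.

```cpp
const char *streamTriadSrc = R"CLC(
__kernel void stream_triad(__global const float *a,
                           __global const float *b,
                           __global float *c,
                           float scalar) {
    const size_t i = get_global_id(0);
    c[i] = a[i] + scalar * b[i]; // two reads and one write per work-item
}
)CLC";
```
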
Miscellaneous Benchmark is a set of tests measuring different simple compute scenarios.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
IoqKernelSwitchLatency | Measures time from the end of one kernel to the start of the next kernel for an in-order queue. | | ❌ | ✔️ |
KernelWithWork | Measures time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value. The benchmark checks the impact of kernel split. | | ❌ | ✔️ |
Reduction | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
Reduction2 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
Reduction3 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
Reduction4 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
Reduction5 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
VectorSum | Performs vector addition. | | ❌ | ✔️ |
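
The reduction scheme these tests describe can be sketched as follows (a hypothetical kernel; the numbered variants presumably differ in implementation details not captured here):

```cpp
const char *reductionSrc = R"CLC(
__kernel void reduce(__global const int *input, __global int *result) {
    // Every work-item folds its element into one shared location.
    atomic_add(result, input[get_global_id(0)]);
}
)CLC";
```
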
Multithread Benchmark is a set of tests aimed at measuring how different commands benefit from multithreaded execution.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
ImmediateCommandListCompletion | Measures completion latency of AppendMemoryCopy issued from multiple threads to immediate command lists. Engines used for submissions are selected based on the enabled bits of engineMask. 'threadsPerEngine' threads submit commands to each selected engine. If 'numberOfThreads' is greater than 'threadsPerEngine' x selected engine count, the excess threads are assigned to the selected engines one each, in a round-robin manner. If the selected engine count == 1, all threads are assigned to that engine. | | ✔️ | ❌ |
ImmediateCommandListSubmission | Measures submission latency of AppendLaunchKernel issued from multiple threads to immediate command lists. 'threadsPerEngine' threads submit commands to each engine. If 'numberOfThreads' is greater than 'threadsPerEngine' x engine count, the excess threads are assigned to the engines one each, in a round-robin manner. If engineCount == 1, all threads are assigned to that engine. | | ✔️ | ❌ |
SvmCopy | Enqueues multiple SVM copies on multiple threads concurrently. | | ✔️ | ✔️ |
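
A minimal sketch of the SvmCopy pattern, assuming queues and SVM pointers were set up elsewhere (the function name is hypothetical and error handling is omitted):

```cpp
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <thread>
#include <vector>

// Each thread drives its own command queue and issues a blocking SVM copy,
// so all copies are in flight concurrently.
void svmCopyConcurrently(std::vector<cl_command_queue> &queues,
                         std::vector<void *> &dst, std::vector<void *> &src,
                         size_t bytes) {
    std::vector<std::thread> workers;
    for (size_t t = 0; t < queues.size(); ++t)
        workers.emplace_back([&, t] {
            clEnqueueSVMMemcpy(queues[t], CL_TRUE, dst[t], src[t], bytes,
                               0, nullptr, nullptr);
        });
    for (auto &w : workers)
        w.join();
}
```
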
Multi-tile Memory Benchmark is a set of tests aimed at measuring bandwidth of memory transfers performed on a multi-tile device.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
CopyBuffer | Allocates two OpenCL buffers and measures copy bandwidth between them. Buffers will be placed in device memory, if it's available. | | ❌ | ✔️ |
FillBuffer | Allocates an OpenCL buffer and measures fill bandwidth. The buffer will be placed in device memory, if it's available. | | ❌ | ✔️ |
ReadBuffer | Allocates an OpenCL buffer and measures read bandwidth. The read operation means a transfer from GPU to CPU. | | ❌ | ✔️ |
UsmBidirectionalCopy | Allocates two unified device memory buffers, each on a different tile, and measures copy bandwidth between them. The test measures copies in both directions, which can be controlled with the -write parameter. With -write=1, each tile performs a write operation: the queue is placed in tile 0, the source buffer is in tile 0, and the destination is in tile 1; similarly for tile 1, the queue is placed in tile 1, the source in tile 1, and the destination in tile 0. With -write=0, the destination and source are flipped: the queue is placed in tile 0, the source buffer is in tile 1, and the destination is in tile 0, while for tile 1, the queue is placed in tile 1, the source in tile 0, and the destination in tile 1. | | ✔️ | ❌ |
UsmCopy | Allocates two unified shared memory buffers and measures copy bandwidth between them using a builtin function. | | ✔️ | ✔️ |
UsmCopyImmediate | Allocates two unified shared memory buffers and measures copy bandwidth between them using a builtin function appended to an immediate command list. | | ✔️ | ❌ |
UsmCopyKernel | Allocates two unified shared memory buffers and measures copy bandwidth between them using a custom kernel. | | ✔️ | ✔️ |
UsmFill | Allocates a unified shared memory buffer and measures fill bandwidth. | | ✔️ | ✔️ |
UsmSharedMigrateCpu | Allocates a unified shared memory buffer and measures the time to migrate it from GPU to CPU. | | ✔️ | ✔️ |
UsmSharedMigrateGpu | Allocates a unified shared memory buffer and measures the time to migrate it from CPU to GPU. | | ✔️ | ✔️ |
WriteBuffer | Allocates an OpenCL buffer and measures write bandwidth. The write operation means a transfer from CPU to GPU. | | ❌ | ✔️ |
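
These tests address individual tiles as Level Zero sub-devices. A sketch of how the tiles could be enumerated (standard zeDeviceGetSubDevices usage, not code taken from the benchmark):

```cpp
#include <level_zero/ze_api.h>
#include <vector>

// Returns the tile (sub-device) handles of a multi-tile device.
std::vector<ze_device_handle_t> getTiles(ze_device_handle_t device) {
    uint32_t count = 0;
    zeDeviceGetSubDevices(device, &count, nullptr);      // query tile count
    std::vector<ze_device_handle_t> tiles(count);
    zeDeviceGetSubDevices(device, &count, tiles.data()); // fetch handles
    return tiles;
}
```
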
Overlap Benchmark is a set of tests aimed at measuring how different commands benefit from simultaneous execution.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
KernelAndCopy | Enqueues a kernel and a copy operation, with the ability to perform the two tasks on different command queues. | | ❌ | ✔️ |
MultiProcessCompute | Creates a number of separate processes for each specified tile, each performing a compute workload, and measures the average time to complete all of them. Processes use an affinity mask to select specific sub-devices for execution. | | ✔️ | ❌ |
MultiProcessComputeSharedBuffer | Creates a number of separate processes for each specified tile, each performing a compute workload, and measures the average time to complete all of them. Processes use an affinity mask to select specific sub-devices for execution. A single buffer per tile is created by the parent process; all processes executing on a given tile share it via IPC calls. | | ✔️ | ❌ |
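
The IPC sharing step that MultiProcessComputeSharedBuffer relies on can be sketched with the standard Level Zero IPC calls; transporting the handle between processes (e.g. over a pipe) is elided, and the function names are hypothetical:

```cpp
#include <level_zero/ze_api.h>

// Parent process: export an existing device allocation as an IPC handle.
ze_ipc_mem_handle_t exportBuffer(ze_context_handle_t ctx, void *deviceBuffer) {
    ze_ipc_mem_handle_t ipcHandle{};
    zeMemGetIpcHandle(ctx, deviceBuffer, &ipcHandle);
    return ipcHandle;
}

// Child process: map the same allocation into its own address space.
void *importBuffer(ze_context_handle_t ctx, ze_device_handle_t device,
                   ze_ipc_mem_handle_t ipcHandle) {
    void *ptr = nullptr;
    zeMemOpenIpcHandle(ctx, device, ipcHandle, 0, &ptr);
    return ptr;
}
```
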
P2P Benchmark is a set of tests aimed at measuring bandwidth and latency of memory transfers between peer devices.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
UsmCopyMultipleBlits | Allocates two unified device memory buffers on separate devices and performs a copy between sections (or chunks) of them, each chunk using a different copy engine, and measures bandwidth. The test first checks for P2P capabilities on the target platform before submitting the copy. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --bliters argument), it gets half of the buffer and the rest is divided between the remaining copy engines; otherwise the division is even. | | ✔️ | ❌ |
UsmEUCopy | Allocates two unified device memory buffers on separate devices, performs a copy between them using a compute engine, and reports bandwidth. The test first checks for P2P capabilities on the target platform before submitting the copy. | | ✔️ | ❌ |
UsmImmediateCopyMultipleBlits | Allocates two unified device memory buffers on separate devices and performs a copy between sections (or chunks) of them, each chunk using a different copy engine with an immediate command list, and measures bandwidth. The test first checks for P2P capabilities on the target platform before submitting the copy. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --bliters argument), it gets half of the buffer and the rest is divided between the remaining copy engines; otherwise the division is even. | | ✔️ | ❌ |
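
The capability check these tests perform before submitting any copy maps directly onto zeDeviceCanAccessPeer (a minimal sketch):

```cpp
#include <level_zero/ze_api.h>

// True if 'device' can access allocations that live on 'peer'.
bool canAccessPeer(ze_device_handle_t device, ze_device_handle_t peer) {
    ze_bool_t value = false;
    zeDeviceCanAccessPeer(device, peer, &value);
    return value != 0;
}
```
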
ULLS Benchmark is a set of tests aimed at measuring Ultra Low Latency Submission (ULLS) performance impact.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
BestSubmission | Enqueues a system memory write via PIPE_CONTROL and measures when the update becomes visible on the CPU. | | ✔️ | ❌ |
BestWalkerSubmission | Enqueues a kernel which updates a system memory location and then busy-loops on the CPU until the update becomes visible. | | ✔️ | ✔️ |
BestWalkerSubmissionImmediate | Enqueues a kernel which updates a system memory location and then busy-loops on the CPU until the update becomes visible. The kernel is enqueued using a low-latency immediate command list, so the test is LevelZero-specific. | | ✔️ | ❌ |
BestWalkerSubmissionImmediateMultiCmdlists | Appends N kernels on N command lists, each updating a system memory location, then busy-loops on the CPU until the updates become visible. Kernels are appended using immediate command lists. The number of command lists is specified by cmdlistCount. | | ✔️ | ❌ |
CompletionLatency | Enqueues a system memory write and measures the time between the moment the update is visible on the CPU and the moment the synchronizing call returns. | | ✔️ | ❌ |
CopySubmissionEvents | Enqueues a 4-byte copy to a copy engine and returns the submission delta, which is the time between the host API call and the copy engine start. | | ✔️ | ✔️ |
EmptyKernel | Enqueues an empty kernel and measures the time to launch it and wait for it on the CPU, thus measuring walker spawn time. | | ✔️ | ✔️ |
EmptyKernelImmediate | Enqueues an empty kernel and measures the time to launch it using an immediate command list and wait for it on the CPU, thus measuring walker spawn time. | | ✔️ | ❌ |
KernelSwitchLatency | Measures time from the end of one kernel to the start of the next kernel. | | ✔️ | ✔️ |
KernelSwitchLatencyImmediate | Measures time from the end of one kernel to the start of the next kernel, using immediate command lists. | | ✔️ | ❌ |
KernelWithWork | Measures time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value. | | ✔️ | ✔️ |
KernelWithWorkImmediate | Measures time required to run a GPU kernel which assigns constant values to elements of a buffer, using an immediate command list. Each thread assigns one value. | | ✔️ | ❌ |
MultiProcessImmediateCmdlistCompletion | Measures completion latency of AppendMemoryCopy issued from multiple processes to immediate command lists. Engines used for submissions are selected based on the enabled bits of engineMask; bits of engineMask are indexed from right to left, so the rightmost bit represents the first engine and the leftmost the last. 'processesPerEngine' processes submit commands to each selected engine. If 'numberOfProcesses' is greater than 'processesPerEngine' x selected engine count, the excess processes are assigned to the selected engines one each, in a round-robin manner. If the selected engine count == 1, all processes are assigned to that engine. | | ✔️ | ❌ |
MultiProcessImmediateCmdlistSubmission | Measures submission latency of a walker command issued from multiple processes to immediate command lists. 'processesPerEngine' processes submit commands to each engine. If 'numberOfProcesses' is greater than 'processesPerEngine' x engine count, the excess processes are assigned to the engines one each, in a round-robin manner. If engineCount == 1, all processes are assigned to that engine. | | ✔️ | ❌ |
MultiQueueSubmission | Enqueues a kernel on multiple command queues. | | ✔️ | ✔️ |
NewResourcesSubmissionDevice | Enqueues a kernel that uses a buffer placed in device memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation. | | ✔️ | ✔️ |
NewResourcesSubmissionHost | Enqueues a kernel that uses a buffer placed in host memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation. | | ✔️ | ✔️ |
NewResourcesWithGpuAccess | Enqueues a kernel that accesses an entire buffer placed in device memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation. | | ✔️ | ✔️ |
QueuePriorities | Uses queues with different priorities to measure submission and context-switch latencies. | | ❌ | ✔️ |
ResourceReassign | Enqueues a stress kernel which utilizes the majority of the GPU's execution units, then enqueues the next kernel and measures its execution time. Shows the overhead related to releasing and reassigning the GPU's resources. | | ❌ | ✔️ |
RoundTripSubmission | Enqueues a kernel which updates a system memory location and waits for it with a synchronizing API. | | ✔️ | ✔️ |
UsmSharedFirstCpuAccess | Allocates a unified shared memory buffer and measures the time to access it on the CPU after creation. | | ✔️ | ✔️ |
UsmSharedFirstGpuAccess | Allocates a unified shared memory buffer and measures the time to access it on the GPU after creation. | | ✔️ | ✔️ |
WalkerCompletionLatency | Enqueues a kernel writing to system memory and measures the time between the moment the update is visible on the CPU and the moment the synchronizing call returns. | | ✔️ | ✔️ |
WalkerSubmissionEvents | Enqueues an empty kernel with GPU-side profiling and checks the delta between queue time and start time. | | ✔️ | ✔️ |
WriteLatency | Unblocks an event on the GPU, then waits for a timestamp to be written. | | ✔️ | ❌ |
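
The busy-loop pattern behind the BestWalkerSubmission variants can be sketched as follows, assuming an immediate command list, a kernel that stores a nonzero flag, and a host-visible USM allocation (all names hypothetical):

```cpp
#include <chrono>
#include <cstdint>
#include <level_zero/ze_api.h>

// Submits a flag-writing kernel through an immediate command list and
// returns the microseconds until the GPU store becomes visible on the CPU.
double timeWalkerVisibility(ze_command_list_handle_t immCmdList,
                            ze_kernel_handle_t flagWriter,
                            volatile uint32_t *flag) {
    *flag = 0;
    ze_group_count_t one{1u, 1u, 1u};
    const auto start = std::chrono::steady_clock::now();
    zeCommandListAppendLaunchKernel(immCmdList, flagWriter, &one,
                                    nullptr, 0, nullptr); // immediate submit
    while (*flag == 0) {
        // busy-wait until the kernel's store lands in system memory
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count();
}
```
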