## api_overhead_benchmark

API Overhead Benchmark is a set of tests aimed at measuring the CPU-side execution duration of compute API calls. A minimal timing sketch follows the table below.

Test name Description Params L0 OCL
AppendLaunchKernel Measures time spent in zeCommandListAppendLaunchKernel on CPU.
  • --event Pass output event to the enqueue call (0 or 1)
  • --wgc Workgroup count
  • --wgs Workgroup size, pass 0 to make the driver calculate it during enqueue
✔️
AppendWaitOnEventsImmediate Measures time spent in zeCommandListAppendWaitOnEvents using an immediate command list.
  • --eventSignaled Event is already signaled before the zeCommandListAppendWaitOnEvents call (0 or 1)
✔️
CreateBuffer Measures time spent in clCreateBuffer on CPU.
  • --allocateAll Free buffers at the end of the test, as opposed to freeing between iterations. This should disallow resource reuse (0 or 1)
  • --bufferSize Buffer size
  • --copyHostPtr CL_MEM_COPY_HOST_PTR flag (0 or 1)
  • --forceHostMemoryIntel CL_MEM_FORCE_HOST_MEMORY_INTEL flag (0 or 1)
  • --readOnly Read-only buffer (0 or 1)
✔️
CreateCommandList Measures time spent in zeCommandListCreate on CPU.
  • --CmdListCount Number of cmdlists to create
  • --CopyOnly Create copy-only cmdlist (0 or 1)
✔️
CreateCommandListImmediate Measures time spent in zeCommandListCreateImmediate on CPU.
  • --CmdListCount Number of cmdlists to create
✔️
DestroyCommandList Measures time spent in zeCommandListDestroy on CPU.
  • --CmdListCount Number of cmdlists to destroy
✔️
DestroyCommandListImmediate Measures time spent in zeCommandListDestroy on CPU, for an immediate cmdlist.
  • --CmdListCount Number of immediate cmdlists to destroy
✔️
DriverGet Measures time spent in the driver get call on CPU.
  • --getDriverCount Whether to measure driver count or driver get (0 or 1)
✔️
DriverGetApiVersion Measures time spent in the zeDriverGetApiVersion call on CPU.
✔️
DriverGetProperties Measures time spent in the zeDriverGetProperties call on CPU.
✔️
EnqueueNdrNullLws Measures time spent in clEnqueueNDRangeKernel on CPU. A null LWS is provided, which causes the driver to calculate it.
  • --event Pass output event to the enqueue call (0 or 1)
  • --gws Global work size
  • --ooq Use out-of-order queue (0 or 1)
  • --profiling Create a profiling queue (0 or 1)
✔️
EnqueueNdrTime Measures time spent in clEnqueueNDRangeKernel on CPU.
  • --event Pass output event to the enqueue call (0 or 1)
  • --ooq Use out-of-order queue (0 or 1)
  • --profiling Create a profiling queue (0 or 1)
  • --wgc Workgroup count
  • --wgs Workgroup size
✔️
EventCreation Measures time spent to create an event.
  • --hostVisible Event will set the host-visible flag (0 or 1)
  • --signal Type of signal scope (subdevice or device or host or none)
  • --useProfiling Event will use profiling (0 or 1)
  • --wait Type of wait scope (subdevice or device or host or none)
✔️
EventQueryStatus Measures time spent to query event status.
  • --eventSignaled Event will be set as signaled (0 or 1)
✔️
ExecuteCommandList Measures time spent in zeCommandQueueExecuteCommandLists on CPU.
  • --UseFence Pass a non-null ze_fence_handle_t to the API call (0 or 1)
  • --measureCompletionTime Measure the time taken to complete the submission (default is to measure only the Execute call) (0 or 1)
✔️
ExecuteCommandListForCopyEngine Measures CPU time spent in zeCommandQueueExecuteCommandLists for the copy-only path.
  • --UseFence Pass a non-null ze_fence_handle_t to the API call (0 or 1)
  • --measureCompletionTime Measure the time taken to complete the submission (default is to measure only the Execute call) (0 or 1)
✔️
ExecuteCommandListImmediate Measures time spent in appending a launch kernel to an immediate command list on CPU.
  • --CallsCount Number of calls being measured
  • --KernelExecutionTime How long a single kernel executes, in us
  • --MeasureCompletionTime Measure the time taken to complete the submission (default is to measure only the Immediate call) (0 or 1)
  • --UseProfiling Pass a profiling ze_event_t to the API call (0 or 1)
  • --useBarrierSynchronization Use barrier synchronization instead of waiting for the event from the last kernel (0 or 1)
✔️
ExecuteCommandListImmediateCopyQueue Measures time spent in appending a memory copy to an immediate command list on CPU with a copy queue.
  • --IsCopyOnly If true, the Copy Engine is selected. If false, the Compute Engine is selected (0 or 1)
  • --MeasureCompletionTime Measure the time taken to complete the submission (default is to measure only the Immediate call) (0 or 1)
✔️
ExecuteCommandListImmediateMultiKernel Measures time spent in executing multiple instances of two different kernels with an immediate command list on CPU.
  • --AddBarrier Add a barrier after a certain number of kernel launches; the number of kernels before the barrier is controlled by NumKernelsBeforeBarrier (0 or 1)
  • --CallsCount Number of calls being measured
  • --KernelExecutionTime How long a single kernel executes, in us
  • --NumKernelsAfterBarrier Number of kernels added after the barrier. Default is 2
  • --NumKernelsBeforeBarrier Number of kernels added prior to the barrier. Default is 2
✔️
ExecuteCommandListWithFenceCreate Measures time spent in zeFenceCreate on CPU when fences are used.
✔️
ExecuteCommandListWithFenceDestroy Measures time spent in zeFenceDestroy on CPU when fences are used.
✔️
ExecuteCommandListWithFenceUsage Measures time spent in zeCommandQueueExecuteCommandLists and zeFenceSynchronize on CPU when fences are used.
✔️
ExecuteCommandListWithIndirectAccess Measures time spent in zeCommandQueueExecuteCommandLists on CPU when indirect allocations are accessed.
  • --AmountOfIndirectAllocations Number of indirect allocations present in the system
✔️
ExecuteCommandListWithIndirectArguments Measures time spent in zeCommandQueueExecuteCommandLists on CPU when indirect allocations are used.
  • --AmountOfIndirectAllocations Number of indirect allocations present in the system
  • --placement Placement of the indirect allocations (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
✔️
FlushTime Measures time spent in clEnqueueNDRangeKernel on CPU.
  • --event Pass output event to the enqueue call (0 or 1)
  • --ooq Use out-of-order queue (0 or 1)
  • --wgc Workgroup count
  • --wgs Workgroup size, pass 0 to make the driver calculate it during enqueue
✔️
KernelSetArgumentValueImmediate Measures time spent in zeKernelSetArgumentValue for immediate arguments on CPU.
  • --argSize Kernel argument size in bytes
✔️
LifecycleCommandList Measures time spent in zeCommandListCreate + Close + Execute on CPU.
  • --CmdListCount Number of cmdlists to create
  • --CopyOnly Create copy-only cmdlist (0 or 1)
✔️
ResetCommandList Measures time spent in zeCommandListReset on CPU.
  • --CopyOnly Create copy-only cmdlist (0 or 1)
  • --size Size of the buffer
  • --sourcePlacement Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
✔️
SetKernelArgSvmPointer Measures time spent in clSetKernelArgSVMPointer on CPU.
  • --allocationSize Size of SVM allocations, in bytes
  • --allocationsCount Number of allocations
  • --reallocate Allocations will be freed and allocated again between setKernelArg calls (0 or 1)
✔️ ✔️
SetKernelGroupSize Measures time spent in zeKernelSetGroupSize on CPU.
  • --asymmetricLocalWorkSize Use asymmetric local work size (0 or 1)
✔️
UsmMemoryAllocation Measures time spent in USM memory allocation APIs.
  • --size Size to allocate
  • --type Type of memory being allocated (Device or Host or Shared)
✔️
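To make "CPU-side execution duration" concrete, here is a minimal sketch of the measurement pattern, assuming `cmdList` and `kernel` were created beforehand (the helper name and parameters are illustrative, not the benchmark's actual code):

```cpp
#include <chrono>
#include <cstdint>
#include <level_zero/ze_api.h>

// Times a single zeCommandListAppendLaunchKernel call on the CPU.
int64_t measureAppendLaunchKernel(ze_command_list_handle_t cmdList,
                                  ze_kernel_handle_t kernel,
                                  uint32_t workgroupCount) {
    ze_group_count_t groupCount{workgroupCount, 1u, 1u};
    auto start = std::chrono::steady_clock::now();
    zeCommandListAppendLaunchKernel(cmdList, kernel, &groupCount,
                                    nullptr, 0, nullptr); // no signal/wait events
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
```

The benchmarks repeat such a call many times and report statistics over the per-call durations.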

## atomic_benchmark

Atomic Benchmark is a set of tests aimed at measuring the performance of atomic operations inside kernels. A kernel sketch follows the table below.

Test name Description Params L0 OCL
OneAtomic Enqueues a kernel performing an atomic operation on a single address.
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating point (Int32 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️
OneAtomicExplicit Enqueues a kernel performing an atomic operation on a single address using OpenCL 2.0 atomics with explicit memory order and scope.
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --order Memory order of the atomic operation (relaxed or acquire or release or acq_rel or seq_cst)
  • --scope Memory scope of the atomic operation (Workgroup or Device)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating point (Int32 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️
OneLocalAtomic Enqueues a kernel performing an atomic operation on a single location placed in SLM.
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating point (Int32 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgs Work group size
✔️
OneLocalAtomicExplicit Enqueues a kernel performing an atomic operation on a single location placed in SLM using OpenCL 2.0 atomics with explicit memory order and scope.
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --order Memory order of the atomic operation (relaxed or acquire or release or acq_rel or seq_cst)
  • --scope Memory scope of the atomic operation (Workgroup or Device)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating point (Int32 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgs Work group size
✔️
SeparateAtomics Enqueues a kernel performing an atomic operation on different addresses.
  • --atomicsPerCacheline Number of used addresses occupying a single cacheline (this causes operations to be serialized)
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating point (Int32 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️
SeparateAtomicsExplicit Enqueues a kernel performing an atomic operation on different addresses using OpenCL 2.0 atomics with explicit memory order and scope.
  • --atomicsPerCacheline Number of used addresses occupying a single cacheline (this causes operations to be serialized)
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --order Memory order of the atomic operation (relaxed or acquire or release or acq_rel or seq_cst)
  • --scope Memory scope of the atomic operation (Workgroup or Device)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating point (Int32 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️
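For illustration, here is a minimal OpenCL C sketch of the kind of kernel the Explicit variants describe, assuming `--op Add`, `--type Int32`, `--order relaxed`, and `--scope Device` (the kernel name is hypothetical):

```c
// Every work item performs an explicit atomic add on one global address,
// which is the OneAtomicExplicit access pattern.
__kernel void one_atomic_explicit(__global atomic_int *counter) {
    atomic_fetch_add_explicit(counter, 1,
                              memory_order_relaxed,  // --order relaxed
                              memory_scope_device);  // --scope Device
}
```

Such a kernel must be built with -cl-std=CL2.0 or newer, since explicit atomics are an OpenCL 2.0 feature.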

## eu_benchmark

EU Benchmark is a set of tests aimed at measuring the performance of calculations performed in kernels. A kernel sketch follows the table below.

Test name Description Params L0 OCL
DoMathOperation Enqueues a kernel performing a math operation.
  • --op Math operation to perform (Add or Sub or Div or Modulo or Inc or Dec or Min or Max or And or Or or Xor)
  • --type Data type of the operation. Keep in mind not all operations are supported for floating point (Int32 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️
ReadAfterAtomicWrite Enqueues a kernel which writes to global memory using an atomic and then reads it back non-atomically.
  • --atomic If true, the write to global memory will be atomic (0 or 1)
  • --shuffleRead If true, each thread will write and read a different memory cell. Otherwise it will be the same one (0 or 1)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgs Workgroup size
✔️
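As a rough illustration of DoMathOperation, here is a hypothetical kernel sketch for `--op Add` with `--type Int32`; the loop bound and the write-back are assumptions made so the arithmetic dominates and is not optimized away:

```c
__kernel void do_math(__global int *out, int iterations) {
    int value = (int)get_global_id(0);
    for (int i = 0; i < iterations; ++i) {
        value = value + 3;  // the measured operation
    }
    out[get_global_id(0)] = value;  // write back so the loop is kept
}
```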

## gpu_cmds_benchmark

GPU Commands Benchmark is a set of tests aimed at measuring the GPU-side execution duration of various commands. A sketch of the timestamp readout follows the table below.

Test name Description Params L0 OCL
BarrierBetweenKernels Measures the time required to run a barrier command between 2 kernels, including potential cache flush commands.
  • --bytes Bytes to flush from L3
  • --memoryType Memory type cached in L3 (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --onlyReads Only reads are cached in L3
  • --remoteAccess Access cached from a remote tile
✔️
CopyWithEvent Measures the time required to run a copy kernel with various event configurations.
  • --devWaitEvent Use ZE_EVENT_SCOPE_FLAG_DEVICE for ze_event_desc_t::wait (0 or 1)
  • --hostSignalEvent Use ZE_EVENT_POOL_HOST_VISIBLE for ze_event_pool_desc_t::flags, and use ZE_EVENT_SCOPE_FLAG_HOST for ze_event_desc_t::signal (0 or 1)
  • --measuredCmds Number of commands being measured. The result is later divided by this number to obtain the time of a single command
  • --timestampEvent Use ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP for ze_event_pool_desc_t::flags (0 or 1)
✔️
EmptyKernel Measures the time required to run an empty kernel on the GPU.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
  • --wgc Workgroup count
  • --wgs Workgroup size (aka local work size)
✔️
EventCtxtSwitchLatency Measures the context switch latency required to switch between various engine types.
  • --firstEngine First engine to measure context switch latency (RCS or CCS0 or CCS1 or CCS2 or CCS3 or BCS or BCS1 or BCS2 or BCS3 or BCS4 or BCS5 or BCS6 or BCS7 or BCS8)
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
  • --secondEngine Second engine to measure context switch latency (RCS or CCS0 or CCS1 or CCS2 or CCS3 or BCS or BCS1 or BCS2 or BCS3 or BCS4 or BCS5 or BCS6 or BCS7 or BCS8)
✔️
KernelWithEvent Measures the time required to run an empty kernel with various event configurations.
  • --devWaitEvent Use ZE_EVENT_SCOPE_FLAG_DEVICE for ze_event_desc_t::wait (0 or 1)
  • --hostSignalEvent Use ZE_EVENT_POOL_HOST_VISIBLE for ze_event_pool_desc_t::flags, and use ZE_EVENT_SCOPE_FLAG_HOST for ze_event_desc_t::signal (0 or 1)
  • --measuredCmds Number of commands being measured. The result is later divided by this number to obtain the time of a single command
  • --timestampEvent Use ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP for ze_event_pool_desc_t::flags (0 or 1)
  • --wgc Workgroup count
  • --wgs Workgroup size (aka local work size)
✔️
KernelWithWork Measures the time required to run a GPU kernel which assigns values to elements of a buffer.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
  • --usedIds Which of the get_global_id() and get_local_id() calls will be used in the kernel (None or Global or Local or AtomicPerWkg)
  • --wgc Workgroup count
  • --wgs Workgroup size (aka local work size)
✔️
WaitOnEventCold Measures the time required to service a signalled semaphore that has never been waited for.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
✔️
WaitOnEventFromWalker Measures the time required to service a signalled semaphore coming from a Walker command.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
✔️
WaitOnEventHot Measures the time required to service a signalled semaphore that was previously used.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
✔️
WriteTimestamp Measures the time required to write a timestamp on the GPU.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
✔️
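Several of these tests read GPU-side durations from kernel-timestamp events. A minimal Level Zero sketch, assuming the event pool was created with ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP and `event` has already been signaled (the helper name is illustrative):

```cpp
#include <cstdint>
#include <level_zero/ze_api.h>

uint64_t kernelDurationTicks(ze_event_handle_t event) {
    ze_kernel_timestamp_result_t ts{};
    zeEventQueryKernelTimestamp(event, &ts);
    // Raw device ticks; scale by the device's timer resolution to get wall time.
    return ts.global.kernelEnd - ts.global.kernelStart;
}
```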

## memory_benchmark

Memory Benchmark is a set of tests aimed at measuring the bandwidth of memory transfers. A stream kernel sketch follows the table below.

Test name Description Params L0 OCL
CopyBuffer Allocates two OpenCL buffers and measures copy bandwidth between them. Buffers will be placed in device memory, if it's available.
  • --compressedDestination Select if the destination buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --compressedSource Select if the source buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --contents Contents of the buffers (Zeros or Random)
  • --size Size of the buffers
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
CopyBufferRect Allocates two OpenCL buffers and measures rectangle copy bandwidth between them. Buffers will be placed in device memory, if it's available.
  • --dstCompressed Select if the destination buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --origin Origin of the rectangle
  • --rPitch Row pitch of the rectangle
  • --region Size of the rectangle
  • --sPitch Slice pitch of the rectangle
  • --size Size of the buffer
  • --srcCompressed Select if the source buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
✔️
CopyEntireImage Allocates two image objects and measures copy bandwidth between them. Images will be placed in device memory, if it's available.
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --size Size of the image
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
FillBuffer Allocates an OpenCL buffer and measures fill bandwidth. The buffer will be placed in device memory, if it's available.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --contents Contents of the buffer (Zeros or Random)
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --patternSize Size of the fill pattern
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
MapBuffer Allocates an OpenCL buffer and measures map bandwidth. A mapping operation means a memory transfer from GPU to CPU or a no-op, depending on the map flags.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --contents Contents of the buffer (Zeros or Random)
  • --mapFlags OpenCL map flags passed during memory mapping (Read or Write or WriteInvalidate)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
ReadBuffer Allocates an OpenCL buffer and measures read bandwidth. A read operation means a transfer from GPU to CPU.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --contents Contents of the buffer (Zeros or Random)
  • --reuse How the hostptr allocation can be reused due to previous operations (None or Usm or Map)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
ReadBufferMisaligned Allocates an OpenCL buffer and measures read bandwidth. A read operation means a transfer from GPU to CPU. The destination pointer passed by the application will be misaligned by the specified number of bytes.
  • --misalignment Number of bytes by which the destination pointer will be misaligned
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
ReadBufferRect Allocates an OpenCL buffer and measures rectangle read bandwidth. A rectangle read operation means a transfer from GPU to CPU.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --origin Origin of the rectangle
  • --rPitch Row pitch of the rectangle
  • --region Size of the rectangle
  • --sPitch Slice pitch of the rectangle
  • --size Size of the buffer
✔️
ReadDeviceMemBuffer Allocates two OpenCL buffers and measures source buffer read bandwidth. The source buffer resides in device memory.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --size Size of the buffer
✔️
RemoteAccessMemory Uses stream memory triad to measure bandwidth with different percentages of remote memory access.
  • --remoteFraction Fraction of remote memory accesses (1 / n)
  • --size Size of the memory to stream. Must be divisible by the datatype size.
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --workItemSize Number of work items grouped together for the remote check
✔️
SLM_DataAccessLatency Generates SLM local memory transactions inside a thread group to measure the latency between reads (uses the Intel-specific intel_get_cycle_counter()).
  • --direction Write or read mode (0 or 1)
  • --occupancyDiv Hardware load divider: by 8, 4, 2, or full occupancy
  • --size SLM size
✔️
SlmSwitchLatency Enqueues 2 kernels with different SLM sizes. Measures the switch time between these kernels.
  • --firstSlmSize Size of the shared local memory per thread group for the first kernel
  • --secondSlmSize Size of the shared local memory per thread group for the second kernel
  • --wgs Size of the work group
✔️
StreamAfterTransfer Measures how stream kernels perform right after a host-to-device transfer populates the data. The test cleans the caches, then emits the transfers, then follows with the stream kernel and measures its GPU execution time.
  • --size Size of the memory to stream. Must be divisible by the datatype size.
  • --type Memory streaming type (Read or Write or Scale or Triad)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
StreamMemory Streams memory inside a kernel in a fashion described by 'type'. Copy means one memory location is read from and a second one is written to. Triad means two buffers are read and one is written to. In read and write modes, memory is only read or only written.
  • --size Size of the memory to stream. Must be divisible by the datatype size.
  • --type Memory streaming type (Read or Write or Scale or Triad)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
StreamMemoryImmediate Streams memory inside a kernel in a fashion described by 'type' using an immediate command list. Copy means one memory location is read from and a second one is written to. Triad means two buffers are read and one is written to. In read and write modes, memory is only read or only written.
  • --size Size of the memory to stream. Must be divisible by the datatype size.
  • --type Memory streaming type (Read or Write or Scale or Triad)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UnmapBuffer Allocates an OpenCL buffer and measures unmap bandwidth. An unmapping operation means a memory transfer from CPU to GPU or a no-op, depending on the map flags.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --contents Contents of the buffer (Zeros or Random)
  • --mapFlags OpenCL map flags passed during memory mapping (Read or Write or WriteInvalidate)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmCopy Allocates two unified shared memory buffers and measures copy bandwidth between them.
  • --contents Contents of the buffers (Zeros or Random)
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --reuseCmdList Command list is reused between iterations (0 or 1)
  • --size Size of the buffer
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
UsmCopyImmediate Allocates two unified shared memory buffers and measures copy bandwidth between them using an immediate command list.
  • --contents Contents of the buffers (Zeros or Random)
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --size Size of the buffer
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmCopyMultipleBlits Allocates two unified shared memory buffers, divides them into chunks, copies each chunk using a different copy engine and measures bandwidth. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidth is calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided among the remaining copy engines; otherwise the division is even.
  • --blitters A bit mask for selecting copy engines
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --size Size of the operation processed by each engine
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
✔️ ✔️
UsmCopyStagingBuffers Measures copy time from device/host to host/device. Host memory is a non-USM allocation. The copy is done through staging USM buffers; the non-USM host pointer is never passed to the L0 API, only through staging buffers.
  • --chunks Number of chunks the buffer is split into
  • --dst Memory placement of the destination (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --size Size of the buffer
✔️
UsmFill Allocates a unified memory buffer and measures fill bandwidth.
  • --contents Contents of the buffer (Zeros or Random)
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --memory Placement of the buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --patternContents Select contents of the fill pattern (Zeros or Random)
  • --patternSize Size of the fill pattern
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
UsmFillImmediate Allocates a unified memory buffer and measures fill bandwidth using an immediate command list.
  • --contents Contents of the buffer (Zeros or Random)
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --memory Placement of the buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --patternContents Select contents of the fill pattern (Zeros or Random)
  • --patternSize Size of the fill pattern
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmFillMultipleBlits Allocates a unified shared memory buffer, divides it into chunks, fills each chunk using a different copy engine and measures bandwidth. Refer to UsmCopyMultipleBlits for more details.
  • --blitters A bit mask for selecting copy engines
  • --memory Placement of the buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --patternContents Select contents of the fill pattern (Zeros or Random)
  • --patternSize Size of the fill pattern
  • --size Size of the operation processed by each engine
✔️ ✔️
UsmFillSpecificPattern Allocates a unified memory buffer and measures fill bandwidth. Allows specifying an arbitrary pattern.
  • --contents Contents of the buffer (Zeros or Random)
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --memory Placement of the buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --pattern The fill pattern represented hexadecimally, e.g. 0x91ABCD1254
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
UsmImmediateCopyMultipleBlits Allocates two unified shared memory buffers, divides them into chunks, copies each chunk using a different copy engine with an immediate command list and measures bandwidth. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidth is calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided among the remaining copy engines; otherwise the division is even.
  • --blitters A bit mask for selecting copy engines
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --size Size of the operation processed by each engine
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
✔️
UsmMemset Allocates a unified memory buffer and measures memset bandwidth.
  • --contents Contents of the buffer (Zeros or Random)
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --memory Placement of the buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmSharedMigrateCpu Allocates a unified shared memory buffer and measures bandwidth for a kernel that must migrate the resource from GPU to CPU.
  • --accessAllBytes Select whether the entire resource or only one byte will be accessed on the CPU (0 or 1)
  • --size Size of the buffer
✔️ ✔️
UsmSharedMigrateGpu Allocates a unified shared memory buffer and measures bandwidth for a kernel that must migrate the resource from CPU to GPU.
  • --prefetch Explicitly migrate the shared allocation to the device associated with the command queue (0 or 1)
  • --size Size of the buffer
✔️ ✔️
UsmSharedMigrateGpuForFill Allocates a unified shared memory buffer and measures bandwidth for a memory fill operation that must migrate the resource from CPU to GPU.
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --prefetch Explicitly migrate the shared allocation to the device associated with the command queue (0 or 1)
  • --size Size of the buffer
✔️ ✔️
WriteBuffer Allocates an OpenCL buffer and measures write bandwidth. A write operation means a transfer from CPU to GPU.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --contents Contents of the buffer (Zeros or Random)
  • --reuse How the hostptr allocation can be reused due to previous operations (None or Usm or Map)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
WriteBufferRect Allocates an OpenCL buffer and measures rectangle write bandwidth. A rectangle write operation means a transfer from CPU to GPU.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --origin Origin of the rectangle
  • --rPitch Row pitch of the rectangle
  • --region Size of the rectangle
  • --sPitch Slice pitch of the rectangle
  • --size Size of the buffer
✔️
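For reference, the Triad streaming pattern mentioned above in OpenCL C; this is a standard STREAM-style kernel, not the benchmark's exact source:

```c
// Two reads (a, b) and one write (c) per element, so the reported
// bandwidth for a buffer of S bytes is roughly 3 * S / time.
__kernel void stream_triad(__global const float *a,
                           __global const float *b,
                           __global float *c,
                           float scalar) {
    size_t i = get_global_id(0);
    c[i] = a[i] + scalar * b[i];
}
```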

## miscellaneous_benchmark

Miscellaneous Benchmark is a set of tests measuring different simple compute scenarios. A reduction kernel sketch follows the table below.

Test name Description Params L0 OCL
IoqKernelSwitchLatency Measures the time from the end of one kernel to the start of the next kernel for an in-order queue.
  • --kernelCount Count of kernels
  • --useEvents Use events to synchronize between kernels (0 or 1)
✔️
KernelWithWork Measures the time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value. The benchmark checks the impact of splitting the kernel.
  • --split How many times the kernel is split
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --usedIds Which of the get_global_id() and get_local_id() calls will be used in the kernel (None or Global or Local or AtomicPerWkg)
  • --wgc Workgroup count
  • --wgs Workgroup size (aka local work size)
✔️
Reduction Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location.
  • --numberOfElements Number of elements that will be reduced
✔️
Reduction2 Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location.
  • --numberOfElements Number of elements that will be reduced
✔️
Reduction3 Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location.
  • --numberOfElements Number of elements that will be reduced
✔️
Reduction4 Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location.
  • --numberOfElements Number of elements that will be reduced
✔️
Reduction5 Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location.
  • --numberOfElements Number of elements that will be reduced
✔️
VectorSum Performs vector addition.
  • --numberOfElementsX Number of elements in the X dimension
  • --numberOfElementsY Number of elements in the Y dimension
  • --numberOfElementsZ Number of elements in the Z dimension
✔️
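The reduction pattern described above boils down to a kernel like this sketch (hypothetical name; `result` is assumed to be zero-initialized):

```c
// Every work item adds its input element into one shared location.
__kernel void reduce_atomic(__global const int *input, __global int *result) {
    atomic_add(result, input[get_global_id(0)]);
}
```

The numbered variants (Reduction2 through Reduction5) are described identically above; the sketch illustrates only the shared atomic_add pattern they all name.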

## multithread_benchmark

Multithread Benchmark is a set of tests aimed at measuring how different commands benefit from multithreaded execution. A submission sketch follows the table below.

Test name Description Params L0 OCL
ImmediateCommandListCompletion Measures completion latency of AppendMemoryCopy issued from multiple threads to immediate command lists. Engines to be used for submissions are selected based on the enabled bits of engineMask. 'threadsPerEngine' threads submit commands to each selected engine. If 'numberOfThreads' is greater than 'threadsPerEngine' x selected engine count, the excess threads are assigned to the selected engines one each, in a round-robin manner. If the selected engine count == 1, all threads are assigned to that engine.
  • --copySize Copy size in bytes
  • --engineGroup Engine group to be used
  • --engineMask Bit mask for selecting engines to be used for submission
  • --numberOfThreads Total number of threads
  • --threadsPerEngine Number of threads submitting commands to each engine
✔️
ImmediateCommandListSubmission Measures submission latency of AppendLaunchKernel issued from multiple threads to immediate command lists. 'threadsPerEngine' threads submit commands to each engine. If 'numberOfThreads' is greater than 'threadsPerEngine' x engine count, the excess threads are assigned to engines one each, in a round-robin manner. If the engine count == 1, all threads are assigned to that engine.
  • --numberOfThreads Total number of threads
  • --threadsPerEngine Number of threads submitting commands to each engine
✔️
SvmCopy Enqueues multiple SVM copies on multiple threads concurrently.
  • --numberOfThreads Number of threads that will run concurrently
✔️ ✔️
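A minimal sketch of the threads-submitting-to-immediate-command-lists pattern, assuming one pre-created immediate command list per thread and a pre-created kernel (the helper name is illustrative):

```cpp
#include <thread>
#include <vector>
#include <level_zero/ze_api.h>

void submitFromThreads(const std::vector<ze_command_list_handle_t> &cmdLists,
                       ze_kernel_handle_t kernel) {
    ze_group_count_t groupCount{1u, 1u, 1u};
    std::vector<std::thread> threads;
    for (ze_command_list_handle_t cmdList : cmdLists) {
        threads.emplace_back([=] {
            // On an immediate command list, the append itself submits the work.
            zeCommandListAppendLaunchKernel(cmdList, kernel, &groupCount,
                                            nullptr, 0, nullptr);
        });
    }
    for (auto &t : threads) t.join();
}
```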

## multitile_memory_benchmark

Multi-tile Memory Benchmark is a set of tests aimed at measuring the bandwidth of memory transfers performed on a multi-tile device. A sub-device enumeration sketch follows the table below.

Test name Description Params L0 OCL
CopyBuffer Allocates two OpenCL buffers and measures copy bandwidth between them. Buffers will be placed in device memory, if it's available.
  • --context How the context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --dst Placement of memory for the destination buffer (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --dstCompressed Select if the destination buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --size Size of the buffers
  • --src Placement of memory for the source buffer (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --srcCompressed Select if the source buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
FillBuffer Allocates an OpenCL buffer and measures fill bandwidth. The buffer will be placed in device memory, if it's available.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --context How the context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --memory Placement of memory for the buffer (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --patternSize Size of the fill pattern
  • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
ReadBuffer Allocates an OpenCL buffer and measures read bandwidth. A read operation means a transfer from GPU to CPU.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --context How the context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --memory Placement of memory for the buffer (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmBidirectionalCopy Allocates two unified device memory buffers, each on a different tile, and measures copy bandwidth between them. The test measures copies in two directions, which can be controlled with the --write parameter. With --write=1, each tile performs a write operation: the queue is placed in tile 0, the source buffer is in tile 0, and the destination is in tile 1; similarly for tile 1, the queue is placed in tile 1, the source in tile 1, and the destination in tile 0. With --write=0, destination and source are flipped: the queue is placed in tile 0, the source buffer is in tile 1, and the destination is in tile 0, while for tile 1 the queue is placed in tile 1, the source in tile 0, and the destination in tile 1.
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --size Size of the buffers
  • --write Which operation is used, write or read (0 or 1)
✔️
UsmCopy Allocates two unified shared memory buffers and measures copy bandwidth between them using a builtin function.
  • --context How the context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --dst Placement of memory for the destination buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --size Size of the buffers
  • --src Placement of memory for the source buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
UsmCopyImmediate Allocates two unified shared memory buffers and measures copy bandwidth between them using a builtin function appended to an immediate command list.
  • --context How the context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --dst Placement of memory for the destination buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --size Size of the buffers
  • --src Placement of memory for the source buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmCopyKernel Allocates two unified shared memory buffers and measures copy bandwidth between them using a custom kernel.
  • --context How the context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --dst Placement of memory for the destination buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --size Size of the buffers
  • --src Placement of memory for the source buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
UsmFill Allocates a unified shared memory buffer and measures fill bandwidth.
  • --context How the context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --forceBlitter Force the blitter engine. The test will be skipped if the device does not support the blitter. Warning: in OpenCL the blitter may still be used even if not forced (0 or 1)
  • --memory Placement of memory for the buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --patternSize Size of the fill pattern
  • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
UsmSharedMigrateCpu Allocates a unified shared memory buffer and measures the time to migrate it from GPU to CPU.
  • --accessAllBytes Select whether the entire resource or only one byte will be accessed on the CPU (0 or 1)
  • --context How the context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --memory Placement of memory for the buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --size Size of the buffer
✔️ ✔️
UsmSharedMigrateGpu Allocates a unified shared memory buffer and measures the time to migrate it from CPU to GPU.
  • --context How the context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --memory Placement of memory for the buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --size Size of the buffer
✔️ ✔️
WriteBuffer Allocates an OpenCL buffer and measures write bandwidth. A write operation means a transfer from CPU to GPU.
  • --compressed Select if the buffer is to be compressed. Will be skipped if the device does not support compression (0 or 1)
  • --context How the context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --memory Placement of memory for the buffer (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
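The Tile0..Tile3 placements above correspond to Level Zero sub-devices. A minimal sketch of how a multi-tile device can be enumerated (the helper name is illustrative):

```cpp
#include <vector>
#include <level_zero/ze_api.h>

std::vector<ze_device_handle_t> getTiles(ze_device_handle_t rootDevice) {
    uint32_t count = 0;
    zeDeviceGetSubDevices(rootDevice, &count, nullptr);      // query the tile count
    std::vector<ze_device_handle_t> tiles(count);
    zeDeviceGetSubDevices(rootDevice, &count, tiles.data()); // fetch the handles
    return tiles;  // empty on a single-tile device
}
```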

## overlap_benchmark

Overlap Benchmark is a set of tests aimed at measuring how different commands benefit from simultaneous execution. A two-queue sketch follows the table below.

Test name Description Params L0 OCL
KernelAndCopy Enqueues a kernel and a copy operation, with the ability to perform the two tasks on different command queues.
  • --runCopy Enqueue a buffer-to-buffer copy during each iteration (0 or 1)
  • --runKernel Enqueue a kernel during each iteration (0 or 1)
  • --twoQueues Enables using separate queues for the two operations. Must be used with runCopy and runKernel (0 or 1)
  • --useCopyQueue Use a specialized copy queue for the copy operation. Must be used with runCopy (0 or 1)
✔️
MultiProcessCompute Creates a number of separate processes on each specified tile, each performing a compute workload, and measures the average time to complete all of them. Processes use an affinity mask to select specific sub-devices for execution.
  • --opsPerKernel Operations performed in the kernel, used to steer its execution time
  • --processesPerTile Number of processes that will be started on each of the specified tiles
  • --synchronize Synchronize all processes before each iteration (0 or 1)
  • --tiles Tiles for execution (Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --workgroupsPerProcess Number of workgroups that each process will start
✔️
MultiProcessComputeSharedBuffer Creates a number of separate processes on each specified tile, each performing a compute workload, and measures the average time to complete all of them. Processes use an affinity mask to select specific sub-devices for execution. A single buffer per tile is created by the parent process; all processes executing on a given tile share it via IPC calls.
  • --processesPerTile Number of processes that will be started on each of the specified tiles
  • --synchronize Synchronize all processes before each iteration (0 or 1)
  • --tiles Tiles for execution (Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
  • --workgroupsPerProcess Number of workgroups that each process will start
✔️
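A minimal OpenCL sketch of the two-queue idea behind KernelAndCopy, assuming all handles were created beforehand (the helper is hypothetical, not the benchmark's code):

```c
#include <CL/cl.h>

// Overlap one kernel with one buffer-to-buffer copy by enqueuing them
// on separate command queues, then waiting for both.
void runKernelAndCopy(cl_command_queue computeQueue, cl_command_queue copyQueue,
                      cl_kernel kernel, cl_mem src, cl_mem dst,
                      size_t gws, size_t bytes) {
    clEnqueueNDRangeKernel(computeQueue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueCopyBuffer(copyQueue, src, dst, 0, 0, bytes, 0, NULL, NULL);
    clFinish(computeQueue);  // wait for the kernel
    clFinish(copyQueue);     // wait for the copy
}
```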

## p2p_benchmark

P2P Benchmark is a set of tests aimed at measuring the bandwidth and latency of memory transfers between peer devices. A peer-access check sketch follows the table below.

Test name Description Params L0 OCL
UsmCopyMultipleBlits Allocates two unified device memory buffers on separate devices, performs copies between sections (chunks) of them, each using a different copy engine, and measures bandwidth. The test first checks for P2P capabilities on the target platform before submitting the copy. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidth is calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided among the remaining copy engines; otherwise the division is even.
  • --blitters A bit mask for selecting copy engines
  • --dstDeviceId Destination device
  • --size Size of the operation processed by each engine
  • --srcDeviceId Source device
✔️
UsmEUCopy Allocates two unified device memory buffers on separate devices, performs a copy between them using a compute engine, and reports bandwidth. The test first checks for P2P capabilities on the target platform before submitting the copy.
  • --contents Contents of the buffers (Zeros or Random)
  • --dstDeviceId Destination device
  • --reuseCmdList Command list is reused between iterations (0 or 1)
  • --size Size of the buffer
  • --srcDeviceId Source device
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmImmediateCopyMultipleBlits Allocates two unified device memory buffers on separate devices, performs copies between sections (chunks) of them, each using a different copy engine with an immediate command list, and measures bandwidth. The test first checks for P2P capabilities on the target platform before submitting the copy. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidth is calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided among the remaining copy engines; otherwise the division is even.
  • --blitters A bit mask for selecting copy engines
  • --dstDeviceId Destination device
  • --size Size of the operation processed by each engine
  • --srcDeviceId Source device
✔️
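The P2P capability check these tests perform maps to a single Level Zero query; a minimal sketch (the helper name is illustrative):

```cpp
#include <level_zero/ze_api.h>

// Returns true if srcDevice can directly access dstDevice's memory,
// i.e. the peer-to-peer copy is worth submitting at all.
bool canAccessPeer(ze_device_handle_t srcDevice, ze_device_handle_t dstDevice) {
    ze_bool_t canAccess = 0;
    zeDeviceCanAccessPeer(srcDevice, dstDevice, &canAccess);
    return canAccess != 0;
}
```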

## ulls_benchmark

ULLS Benchmark is a set of tests aimed at measuring the performance impact of Ultra Low Latency Submission (ULLS).

            Test name Description Params L0 OCL
            BestSubmission enqueues a system memory write via PIPE_CONTROL and measures when update becomes visible on the CPU.
              ✔️
              BestWalkerSubmission enqueues kernel, which updates system memory location and then busy-loops on CPU until the update becomes visible.
                ✔️ ✔️
                BestWalkerSubmissionImmediate enqueues kernel, which updates system memory location and then busy-loops on CPU until the update becomes visible. Kernel is enqueued using low-latency immediate command list, so the test is LevelZero-specific.
                  ✔️
                  BestWalkerSubmissionImmediateMultiCmdlists Append N kernels on N cmdlists, which updates system memory locations and then waits using busy-loop on CPU until the update becomes visible. Kernels are appended using immediate command lists.Amount of command lists is specified by cmdlistCount.
                  • --cmdlistCount Count of command lists
                  ✔️
CompletionLatency enqueues a system memory write and measures the time between the moment the update is visible on the CPU and the moment the synchronizing call returns.
                    ✔️
CopySubmissionEvents enqueues a 4-byte copy on a copy engine and reports the submission delta, i.e. the time between the host API call and the copy engine starting.
                    • --engine Engine used for copying (RCS or CCS0 or CCS1 or CCS2 or CCS3 or BCS or BCS1 or BCS2 or BCS3 or BCS4 or BCS5 or BCS6 or BCS7 or BCS8)
                    ✔️ ✔️
EmptyKernel enqueues an empty kernel and measures the time to launch it and wait for it on the CPU, thus measuring walker spawn time.
                    • --wgc Workgroup count
                    • --wgs Workgroup size (aka local work size)
                    ✔️ ✔️
EmptyKernelImmediate enqueues an empty kernel and measures the time to launch it using an immediate command list and wait for it on the CPU, thus measuring walker spawn time.
                    • --wgc Workgroup count
                    • --wgs Workgroup size (aka local work size)
                    ✔️
KernelSwitchLatency measures the time from the end of one kernel to the start of the next kernel (see the timestamp sketch below)
• --barrier Synchronize with a barrier instead of events (0 or 1)
                    • --flush Flush between kernels (0 or 1)
• --hostVisible Events are created with the host-visible flag (0 or 1)
                    • --kernelCount Count of kernels
                    ✔️ ✔️
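With Level Zero timestamp events, the switch latency can be derived from per-kernel timestamps; the sketch below is a simplification, assuming event0 and event1 signal two consecutive kernels:

```c++
#include <level_zero/ze_api.h>
#include <cstdint>

// Sketch: gap between two kernels derived from their timestamp events.
// event0 signals the first kernel, event1 the second (setup omitted).
uint64_t kernelSwitchTicks(ze_event_handle_t event0, ze_event_handle_t event1) {
    ze_kernel_timestamp_result_t ts0 = {};
    ze_kernel_timestamp_result_t ts1 = {};
    zeEventQueryKernelTimestamp(event0, &ts0);
    zeEventQueryKernelTimestamp(event1, &ts1);
    // Gap in device ticks; scale by the device timer resolution
    // (ze_device_properties_t::timerResolution) to convert to wall time.
    return ts1.global.kernelStart - ts0.global.kernelEnd;
}
```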
KernelSwitchLatencyImmediate measures the time from the end of one kernel to the start of the next kernel, using immediate command lists
• --barrier Synchronize with a barrier instead of events (0 or 1)
• --hostVisible Events are created with the host-visible flag (0 or 1)
                    • --kernelCount Count of kernels
                    ✔️
KernelWithWork measures the time required to run a GPU kernel that assigns constant values to elements of a buffer; each thread assigns one value (a hypothetical kernel source is sketched below).
                    • --usedIds Which of the get_global_id() and get_local_id() calls will be used in the kernel (None or Global or Local or AtomicPerWkg)
                    • --wgc Workgroup count
                    • --wgs Workgroup size (aka local work size)
                    ✔️ ✔️
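For illustration, a hypothetical kernel source for the Global variant of --usedIds, held as a C++ string the way a host program would hand it to the runtime (the benchmark's actual kernels may differ):

```c++
// Hypothetical kernel source for --usedIds Global: each work item writes a
// constant to the element selected by its global id.
const char *fillKernelSource = R"CLC(
__kernel void fill(__global int *buffer) {
    buffer[get_global_id(0)] = 1;
}
)CLC";
```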
KernelWithWorkImmediate measures the time required to run a GPU kernel that assigns constant values to elements of a buffer, using an immediate command list; each thread assigns one value.
                    • --usedIds Which of the get_global_id() and get_local_id() calls will be used in the kernel (None or Global or Local or AtomicPerWkg)
                    • --wgc Workgroup count
                    • --wgs Workgroup size (aka local work size)
                    ✔️
MultiProcessImmediateCmdlistCompletion measures completion latency of AppendMemoryCopy issued from multiple processes to immediate command lists. Engines used for submission are selected by the enabled bits of engineMask, which are indexed from right to left: the rightmost bit represents the first engine and the leftmost the last engine (see the mask-decoding sketch below). 'processesPerEngine' processes submit commands to each selected engine. If 'numberOfProcesses' is greater than 'processesPerEngine' x the selected engine count, the excess processes are assigned to the selected engines one each, in a round-robin manner. If the selected engine count is 1, all processes are assigned to that engine.
• --copySize Copy size in bytes
• --engineGroup Engine group to be used
• --engineMask Bit mask for selecting engines to be used for submission
• --numberOfProcesses Total number of processes
• --processesPerEngine Number of processes submitting commands to each engine
                    ✔️
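The mask decoding described above amounts to plain bit tests; the sketch below is illustrative only, with hypothetical names:

```c++
#include <cstdint>
#include <vector>

// Hypothetical sketch of --engineMask decoding: bit i, counted from the
// rightmost bit, selects engine i of the chosen engine group.
std::vector<uint32_t> selectEngines(uint64_t engineMask, uint32_t engineCount) {
    std::vector<uint32_t> selected;
    for (uint32_t i = 0; i < engineCount; ++i) {
        if (engineMask & (1ull << i)) {
            selected.push_back(i);
        }
    }
    return selected;
}
```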
MultiProcessImmediateCmdlistSubmission measures submission latency of a walker command issued from multiple processes to immediate command lists. 'processesPerEngine' processes submit commands to each engine. If 'numberOfProcesses' is greater than 'processesPerEngine' x the engine count, the excess processes are assigned to engines one each, in a round-robin manner. If the engine count is 1, all processes are assigned to that engine.
• --numberOfProcesses Total number of processes
• --processesPerEngine Number of processes submitting commands to each engine
                    ✔️
MultiQueueSubmission enqueues a kernel on multiple command queues.
                    • --queueCount Number of command queues created
                    • --wgc Workgroup count
                    • --wgs Workgroup size
                    ✔️ ✔️
NewResourcesSubmissionDevice enqueues a kernel that uses a buffer placed in device memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation.
                    • --size Size of the buffer
                    ✔️ ✔️
NewResourcesSubmissionHost enqueues a kernel that uses a buffer placed in host memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation.
                    • --size Size of the buffer
                    ✔️ ✔️
NewResourcesWithGpuAccess enqueues a kernel that accesses an entire buffer placed in device memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation.
                    • --size Size of the buffer
                    ✔️ ✔️
QueuePriorities Uses queues with different priorities to measure submission and context-switch latencies (see the sketch below)
• --highTime How long each work item runs in the high-priority kernel
• --lowTime How long each work item runs in the low-priority kernel
• --priorities Use the low-priority command queue property (0 or 1)
• --sleep Sleep time in us after the low-priority kernel is flushed
• --wgc Workgroup count of the high-priority kernel
                    ✔️
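The low-priority case corresponds to a Level Zero queue priority flag; a minimal sketch, with context and device setup assumed:

```c++
#include <level_zero/ze_api.h>

// Sketch: create a low-priority command queue for the --priorities 1 case.
ze_command_queue_handle_t createLowPriorityQueue(ze_context_handle_t context,
                                                 ze_device_handle_t device) {
    ze_command_queue_desc_t desc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC};
    desc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
    desc.priority = ZE_COMMAND_QUEUE_PRIORITY_PRIORITY_LOW;
    ze_command_queue_handle_t queue = nullptr;
    zeCommandQueueCreate(context, device, &desc, &queue);
    return queue;
}
```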
ResourceReassign Enqueues a stress kernel that utilizes the majority of the GPU's execution units, then enqueues the next kernel and measures its execution time. Shows the overhead related to releasing and reassigning GPU resources.
• --queueCount Number of different command queues submitted to after the stress kernel
                    ✔️
RoundTripSubmission enqueues a kernel that updates a system memory location and waits for it with a synchronizing API.
                      ✔️ ✔️
UsmSharedFirstCpuAccess allocates a unified shared memory buffer and measures the time of the first CPU access after creation (see the sketch below).
                      • --initialPlacement Hint for initial placement of the resource passed to the driver (Any or Host or Device)
                      • --size Size of the buffer
                      ✔️ ✔️
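A sketch of the allocation plus first CPU touch follows; the bias flag shown is one way to express the Device placement hint in Level Zero, and setup, timing, and error handling are omitted:

```c++
#include <level_zero/ze_api.h>

// Sketch: allocate shared USM with a device-placement hint, then perform the
// first CPU access (which is what the test times; timing code omitted).
void *allocSharedWithDeviceHint(ze_context_handle_t context,
                                ze_device_handle_t device, size_t size) {
    ze_device_mem_alloc_desc_t deviceDesc = {ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC};
    deviceDesc.flags = ZE_DEVICE_MEM_ALLOC_FLAG_BIAS_INITIAL_PLACEMENT; // --initialPlacement Device
    ze_host_mem_alloc_desc_t hostDesc = {ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC};
    void *shared = nullptr;
    zeMemAllocShared(context, &deviceDesc, &hostDesc, size, 0 /*alignment*/, device, &shared);
    static_cast<char *>(shared)[0] = 1; // first CPU touch after creation
    return shared;
}
```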
UsmSharedFirstGpuAccess allocates a unified shared memory buffer and measures the time of the first GPU access after creation.
                      • --initialPlacement Hint for initial placement of the resource passed to the driver (Any or Host or Device)
                      • --size Size of the buffer
                      ✔️ ✔️
WalkerCompletionLatency enqueues a kernel writing to system memory and measures the time between the moment the update is visible on the CPU and the moment the synchronizing call returns (the fence variant is sketched below).
• --useFence Use a fence during submission and for completion (0 or 1)
                      ✔️ ✔️
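The --useFence 1 path boils down to executing with a fence and synchronizing on it; a minimal sketch with queue and command list setup assumed:

```c++
#include <level_zero/ze_api.h>
#include <cstdint>

// Sketch: submit with a fence and wait on it (--useFence 1).
void executeAndWaitWithFence(ze_command_queue_handle_t queue,
                             ze_command_list_handle_t cmdList) {
    ze_fence_desc_t fenceDesc = {ZE_STRUCTURE_TYPE_FENCE_DESC};
    ze_fence_handle_t fence = nullptr;
    zeFenceCreate(queue, &fenceDesc, &fence);
    zeCommandQueueExecuteCommandLists(queue, 1, &cmdList, fence);
    zeFenceHostSynchronize(fence, UINT64_MAX); // the completion wait the test times
    zeFenceDestroy(fence);
}
```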
WalkerSubmissionEvents enqueues an empty kernel with GPU-side profiling and checks the delta between queue time and start time (see the profiling sketch below).
                        ✔️ ✔️
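On the OpenCL side this delta is available from event profiling counters; a sketch assuming a queue created with CL_QUEUE_PROFILING_ENABLE and a completed kernel event:

```c++
#include <CL/cl.h>

// Sketch: queued-to-start delta from OpenCL event profiling counters.
// kernelEvent must come from a queue created with CL_QUEUE_PROFILING_ENABLE.
cl_ulong submissionDeltaNs(cl_event kernelEvent) {
    cl_ulong queued = 0;
    cl_ulong start = 0;
    clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_QUEUED,
                            sizeof(queued), &queued, nullptr);
    clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, nullptr);
    return start - queued; // nanoseconds from host enqueue to execution start
}
```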
WriteLatency unblocks an event on the GPU, then waits for the timestamp to be written.
                          ✔️