
DSA Perfmon

nikhilprao edited this page Feb 21, 2023 · 9 revisions


The IDXD Linux driver supports 'perfmon' counters as described in the DSA spec.

The perfmon support is designed to be used with the kernel's userspace 'perf' tool. This page first lists the kernel configuration options needed to enable perfmon support, then gives sample command lines to help get started with the DSA perfmon counters.

Setup

Enable CONFIG_INTEL_IDXD_PERFMON in addition to the options listed below, which enable the DSA driver and SVM support.

CONFIG_IRQ_REMAP=y
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_SVM=y
CONFIG_PCI_ATS=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_INTEL_IDXD=m
CONFIG_INTEL_IDXD_SVM=y
CONFIG_DMA_ENGINE=m
CONFIG_DMATEST=m
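
As a quick sanity check before trying the examples, the sketch below scans a kernel config file for a subset of the options above plus CONFIG_INTEL_IDXD_PERFMON. The function name `check_idxd_config` is hypothetical, and the config file location shown in the usage comment is an assumption that varies by distribution.

```shell
# check_idxd_config FILE
# Prints one line per option and returns non-zero if any option is missing.
# Hypothetical helper; an enabled option is expected to appear as OPT=y or OPT=m.
check_idxd_config() {
    cfg="$1"
    missing=0
    for opt in CONFIG_INTEL_IOMMU CONFIG_INTEL_IOMMU_SVM CONFIG_INTEL_IDXD \
               CONFIG_INTEL_IDXD_SVM CONFIG_INTEL_IDXD_PERFMON; do
        if grep -Eq "^${opt}=(y|m)$" "$cfg"; then
            echo "ok       $opt"
        else
            echo "MISSING  $opt"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (path is an assumption):
#   check_idxd_config "/boot/config-$(uname -r)"
```

Once the driver is loaded with perfmon enabled, the PMU used in the examples should also be visible to perf, typically as an entry named dsa0 under /sys/bus/event_source/devices/.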

Enable a shared WQ with four engines

./scripts/setup_dsa.sh -d dsa0 -w1 -ms -e4

Then run the command below

perf stat -e dsa0/event=0x8,event_category=0x3/ -a ./src/dsa_perf_micros -n1 -K[1]@dsa0,1 -i0

Because the command above submits a single descriptor, perf reports a count of 1

Performance counter stats for 'system wide':

1 dsa0/event=0x8,event_category=0x3/

Enqueue Retries

SWQ descriptor writes use the ENQCMD instruction. ENQCMD returns a retry response if the WQ is full or if the number of descriptors in the WQ has reached the WQ threshold.

Command line to generate ENQCMD retries

./src/dsa_perf_micros -jcf -i-1 -K[0-3]@dsa0,0 -s4k -n128 -o3

The command below can be run in a separate terminal while the application is running; it prints the number of enqueue retries per second.

perf stat -e dsa0/event=0x2,event_category=0x0/ -I 1000

Read/Write Bandwidth

With a shared WQ configured as described above, run the command below

./src/dsa_perf_micros -jcf -i-1 -K[0-3]@dsa0,0 -s4k -n32 -o3

In another terminal run the command below

perf stat -e dsa0/event=0x1,event_category=0x1/ -I 1000

You should see output similar to the following.
time counts unit events
1.001032947 962,916,304 dsa0/event=0x1,event_category=0x1/
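
For post-processing the per-interval output, a small filter like the one below can strip the thousands separators from the counts. This is only a sketch: the function name `perf_interval_counts` is hypothetical, and it assumes each data line has the layout shown above (timestamp, count, event name last).

```shell
# perf_interval_counts
# Reads `perf stat -I` output on stdin and prints "<time> <count> <event>"
# for each data line, with commas removed from the count.
# Header lines (first field is not a timestamp) are skipped.
perf_interval_counts() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { gsub(/,/, "", $2); print $1, $2, $NF }'
}

# Usage (perf stat writes its report to stderr, hence the redirect):
#   perf stat -e dsa0/event=0x1,event_category=0x1/ -I 1000 2>&1 | perf_interval_counts
```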

Page Faults

The command below counts:

  1. Number of translations (with pasid) with no page fault (a total of 3, 1 each for src, dst and completion record)
  2. Number of translations (with pasid) with page fault (none)

perf stat -e dsa0/event=0x1,event_category=0x2/,dsa0/event=0x2,event_category=0x2/ -a ./src/dsa_perf_micros -n1 -K[1]@dsa0,1 -i0 -o3 -jcf

The output is as below

Performance counter stats for 'system wide':

           3      dsa0/event=0x1,event_category=0x2/
           0      dsa0/event=0x2,event_category=0x2/

Operations/Sec

With a shared WQ configured as described above, run the command below

./src/dsa_perf_micros -jcf -i-1 -w1 -k0-3 -s4k -n32 -o3

In another terminal run the command below

perf stat -e dsa0/event=0x8,event_category=0x3/ -I 1000

The expected output is shown below. It corresponds to a memory copy bandwidth of roughly 30 GB/s, with 4 cores each submitting memmove transfers of size 4K and at most 32 transfers outstanding per core at any given time (about 7.5M operations/sec x 4K per operation = 30 GB/s).

time counts unit events
1.001113602 7,564,075 dsa0/event=0x8,event_category=0x3/
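
The conversion from operations/sec to bandwidth can be sketched as follows. The helper name `ops_to_gbps` is hypothetical; it assumes 4K means 4096 bytes and that 1 GB is 10^9 bytes.

```shell
# ops_to_gbps OPS_PER_SEC TRANSFER_BYTES
# Prints the bandwidth in GB/s (1 GB = 10^9 bytes), rounded to two decimals.
ops_to_gbps() {
    awk -v ops="$1" -v size="$2" 'BEGIN { printf "%.2f\n", ops * size / 1000000000 }'
}

# Example with the count reported above:
#   ops_to_gbps 7564075 4096    # prints 30.98
```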

Completions

The number of successful completions can be measured as shown below; it is equal to the number of operations submitted.

perf stat -e dsa0/event=0x8,event_category=0x3/,dsa0/event=0x20,event_category=0x4/ -a ./src/dsa_perf_micros -n1 -K[1]@dsa0,1 -i0 -o3

Performance counter stats for 'system wide':

             1      dsa0/event=0x8,event_category=0x3/
             1      dsa0/event=0x20,event_category=0x4/
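
If this check is scripted, the two counts can be compared automatically. The sketch below is a hypothetical helper (`dsa_counts_equal`) that assumes the final report layout shown above, with the count in the first column and the event name in the second.

```shell
# dsa_counts_equal
# Reads a final `perf stat` report on stdin. Succeeds only if at least two
# dsaN counter lines are present and all of them report the same count.
dsa_counts_equal() {
    awk '$2 ~ /^dsa[0-9]+\// {
             gsub(/,/, "", $1)      # drop thousands separators
             c = $1 + 0
             n++
             if (n == 1) first = c
             else if (c != first) bad = 1
         }
         END { if (n < 2 || bad) exit 1 }'
}

# Usage (perf stat writes its report to stderr, hence the redirect):
#   perf stat -e <events> -a <workload> 2>&1 | dsa_counts_equal && echo "counts match"
```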

Filters

DSA perfmon allows software to specify a set of filters that constrain the counting of selected events based on one or more conditions.

Enable a shared WQ with a single engine

./scripts/setup_dsa.sh -d dsa0 -w1 -ms -e1

The command below counts events from wq=0, tc=0, transfer size=4k, engine=0 for event_category=0x1, event=0x3 (total input data processed, in units of 32 bytes)

perf stat -e dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7,filter_eng=0x1,event=0x3,event_category=0x1/ -a ./src/dsa_perf_micros -n1 -K[1]@dsa0,0 -i0 -o3

Performance counter stats for 'system wide':

128 dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7,filter_eng=0x1,event=0x3,event_category=0x1/

An iteration count of 0 runs a single 4K descriptor (128 units of 32 bytes), hence a count of 128 is reported.
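
The two small calculations behind this example can be sketched in shell. Both helper names are hypothetical. Reading the WQ, TC, and engine filters as bitmasks (index 0 maps to 0x1) is an assumption based on the values used above; the encoding of filter_sz is not derived here.

```shell
# index_to_mask N
# Prints a hex mask with only bit N set (assumed filter encoding: index 0 -> 0x1).
index_to_mask() {
    printf '0x%x\n' $((1 << $1))
}

# bytes_to_units BYTES
# Prints the number of 32-byte units in BYTES, the unit used by event=0x3.
bytes_to_units() {
    echo $(($1 / 32))
}

# Examples:
#   index_to_mask 0        # prints 0x1
#   bytes_to_units 4096    # prints 128
```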
