# DMA Performance Benchmark
This document discusses benchmarking the performance of the Platform Direct Memory Access (P-DMA) Controller on PolarFire SoC. The document includes a complete set of benchmarking results, along with an accompanying synopsis.
The benchmarking application tests the performance of the P-DMA transferring data between a given source and destination. The possible source/destination combinations are any of the following memory locations:
- L2-LIM
- Scratchpad Memory
- Cached DDR
- Non-cached DDR
P-DMA performance is determined by several factors, principally: the source and destination of a transfer, the memory location where the program is executed from, whether transfer ordering is enforced, and the configuration of the L2 Cache when transferring to/from Cached DDR.
The program can be executed from any of the following memory locations:
- L2-LIM
- Scratchpad Memory
- Cached DDR
If transfer ordering is enforced, data is received at the destination in the same order that it was transmitted; otherwise, data can arrive out of order.
Every benchmark is run a total of 6 times: once for each of the 3 memory locations the program can be executed from, with force ordering both enabled and disabled.
All benchmarks use a single P-DMA channel.
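The six runs per benchmark follow from combining the three execution locations with the two force-ordering settings. The sketch below only enumerates those combinations; the names `EXEC_LOCATIONS`, `FORCE_ORDERING`, `run_configurations`, and `run_label` are hypothetical and do not come from the benchmarking application.

```python
from itertools import product

# Execution locations and ordering settings as listed in this document.
# All identifiers here are illustrative, not taken from the benchmark source.
EXEC_LOCATIONS = ["L2-LIM", "Scratchpad", "Cached DDR"]
FORCE_ORDERING = [False, True]


def run_configurations():
    """Return every (execution location, force ordering) pair."""
    return list(product(EXEC_LOCATIONS, FORCE_ORDERING))


def run_label(location, force_ordering):
    """Build a label in the style used by the 'Executing From' table column."""
    return f"{location} + Force Ordering" if force_ordering else location


if __name__ == "__main__":
    for location, force_ordering in run_configurations():
        print(run_label(location, force_ordering))
```

The labels produced this way correspond to the six rows of each results table.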
The P-DMA has 4 channels capable of transferring data independently of each other. As such the P-DMA can transfer data from a single source to multiple destinations, or from multiple sources to a single destination.
Increasing the number of P-DMA channels used in a transfer increases P-DMA throughput, primarily because the channels transfer data concurrently.
The throughput depends on: the number of channels used in a transfer, and the number of different memory locations the channels are transferring to/from.
Throughput is highest when multiple channels transfer data from a single source to a single destination, as opposed to multiple channels transferring data from a single source to multiple destinations, or from multiple sources to a single destination.
PolarFire MSS Detailed Block Diagram. PolarFire SoC MSS Technical Reference Manual, page 11.
The theoretical maximum rate of a transfer is determined by the width of the bus connecting a memory source and destination to the P-DMA, and by the processor clock rate. The processor clock rate is set using the MSS Configurator. The width of a given bus is an inherent property of the device (64b/128b in this case) and cannot be modified.
The L2-LIM and L2-Scratchpad are both parts of the L2 Cache controller. The cache controller and the P-DMA controller are both located within the CPU Core Complex, so the P-DMA can access both the LIM and the Scratchpad via the 128-bit TileLink bus.
Cached DDR is also accessible over the 128-bit TileLink bus and a 128-bit AXI4 bus, via the DDR Controller. Non-Cached DDR is accessible over the 128-bit TileLink bus and a 64-bit AXI4 bus. The L2 Cache and AXI4 buses are clocked at half the frequency of the CPU cores, which is 300 MHz with the default MSS Configurator settings.
Please refer to the PolarFire SoC Technical Reference Manual for further information on the P-DMA.
Using the above figures, the theoretical performance of the P-DMA when transferring between each of the 4 memory locations can be calculated as follows:
- For transfers using the TileLink bus and 128-bit AXI4 bus: 128b * 300MHz = 38400 Mb/s
- For transfers using the TileLink bus and the 64-bit AXI4 bus: 64b * 300MHz = 19200 Mb/s
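The arithmetic above is bus width multiplied by bus clock. The following is a minimal sketch of that calculation, assuming the 300 MHz bus clock stated earlier; the function names are hypothetical, and the conversion to megabytes per second is added only for convenience.

```python
def theoretical_max_mbps(bus_width_bits, bus_clock_mhz):
    """Theoretical maximum rate in Mb/s: bits per cycle times cycles per microsecond."""
    return bus_width_bits * bus_clock_mhz


def mbps_to_mbytes_per_s(rate_mbps):
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return rate_mbps / 8


if __name__ == "__main__":
    for width in (128, 64):
        rate = theoretical_max_mbps(width, 300)
        print(f"{width}-bit bus at 300 MHz: {rate} Mb/s "
              f"({mbps_to_mbytes_per_s(rate):.0f} MB/s)")
```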
The table below summarises the theoretical maximum performance for every source and destination pair:
Source\Destination | L2-LIM | Scratchpad | Cached DDR | Non-Cached DDR |
---|---|---|---|---|
L2-LIM | 38400 Mb/s | 38400 Mb/s | 38400 Mb/s | 19200 Mb/s |
Scratchpad | 38400 Mb/s | 38400 Mb/s | 38400 Mb/s | 19200 Mb/s |
Cached DDR | 38400 Mb/s | 38400 Mb/s | 38400 Mb/s | 19200 Mb/s |
Non-Cached DDR | 19200 Mb/s | 19200 Mb/s | 19200 Mb/s | 19200 Mb/s |
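
The table above reduces to one rule: a transfer that touches Non-Cached DDR at either end is limited by the 64-bit AXI4 bus, and every other pair gets the full 128-bit rate. The sketch below encodes that rule and shows how a "% of Theoretical Max Rate" figure can be derived from a measured peak rate. The function names are hypothetical, and rounding to the nearest whole percent is an assumption about how the results tables were produced.

```python
FULL_RATE_MBPS = 38400  # 128-bit path at 300 MHz
HALF_RATE_MBPS = 19200  # 64-bit AXI4 path at 300 MHz


def pair_max_mbps(source, destination):
    """Theoretical maximum for a source/destination pair, per the table above."""
    if "Non-Cached DDR" in (source, destination):
        return HALF_RATE_MBPS
    return FULL_RATE_MBPS


def percent_of_max(peak_rate_mbps, max_rate_mbps):
    """Measured peak rate as a whole-number percentage of the theoretical maximum."""
    return round(100 * peak_rate_mbps / max_rate_mbps)


if __name__ == "__main__":
    limit = pair_max_mbps("L2-LIM", "Non-Cached DDR")
    print(limit, percent_of_max(12696, limit))
```
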
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 0.87 | 9560 | 25% |
Cached DDR + Force Ordering | 0.96 | 7656 | 20% |
L2-LIM | 0.953 | 8472 | 22% |
L2-LIM + Force Ordering | 0.998 | 7064 | 18% |
Scratchpad | 0.883 | 9560 | 25% |
Scratchpad + Force Ordering | 0.986 | 7656 | 20% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 1.997 | 19120 | 50% |
Cached DDR + Force Ordering | 2.093 | 12280 | 32% |
L2-LIM | 1.971 | 17656 | 46% |
L2-LIM + Force Ordering | 2.089 | 10944 | 28% |
Scratchpad | 2.083 | 19080 | 50% |
Scratchpad + Force Ordering | 2.074 | 12256 | 32% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 1.93 | 19120 | 50% |
Cached DDR + Force Ordering | 2.059 | 12256 | 32% |
L2-LIM | 1.958 | 17656 | 46% |
L2-LIM + Force Ordering | 2.07 | 10936 | 28% |
Scratchpad | 2.074 | 19080 | 50% |
Scratchpad + Force Ordering | 2.092 | 12264 | 32% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 2.054 | 16992 | 88% |
Cached DDR + Force Ordering | 0.031 | 5688 | 30% |
L2-LIM | 2.085 | 16928 | 88% |
L2-LIM + Force Ordering | 1.931 | 5472 | 28% |
Scratchpad | 1.907 | 16984 | 88% |
Scratchpad + Force Ordering | 0.031 | 5688 | 30% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 1.997 | 19112 | 50% |
Cached DDR + Force Ordering | 1.216 | 8064 | 21% |
L2-LIM | 1.833 | 16992 | 44% |
L2-LIM + Force Ordering | 1.893 | 7824 | 20% |
Scratchpad | 2.074 | 19072 | 50% |
Scratchpad + Force Ordering | 2.028 | 8064 | 21% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 1.878 | 30368 | 79% |
Cached DDR + Force Ordering | 1.955 | 14296 | 37% |
L2-LIM | 1.96 | 30224 | 79% |
L2-LIM + Force Ordering | 1.966 | 14168 | 37% |
Scratchpad | 1.752 | 30312 | 79% |
Scratchpad + Force Ordering | 1.959 | 14280 | 37% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 3.152 | 30464 | 79% |
Cached DDR + Force Ordering | 2.862 | 14560 | 38% |
L2-LIM | 3.807 | 30472 | 79% |
L2-LIM + Force Ordering | 3.914 | 14736 | 38% |
Scratchpad | 3.845 | 30504 | 79% |
Scratchpad + Force Ordering | 3.902 | 14800 | 39% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 3.046 | 17024 | 89% |
Cached DDR + Force Ordering | 3.639 | 6976 | 36% |
L2-LIM | 3.925 | 17000 | 89% |
L2-LIM + Force Ordering | 3.456 | 6960 | 36% |
Scratchpad | 3.833 | 17016 | 89% |
Scratchpad + Force Ordering | 2.176 | 6968 | 36% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 2.069 | 19120 | 50% |
Cached DDR + Force Ordering | 1.33 | 8064 | 21% |
L2-LIM | 1.928 | 16992 | 44% |
L2-LIM + Force Ordering | 1.907 | 7728 | 20% |
Scratchpad | 2.074 | 19072 | 50% |
Scratchpad + Force Ordering | 1.913 | 8064 | 21% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 2.786 | 30464 | 79% |
Cached DDR + Force Ordering | 2.426 | 14416 | 38% |
L2-LIM | 3.865 | 30464 | 79% |
L2-LIM + Force Ordering | 3.909 | 14736 | 38% |
Scratchpad | 3.672 | 30488 | 79% |
Scratchpad + Force Ordering | 3.89 | 14800 | 39% |
The drop-off in performance that occurs in the above graph at ~0.256 MB is due to the L2 cache becoming full. This is because the P-DMA can transfer data to the cache faster than the cache is able to write data to DDR memory.
The L2 Cache configuration used in this application has 0.512 MB of space available to be used by Cached DDR. In this benchmark, Cached DDR is used as both the source and the destination, so twice the space is required for each transfer, causing the cache to become full at ~0.256 MB. In the Cached DDR to Non-Cached DDR and Non-Cached DDR to Cached DDR benchmarks, by contrast, the entire cache is used only as either the source or the destination, so performance begins to degrade at ~0.512 MB.
A full discussion of the impact of the L2 Cache configuration on the performance of transfers to Cached DDR is provided in the DMA benchmarking overview readme document.
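
The cache-space reasoning above can be written as a small calculation: the 0.512 MB available to Cached DDR is shared between however many ends of the transfer are in Cached DDR. This is a simplification of the explanation in this document, not a model of the L2 cache, and the function name and its `None` return for transfers that do not involve Cached DDR are assumptions made for illustration.

```python
CACHED_DDR_SPACE_MB = 0.512  # cache space available to Cached DDR in this configuration


def expected_dropoff_mb(source, destination, cache_space_mb=CACHED_DDR_SPACE_MB):
    """Approximate transfer size (MB) at which the cache fills and the rate drops.

    Returns None when neither end of the transfer is Cached DDR.
    """
    cached_ends = [source, destination].count("Cached DDR")
    if cached_ends == 0:
        return None
    return cache_space_mb / cached_ends


if __name__ == "__main__":
    print(expected_dropoff_mb("Cached DDR", "Cached DDR"))
    print(expected_dropoff_mb("Cached DDR", "Non-Cached DDR"))
```
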
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 1.272 | 30288 | 79% |
Cached DDR + Force Ordering | 1.742 | 13856 | 36% |
L2-LIM | 2.01 | 30248 | 79% |
L2-LIM + Force Ordering | 2.084 | 14232 | 37% |
Scratchpad | 1.993 | 30320 | 79% |
Scratchpad + Force Ordering | 2.091 | 14344 | 37% |
The drop-off in performance at ~0.512 MB is due to the cache becoming full, as the P-DMA can transfer data faster than the cache can clear data from DDR memory. More cache space can be allocated to the L2 Cache using the MSS Configurator.
A full discussion of the impact of the L2 Cache configuration on the performance of transfers to Cached DDR is provided in the DMA benchmarking overview readme document.
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 1.638 | 16984 | 88% |
Cached DDR + Force Ordering | 1.465 | 6968 | 36% |
L2-LIM | 3.942 | 17000 | 89% |
L2-LIM + Force Ordering | 3.449 | 6960 | 36% |
Scratchpad | 3.827 | 17016 | 89% |
Scratchpad + Force Ordering | 2.194 | 6968 | 36% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 1.527 | 12696 | 66% |
Cached DDR + Force Ordering | 0.118 | 2352 | 12% |
L2-LIM | 2.075 | 12392 | 65% |
L2-LIM + Force Ordering | 0.909 | 2352 | 12% |
Scratchpad | 2.062 | 12424 | 65% |
Scratchpad + Force Ordering | 1.756 | 2344 | 12% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 2.469 | 12704 | 66% |
Cached DDR + Force Ordering | 0.108 | 2352 | 12% |
L2-LIM | 3.83 | 12592 | 66% |
L2-LIM + Force Ordering | 0.874 | 2352 | 12% |
Scratchpad | 3.878 | 12608 | 66% |
Scratchpad + Force Ordering | 3.534 | 2352 | 12% |
The drop-off in performance at ~0.512 MB is due to the cache becoming full, as the P-DMA can transfer data to the cache faster than the cache can write data to DDR memory. More cache space can be allocated to the L2 Cache using the MSS Configurator. A full discussion of the impact of the L2 Cache configuration on the performance of transfers to Cached DDR is provided in the DMA benchmarking overview readme document.
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 0.642 | 12624 | 66% |
Cached DDR + Force Ordering | 0.082 | 2360 | 12% |
L2-LIM | 4.068 | 12600 | 66% |
L2-LIM + Force Ordering | 0.902 | 2352 | 12% |
Scratchpad | 4.186 | 12632 | 66% |
Scratchpad + Force Ordering | 3.578 | 2352 | 12% |
Executing From | Transfer Size (Mb) | Peak Rate (Mb/s) | % of Theoretical Max Rate |
---|---|---|---|
Cached DDR | 2.263 | 3920 | 20% |
Cached DDR + Force Ordering | 0.134 | 1960 | 10% |
L2-LIM | 1.646 | 3624 | 19% |
L2-LIM + Force Ordering | 0.798 | 1960 | 10% |
Scratchpad | 1.62 | 3624 | 19% |
Scratchpad + Force Ordering | 2.966 | 1960 | 10% |