Releases: pc2/StencilStream
StencilStream v3.0.0
StencilStream Version 2.1.1
This release contains hotfixes for regressions introduced by newer oneAPI versions and brings StencilStream up to date with oneAPI version 2022.3.0.
StencilStream Version 2.1.0
We are very excited to bring you the new version 2.1.0 of StencilStream, the Generic Stencil Simulation Library for FPGAs!
This major update brings two new features, along with some smaller tweaks:
- Arbitrary pipeline lengths
- A new stencil executor with a monotile architecture
Arbitrary Pipeline Lengths
Previously, due to a small internal quirk, pipeline lengths always had to be a power of two. This was because the pipeline length is used to calculate the number of required banks, which DPC++ requires to be a power of two. StencilStream now circumvents this by rounding the number of banks up to the next power of two, but only using the banks that are required to implement the requested pipeline length. DPC++ automatically optimizes the superfluous banks away, which leads to the desired memory and area footprint.
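The rounding step can be sketched in plain C++. This is only an illustration, not StencilStream's actual code: the function names are hypothetical, and it assumes for simplicity that one bank is needed per pipeline stage.

```cpp
#include <cstddef>

// Smallest power of two that is greater than or equal to n.
// Assumes n is small enough that doubling does not overflow.
constexpr std::size_t next_power_of_two(std::size_t n) {
    std::size_t result = 1;
    while (result < n) {
        result *= 2;
    }
    return result;
}

// Hypothetical helper: number of banks to declare for a pipeline length,
// assuming (for illustration only) one bank per pipeline stage.
constexpr std::size_t declared_banks(std::size_t pipeline_length) {
    return next_power_of_two(pipeline_length);
}

// Hypothetical helper: only the first `pipeline_length` banks are accessed;
// the remaining ones are never touched and can be optimized away.
constexpr bool bank_in_use(std::size_t bank, std::size_t pipeline_length) {
    return bank < pipeline_length;
}

// A pipeline with 200 stages declares 256 banks but uses only 200 of them.
static_assert(declared_banks(200) == 256, "200 rounds up to 256");
static_assert(bank_in_use(199, 200) && !bank_in_use(200, 200), "banks 200..255 stay unused");
```

Because the unused banks are never read or written, a compiler is free to drop them, which matches the footprint described above.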
A new stencil executor
With the introduction of StencilStream v2.0.0 came a new memory management architecture that divides the dynamic grid into tiles of static size, which allows unrestricted grid sizes and may provide finer runtime scaling. However, there may be applications where the single-tile approach from the v1.x.x versions is preferable due to its reduced complexity. Therefore, the AbstractExecutor has been introduced to separate the execution mechanic from the execution strategy, as well as the MonotileExecutor, which implements the monotile architecture.
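The idea of separating the strategy from a common interface can be sketched as follows. This is a toy model and not the library's API: the class names ExecutorSketch, MonotileSketch and TilingSketch, the 1D integer grid, and the neighbor-free transition function are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interface: states what an executor does, not how it does it.
class ExecutorSketch {
public:
    virtual ~ExecutorSketch() = default;
    // Advance every cell of `grid` by `n_generations` generations.
    virtual void run(std::vector<int> &grid, std::size_t n_generations) const = 0;
};

// Toy transition function without neighbor accesses: each generation adds one.
inline int transition(int cell) { return cell + 1; }

// Strategy 1: treat the whole grid as one tile.
class MonotileSketch : public ExecutorSketch {
public:
    void run(std::vector<int> &grid, std::size_t n_generations) const override {
        for (std::size_t g = 0; g < n_generations; ++g)
            for (int &cell : grid) cell = transition(cell);
    }
};

// Strategy 2: walk over the grid in tiles of a fixed size.
class TilingSketch : public ExecutorSketch {
public:
    explicit TilingSketch(std::size_t tile_size) : tile_size_(tile_size) {
        assert(tile_size > 0);
    }
    void run(std::vector<int> &grid, std::size_t n_generations) const override {
        for (std::size_t begin = 0; begin < grid.size(); begin += tile_size_) {
            const std::size_t end = std::min(begin + tile_size_, grid.size());
            for (std::size_t g = 0; g < n_generations; ++g)
                for (std::size_t i = begin; i < end; ++i) grid[i] = transition(grid[i]);
        }
    }

private:
    std::size_t tile_size_;
};

// Caller code depends only on the interface, so strategies are interchangeable.
inline std::vector<int> simulate(const ExecutorSketch &executor, std::vector<int> grid,
                                 std::size_t n_generations) {
    executor.run(grid, n_generations);
    return grid;
}
```

Both strategies produce the same result for the same input, so an application can switch between them to compare complexity and performance.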
We are looking forward to seeing how the different strategies compare in different scenarios, so we would appreciate your feedback!
StencilStream Version 2.0.1
We are very excited to bring you the new version 2.0.1 of StencilStream, the Generic Stencil Simulation Library for FPGAs!
This version brings some minor improvements and bug fixes that were requested by users:
- Bugfix: Evaluating the correct number of generations when the requested number of generations is not a multiple of the pipeline length.
- Bugfix: Allowing cell sizes that are not a divisor of the flush size.
- New API: The method RuntimeSample::get_mean_speed returns the number of pipeline passes per second.
- New API: The attribute Stencil::grid_range contains the range of the grid.
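The first bugfix concerns generation counts that do not divide evenly by the pipeline length. A minimal sketch of the arithmetic, assuming that one pipeline pass advances the grid by at most one generation per pipeline stage (the function names are hypothetical and not part of the library):

```cpp
#include <cstddef>

// Number of pipeline passes needed for n_generations (ceiling division).
// Assumes pipeline_length > 0.
constexpr std::size_t passes_needed(std::size_t n_generations, std::size_t pipeline_length) {
    return (n_generations + pipeline_length - 1) / pipeline_length;
}

// Number of generations that the pass with the given 0-based index must compute.
// Every pass but the last computes pipeline_length generations; the last pass
// computes only the remainder, so surplus stages must leave cells unchanged.
constexpr std::size_t generations_in_pass(std::size_t n_generations,
                                          std::size_t pipeline_length,
                                          std::size_t pass_index) {
    const std::size_t done = pass_index * pipeline_length;
    if (done >= n_generations) {
        return 0;
    }
    const std::size_t remaining = n_generations - done;
    return remaining < pipeline_length ? remaining : pipeline_length;
}

// 100 generations on a 32-stage pipeline: three full passes plus one pass of 4.
static_assert(passes_needed(100, 32) == 4, "four passes in total");
static_assert(generations_in_pass(100, 32, 3) == 4, "last pass is partial");
```

Under the same assumption, a speed in passes per second can be turned into an upper bound on generations per second by multiplying with the pipeline length.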
In addition to that, the FDTD example has been rewritten to produce more useful and correct data.
StencilStream Version 2.0.0
We are very excited to bring you the new version 2.0.0 of StencilStream, the Generic Stencil Simulation Library for FPGAs!
For this release, we have fundamentally changed the way StencilStream works internally, which allows simulation grids of arbitrary size and better scaling for smaller grids. Let's go into the details:
What's new?
Architecture
StencilStream now uses a spatial tiling approach introduced by Hamid Reza Zohouri, Artur Podobas and Satoshi Matsuoka that partitions a dynamically sized grid into statically sized tiles which can be better handled by the processing pipeline.
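As a rough illustration of the partitioning, the number of tiles needed to cover a grid follows from a ceiling division per dimension. The struct and function names below are hypothetical, and the sketch ignores how tiles overlap or exchange halo cells in the real architecture.

```cpp
#include <cstddef>

// Hypothetical result type: number of tiles per dimension.
struct TileRange {
    std::size_t columns;
    std::size_t rows;
};

// Ceiling division; assumes b > 0.
constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Number of statically sized tiles needed to cover a dynamically sized grid.
// Tiles on the right and bottom edges may be only partially filled.
constexpr TileRange tiles_for_grid(std::size_t grid_width, std::size_t grid_height,
                                   std::size_t tile_width, std::size_t tile_height) {
    return TileRange{ceil_div(grid_width, tile_width), ceil_div(grid_height, tile_height)};
}

// A 1000 x 600 grid with 256 x 256 tiles is covered by 4 x 3 tiles.
static_assert(tiles_for_grid(1000, 600, 256, 256).columns == 4, "four tile columns");
static_assert(tiles_for_grid(1000, 600, 256, 256).rows == 3, "three tile rows");
```

The tile size is fixed at compile time while the grid size is only known at runtime, which is why the tile count has to be computed dynamically.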
Defined Grid Halo
This also allows for a new way to handle the grid halo: the cells outside the grid that are required to calculate the cells on the grid's edge. In the previous version, these cells were undefined and transition functions had to use the indices to check for edge cases. Now, the user can provide a constant value to StencilStream and the pipeline guarantees that all cells in the grid halo will have this value. Old transition functions might still work, but their complexity can be vastly reduced using this precondition.
For example, you would have needed to write a transition function like this in v1.1.1 in order to sum up the neighbors of a cell:
```cpp
auto trans_func = [grid_width, grid_height](Stencil2D<float, 1> const &stencil, Stencil2DInfo const &info) {
    float sum = 0;
    if (info.center_cell_id.c > 0) {
        sum += stencil[ID(-1, 0)];
    }
    if (info.center_cell_id.c < grid_width - 1) {
        sum += stencil[ID(1, 0)];
    }
    if (info.center_cell_id.r > 0) {
        sum += stencil[ID(0, -1)];
    }
    if (info.center_cell_id.r < grid_height - 1) {
        sum += stencil[ID(0, 1)];
    }
    return sum;
};
```
Now, you can set the halo value to 0.0 and simply write:
```cpp
auto trans_func = [](Stencil<float, 1> const &stencil) {
    return stencil[ID(-1, 0)] + stencil[ID(1, 0)] + stencil[ID(0, -1)] + stencil[ID(0, 1)];
};
```
Edge cases are automatically handled by StencilStream.
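The guarantee can be modelled on the host in plain C++ to see why the bounds checks become unnecessary. The HaloGrid class below is a hypothetical stand-in and not part of StencilStream; it simply returns the halo value for every access outside the grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a grid with a constant halo value.
class HaloGrid {
public:
    HaloGrid(std::size_t width, std::size_t height, float halo_value)
        : width_(width), height_(height), halo_(halo_value), cells_(width * height, 0.0f) {}

    // Write a cell inside the grid (column c, row r).
    void set(std::size_t c, std::size_t r, float value) {
        assert(c < width_ && r < height_);
        cells_[r * width_ + c] = value;
    }

    // Read a cell; coordinates outside the grid yield the halo value.
    float get(long c, long r) const {
        if (c < 0 || r < 0 || c >= static_cast<long>(width_) || r >= static_cast<long>(height_)) {
            return halo_;
        }
        return cells_[static_cast<std::size_t>(r) * width_ + static_cast<std::size_t>(c)];
    }

    // Sum of the four direct neighbors, without any edge-case checks.
    float neighbor_sum(long c, long r) const {
        return get(c - 1, r) + get(c + 1, r) + get(c, r - 1) + get(c, r + 1);
    }

private:
    std::size_t width_;
    std::size_t height_;
    float halo_;
    std::vector<float> cells_;
};
```

With a halo value of 0.0, the neighbor sum of a corner cell only counts the two neighbors that lie inside the grid, which is the same result the v1.1.1 transition function computes with its explicit checks.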
Pipeline Length as a Template Parameter
The previous version of StencilStream used preprocessor macros to duplicate the execution stages of a pipeline. This came with the limitation that the pipeline length was capped at 1024 stages and that the length had to be set via a macro definition too. In version 2.0.0, we have overcome this limitation and the pipeline length of a design is now set as a template parameter of the StencilExecutor class too.
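A plain C++ analogue of the template-based approach, under the assumption that a stage is a simple function applied repeatedly (the names stage and run_pipeline are hypothetical and unrelated to the StencilExecutor interface):

```cpp
#include <cstddef>

// Hypothetical pipeline stage: a toy update that increments the cell.
constexpr int stage(int cell) { return cell + 1; }

// The pipeline length is a compile-time constant supplied as a template
// parameter, so the loop bound is known to the compiler and no macro is
// needed to duplicate the stage.
template <std::size_t pipeline_length>
constexpr int run_pipeline(int cell) {
    for (std::size_t i = 0; i < pipeline_length; ++i) {
        cell = stage(cell);
    }
    return cell;
}

// Lengths beyond the former cap of 1024 stages are expressible.
static_assert(run_pipeline<2048>(0) == 2048, "2048 stages applied");
```

In an FPGA design the loop would additionally have to be fully unrolled so that every iteration becomes its own hardware stage; that step is omitted here.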
Breaking Changes
This release also brings some breaking changes to the user-facing interface to reduce verbosity and improve clarity:
- StencilStream now uses StencilStream as the default directory name, and a one-file-per-class policy has been adopted where suitable. For example, the include line for the StencilExecutor class is #include <StencilStream/StencilExecutor.hpp> instead of #include <stencil/stencil.hpp>.
- The Stencil2D class has been renamed to Stencil.
- The Stencil2DInfo class has been merged into Stencil; transition functions only accept a Stencil instance as a parameter.
- The StencilExecutor class has been completely rewritten.
More information
More information on how StencilStream is structured and how the interface is designed can be found in the documentation. It is both hosted online and attached as a tarball.
What's next?
This version marks the introduction of the spatial tiling architecture. Until now, we have focused on correctness and clarity, with performance as a secondary concern. In subsequent releases, we will profile and improve the performance of StencilStream and also provide optimization guides for users to achieve the full potential of their applications.
Your feedback is always welcome! Please submit an issue if you find a bug or have a feature request.
StencilStream Version 1.1.1
Changes
This release adds a benchmark mode to the hotspot and fdtd examples. In fdtd, this can be enabled with the -b flag. In hotspot, it can be enabled by appending true to the list of arguments.
Performance
This release contains the isolated StencilStream library as well as synthesized application binaries. All of these binaries have been synthesized using oneAPI version beta-10, and the targeted boards are the Nallatech/Bittware 520N Board as well as the Intel PAC Stratix 10. Below are performance metrics of some sample applications. The conway application is optimized for readability, not for performance, and is therefore not listed.
Bittware/Nallatech 520N (Stratix 10 GX 2800)
Application | Cycles per Loop | Pipeline Depth | Cycle Frequency | Generations per Second | Overall Performance | Logic Usage | Register Usage | RAM Usage | DSP Usage |
---|---|---|---|---|---|---|---|---|---|
hotspot | 1.07 cycles | 200 cores | 206.25 MHz | 36933 G/s | 580.91 GFLOPS | 79.38% | 49.02% | 35.35% | 52.13% |
fdtd | 16.54 cycles | 35 cores | 272.50 MHz | 243.56 G/s | 136.10 GFLOPS | 79.93% | 49.61% | 47.64% | 52.66% |
Intel PAC D5005 (Stratix 10 SX 2800)
Application | Cycles per Loop | Pipeline Depth | Cycle Frequency | Generations per Second | Overall Performance | Logic Usage | Register Usage | RAM Usage | DSP Usage |
---|---|---|---|---|---|---|---|---|---|
hotspot | 0.98 cycles | 200 cores | 163.00 MHz | 31644.4 G/s | 497.724 GFLOPS | 83.67% | 50.16% | 35.60% | 52.13% |
fdtd | 6.69 cycles | 20 cores | 221.00 MHz | 157.61 G/s | 78.01 GFLOPS | 63.93% | 35.65% | 33.45% | 30.30% |
StencilStream Version 1.0.0
This release contains the isolated StencilStream library as well as synthesized application binaries. All of these binaries have been synthesized using oneAPI version beta-10, and the targeted boards are the Nallatech/Bittware 520N Board as well as the Intel PAC Stratix 10. Below are performance metrics of some sample applications. The conway application is optimized for readability, not for performance, and is therefore not listed.
Nallatech/Bittware 520N Board (Stratix 10 GX 2800)
Application | Main Loop II | Pipeline Depth | Cycle Frequency | Generations per Second | Overall Performance | Logic Usage | Register Usage | RAM Usage | DSP Usage |
---|---|---|---|---|---|---|---|---|---|
hotspot | 1.05 cycles | 225 cores | 79.63 MHz | 16,328 G/s | 256.84 GFLOPS | 85.34% | 51.23% | 38.31% | 58.64% |
fdtd | 1.73 cycles | 30 cores | 225 MHz | 233.10 G/s | 29.02 KFLOPS | 83.19% | 50.37% | 43.91% | 45.42% |
Intel PAC (Stratix 10 SX)
Application | Main Loop II | Pipeline Depth | Cycle Frequency | Generations per Second | Overall Performance | Logic Usage | Register Usage | RAM Usage | DSP Usage |
---|---|---|---|---|---|---|---|---|---|
hotspot | 1.06 cycles | 100 cores | 225.00 MHz | 20,161.29 G/s | 317.17 GFLOPS | 64.26% | 35.75% | 25.09% | 26.11% |
fdtd | 1.45 cycles | 20 cores | 218.00 MHz | 178.95 G/s | 24.43 KFLOPS | 69.41% | 37.87% | 34.66% | 30.29% |
StencilStream Version 1.0.0 RC1
This release contains the isolated StencilStream library as well as synthesized application binaries. All of these binaries have been synthesized using oneAPI version beta-10 and target the Nallatech/Bittware 520N Board, powered by a Stratix 10 GX 2800. Below are some measured performance metrics. Note that the conway application does not provide runtime information and therefore has some blank fields.
Performance Metrics
Application | Main Loop II | Pipeline Depth | Cycle Frequency | Generations per Second | Overall Performance | Logic Usage | Register Usage | RAM Usage | DSP Usage |
---|---|---|---|---|---|---|---|---|---|
hotspot | 1.05 cycles | 225 cores | 79.63 MHz | 16,328 G/s | 256.84 GFLOPS | 85.34% | 51.23% | 38.31% | 58.64% |
fdtd | 1.73 cycles | 30 cores | 225 MHz | 233.10 G/s | 29.02 KFLOPS | 83.19% | 50.37% | 43.91% | 45.42% |
conway | - | 10 cores | 353.33 MHz | - | - | 25.41% | 12.78% | 8.45% | 0.05% |
Copy Bug, build0
copy_bug_0 Changing the scripts for the PAC