Placement of functions in application binary significantly affects performance #324
Comments
Nice find! So it seems like there will be a bit of an art to laying out the hot functions in the final binary so they fit nicely into cache. We can use our knowledge of the accelerator implementation to pick the hot functions, but then we will need some heuristics for how to lay them out. It's probably not sufficient to just put them next to each other, is it? Does alignment also have an effect? Is there a rule of thumb we can state, something like "ensure your hot functions are 2KB-aligned, next to each other, and fit inside 2KB"? The answer will also vary depending on which CPU icache configuration we have used -- in CFU-Playground we typically use …
+1 Great find!
It should be. Placing the hot functions next to each other will ensure that they map to different cache sets, assuming the combined size of those functions doesn't exceed the cache size. Note that this holds only if the cache is direct-mapped (i.e., each cache set holds only one cache line), which is the case for the …
Yes, although a very small one. Aligning the start address of the block of hot functions to the cache line size will help if the size of that block is very close to the size of the instruction cache. For instance, if the size of the block is 2044 bytes and the i-cache is 2048 bytes, then aligning the start of the block to the cache line size will ensure that it fits in the cache.
The rule of thumb would be "ensure your hot functions are next to each other and fit inside the instruction cache". IIRC the …
We found that the placement of functions in the binary can significantly affect the performance due to instruction cache misses.
We used the Arty board for testing, with the `slim+cfu` CPU variant (2 KiB direct-mapped icache, 32 B line size).

As an example, in the `hps_accel` project, we can force instruction cache misses in the `ConvPerChannel4x4()` function by putting the `LoadInput()` function at an address that maps to the same cache set as the place in `ConvPerChannel4x4()` where it is called. This can be accomplished by modifying the `common/ld/linker.ld` linker script. With this change, running the person detection model on cat input takes 203M cycles.
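A hypothetical `linker.ld` fragment in this spirit might look like the sketch below. The per-function section names assume the project is compiled with `-ffunction-sections`; this is an illustrative sketch, not the actual change from the report:

```ld
/* Sketch only: make LoadInput() alias ConvPerChannel4x4() in a 2 KiB
 * direct-mapped cache by separating them by one full cache size.
 * Section names are assumptions, not the actual hps_accel layout. */
SECTIONS {
  .text : {
    *(.text.ConvPerChannel4x4)   /* hot inner loop */
    . = . + 2048;                /* skip exactly one cache size */
    *(.text.LoadInput)           /* now maps onto the same sets */
    *(.text .text.*)             /* remaining code */
  }
}
```

Advancing the location counter by the cache size means every byte of `LoadInput()` lands in the same set as the code 2 KiB before it, so the two functions evict each other on every call.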
On the other hand, we can prevent these cache misses by putting the two functions right next to each other.
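Such an adjacent layout could be sketched as follows (again with assumed, `-ffunction-sections`-style section names):

```ld
/* Sketch only: keep the two hot functions contiguous so their combined
 * footprint (under 2 KiB) maps to distinct cache sets. */
SECTIONS {
  .text : {
    *(.text.ConvPerChannel4x4)
    *(.text.LoadInput)           /* immediately follows: no aliasing */
    *(.text .text.*)             /* remaining code */
  }
}
```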
With this change, the same model on the same input takes 188M cycles.
Note that on the Arty, the application code is in RAM, so fetching a cache line is less costly compared to fetching it from flash. On HPS hardware, the code is in flash, so the performance difference would likely be significantly larger.
The obvious way to prevent the linker from causing cache conflicts is to modify the linker script, as in the second example. This modification should probably be project-specific, which can be done by making the `LDSCRIPT` variable overridable and overriding it in specific projects. PR #323 does this for the `hps_accel` project.
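A minimal sketch of what an overridable `LDSCRIPT` variable could look like in the shared make rules (the variable name comes from this issue, but the paths and mechanics here are assumptions; the real change is in PR #323):

```make
# Common rules: default linker script, overridable per project.
LDSCRIPT ?= $(COMMON_DIR)/ld/linker.ld
LDFLAGS  += -T $(LDSCRIPT)
```

A project could then set `LDSCRIPT := $(PROJ_DIR)/ld/linker.ld` in its own Makefile before including the common rules, pointing the build at a project-specific script that pins the hot functions next to each other.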