
Placement of functions in application binary significantly affects performance #324

Open
j-piecuch opened this issue Oct 12, 2021 · 3 comments

Comments


j-piecuch commented Oct 12, 2021

We found that the placement of functions in the binary can significantly affect performance due to instruction cache misses.
We used the Arty board for testing, with the slim+cfu CPU variant (2 KiB direct-mapped icache, 32 B line size).

As an example, in the hps_accel project, we can force instruction cache misses in the ConvPerChannel4x4() function by placing the LoadInput() function at an address that maps to the same cache set as the call site in ConvPerChannel4x4(). This can be accomplished by modifying the common/ld/linker.ld linker script as follows:

    .text :
    {
        _ftext = .;
        *(.text.start)
        *(.text.*ConvPerChannel4x4*)
        . = ALIGN(2048);
        . = . + 0x220;
        *(.text.*LoadInput*)
        *(.text .stub .text.* .gnu.linkonce.t.*)
        _etext = .;
    } > main_ram
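To see why the `. = . + 0x220;` offset creates a conflict, it helps to spell out how a direct-mapped cache maps addresses to sets. The sketch below uses the slim+cfu parameters quoted above (2 KiB cache, 32 B lines); the base address is a hypothetical load address, not taken from the actual memory map. Two addresses that differ by a multiple of the cache size land in the same set.

```python
# Direct-mapped icache set mapping, using the slim+cfu parameters
# from this issue (2 KiB cache, 32 B lines). The base address below
# is hypothetical, chosen only to be 2 KiB-aligned.
CACHE_SIZE = 2048                    # bytes
LINE_SIZE = 32                       # bytes
NUM_SETS = CACHE_SIZE // LINE_SIZE   # 64 sets, one line each (direct-mapped)

def cache_set(addr):
    """Set index that a byte at `addr` maps to."""
    return (addr // LINE_SIZE) % NUM_SETS

base = 0x40000000  # hypothetical 2 KiB-aligned load address

# The linker script ALIGNs to 2048 and then adds 0x220, so code placed
# there shares a set with the code 0x220 bytes into the previous
# 2 KiB-aligned region -- i.e. with part of ConvPerChannel4x4():
assert cache_set(base + 0x220) == cache_set(base + 2048 + 0x220)
```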

With this change, running the person detection model on cat input takes 203M cycles.

On the other hand, we can prevent cache misses by putting the two functions right next to each other, like so:

    .text :
    {
        _ftext = .;
        *(.text.start)
        *(.text.*ConvPerChannel4x4*)
        *(.text.*LoadInput*)
        *(.text .stub .text.* .gnu.linkonce.t.*)
        _etext = .;
    } > main_ram

With this change, the same model on the same input takes 188M cycles.

Note that on the Arty, the application code is in RAM, so fetching a cache line is less costly than it would be from flash. On HPS hardware, the code is in flash, so the performance difference would likely be significantly larger.

The obvious way to prevent the linker from causing cache conflicts is to modify the linker script, as shown by the second example. This modification should probably be project-specific, which can be done by making the LDSCRIPT variable overridable, and overriding it in specific projects. PR #323 does this for the hps_accel project.

danc86 (Collaborator) commented Oct 13, 2021

Nice find! So it seems like there will be a bit of an art to laying out the hot functions in the final binary to fit nicely into cache.

We can use our knowledge of the accelerator implementation to pick the hot functions. But then we will need to use some heuristics for how to lay them out. It's probably not sufficient to just put them next to each other, is it? Does alignment also have an effect?

Is there a rule of thumb we can make, something like "ensure your hot functions are 2KB-aligned, next to each other, and fit inside 2KB"? The answer will vary depending on which CPU icache configuration we have used too -- in CFU-Playground we typically use full+cfu by default but the hps_accel project is using slim+cfu to save block RAMs.

alanvgreen (Collaborator) commented:

+1 Great find!

j-piecuch (Author) commented Oct 13, 2021

> It's probably not sufficient to just put them next to each other, is it?

It should be. Placing the hot functions next to each other will ensure that they map to different cache sets, assuming the combined size of those functions doesn't exceed the cache size. Note that this holds only if the cache is direct-mapped (i.e. each cache set holds only 1 cache line), which is the case for the slim+cfu and full+cfu variants.
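The claim above can be checked with a few lines of arithmetic. This sketch (constants again matching the slim+cfu icache; the start address is a made-up, line-aligned example) confirms that a contiguous, line-aligned block no larger than the cache maps each of its cache lines to a distinct set, so hot functions placed back-to-back never evict each other in a direct-mapped cache.

```python
# Check: in a direct-mapped cache, a contiguous line-aligned block of
# code no larger than the cache touches each set at most once.
# Constants match the slim+cfu icache (2 KiB, 32 B lines).
LINE_SIZE = 32
NUM_SETS = 2048 // LINE_SIZE  # 64 sets, one line per set

def sets_touched(start, size):
    """List of set indices occupied by [start, start + size)."""
    first_line = start // LINE_SIZE
    last_line = (start + size - 1) // LINE_SIZE
    return [line % NUM_SETS for line in range(first_line, last_line + 1)]

# A line-aligned 2 KiB block, placed anywhere, covers all 64 sets
# exactly once -- no two of its lines conflict:
sets = sets_touched(0x40000220, 2048)  # hypothetical line-aligned address
assert len(sets) == len(set(sets)) == NUM_SETS
```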

> Does alignment also have an effect?

Yes, although a very small one. Aligning the start address of the block of hot functions to the cache line size will help if the size of that block is very close to the size of the instruction cache. For instance, if the size of the block is 2044 bytes and the i-cache is 2048 bytes, then aligning the start of the block to the cache line size will ensure that it fits in the cache.
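The 2044-byte example works out as follows. Counting the cache lines a block straddles (same 32 B line size as above; the unaligned start offset of 20 bytes is an arbitrary illustration) shows that an aligned 2044-byte block occupies exactly 64 lines and fits, while shifting it off line alignment spills it into a 65th line, forcing two of its lines to share a set.

```python
# How many cache lines does a block of `size` bytes starting at
# `start` straddle? (32 B lines, as in both CPU variants here.)
LINE_SIZE = 32

def lines_touched(start, size):
    first_line = start // LINE_SIZE
    last_line = (start + size - 1) // LINE_SIZE
    return last_line - first_line + 1

assert lines_touched(0, 2044) == 64   # line-aligned: fits a 64-line cache
assert lines_touched(20, 2044) == 65  # unaligned (arbitrary 20 B offset):
                                      # spills into a 65th line, so two
                                      # lines must share a set
```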

> Is there a rule of thumb we can make, something like "ensure your hot functions are 2KB-aligned, next to each other, and fit inside 2KB"?

The rule of thumb would be "ensure your hot functions are next to each other, and fit inside the instruction cache". IIRC the full+cfu variant has a 4KiB i-cache, so there's twice as much space compared to the slim+cfu variant.
