
Placement of functions in application binary significantly affects performance #324

Open
j-piecuch opened this issue Oct 12, 2021 · 3 comments

Comments


j-piecuch commented Oct 12, 2021

We found that the placement of functions in the binary can significantly affect performance due to instruction cache misses.
We used the Arty board for testing, with the slim+cfu CPU variant (2 KiB direct-mapped icache, 32 B line size).

As an example, in the hps_accel project, we can force instruction cache misses in the ConvPerChannel4x4() function by placing the LoadInput() function at an address that maps to the same cache set as the call site in ConvPerChannel4x4(). This can be accomplished by modifying the common/ld/linker.ld linker script as follows:

    .text :
    {
        _ftext = .;
        *(.text.start)
        *(.text.*ConvPerChannel4x4*)
        . = ALIGN(2048);
        . = . + 0x220;
        *(.text.*LoadInput*)
        *(.text .stub .text.* .gnu.linkonce.t.*)
        _etext = .;
    } > main_ram
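To see why the `. = . + 0x220;` offset creates a conflict, it helps to spell out how a direct-mapped cache maps addresses to sets. The sketch below uses the slim+cfu parameters quoted above (2 KiB cache, 32 B lines); the base address is a hypothetical load address, not taken from the actual memory map. Two addresses that differ by a multiple of the cache size land in the same set.

```python
# Direct-mapped icache set mapping, using the slim+cfu parameters
# from this issue (2 KiB cache, 32 B lines). The base address below
# is hypothetical, chosen only to be 2 KiB-aligned.
CACHE_SIZE = 2048                    # bytes
LINE_SIZE = 32                       # bytes
NUM_SETS = CACHE_SIZE // LINE_SIZE   # 64 sets, one line each (direct-mapped)

def cache_set(addr):
    """Set index that a byte at `addr` maps to."""
    return (addr // LINE_SIZE) % NUM_SETS

base = 0x40000000  # hypothetical 2 KiB-aligned load address

# The linker script ALIGNs to 2048 and then adds 0x220, so code placed
# there shares a set with the code 0x220 bytes into the previous
# 2 KiB-aligned region -- i.e. with part of ConvPerChannel4x4():
assert cache_set(base + 0x220) == cache_set(base + 2048 + 0x220)
```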

With this change, running the person detection model on cat input takes 203M cycles.

On the other hand, we can prevent cache misses by putting the two functions right next to each other, like so:

    .text :
    {
        _ftext = .;
        *(.text.start)
        *(.text.*ConvPerChannel4x4*)
        *(.text.*LoadInput*)
        *(.text .stub .text.* .gnu.linkonce.t.*)
        _etext = .;
    } > main_ram

With this change, the same model on the same input takes 188M cycles.

Note that on the Arty, the application code is in RAM, so fetching a cache line is less costly than it would be from flash. On HPS hardware, the code is in flash, so the performance difference would likely be significantly larger.

The obvious way to prevent the linker from causing cache conflicts is to modify the linker script, as shown by the second example. This modification should probably be project-specific, which can be done by making the LDSCRIPT variable overridable, and overriding it in specific projects. PR #323 does this for the hps_accel project.

danc86 (Collaborator) commented Oct 13, 2021

Nice find! So it seems like there will be a bit of an art to laying out the hot functions in the final binary to fit nicely into cache.

We can use our knowledge of the accelerator implementation to pick the hot functions. But then we will need to use some heuristics for how to lay them out. It's probably not sufficient to just put them next to each other, is it? Does alignment also have an effect?

Is there a rule of thumb we can make, something like "ensure your hot functions are 2KB-aligned, next to each other, and fit inside 2KB"? The answer will vary depending on which CPU icache configuration we have used too -- in CFU-Playground we typically use full+cfu by default but the hps_accel project is using slim+cfu to save block RAMs.

alanvgreen (Collaborator) commented:

+1 Great find!

j-piecuch (Author) commented Oct 13, 2021

> It's probably not sufficient to just put them next to each other, is it?

It should be. Placing the hot functions next to each other will ensure that they map to different cache sets, assuming the combined size of those functions doesn't exceed the cache size. Note that this holds only if the cache is direct-mapped (i.e. each cache set holds only 1 cache line), which is the case for the slim+cfu and full+cfu variants.
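The claim above can be checked with a few lines of arithmetic. This sketch (constants again matching the slim+cfu icache; the start address is a made-up, line-aligned example) confirms that a contiguous, line-aligned block no larger than the cache maps each of its cache lines to a distinct set, so hot functions placed back-to-back never evict each other in a direct-mapped cache.

```python
# Check: in a direct-mapped cache, a contiguous line-aligned block of
# code no larger than the cache touches each set at most once.
# Constants match the slim+cfu icache (2 KiB, 32 B lines).
LINE_SIZE = 32
NUM_SETS = 2048 // LINE_SIZE  # 64 sets, one line per set

def sets_touched(start, size):
    """List of set indices occupied by [start, start + size)."""
    first_line = start // LINE_SIZE
    last_line = (start + size - 1) // LINE_SIZE
    return [line % NUM_SETS for line in range(first_line, last_line + 1)]

# A line-aligned 2 KiB block, placed anywhere, covers all 64 sets
# exactly once -- no two of its lines conflict:
sets = sets_touched(0x40000220, 2048)  # hypothetical line-aligned address
assert len(sets) == len(set(sets)) == NUM_SETS
```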

> Does alignment also have an effect?

Yes, although a very small one. Aligning the start address of the block of hot functions to the cache line size will help if the size of that block is very close to the size of the instruction cache. For instance, if the size of the block is 2044 bytes and the i-cache is 2048 bytes, then aligning the start of the block to the cache line size will ensure that it fits in the cache.
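The 2044-byte example works out as follows. Counting the cache lines a block straddles (same 32 B line size as above; the unaligned start offset of 20 bytes is an arbitrary illustration) shows that an aligned 2044-byte block occupies exactly 64 lines and fits, while shifting it off line alignment spills it into a 65th line, forcing two of its lines to share a set.

```python
# How many cache lines does a block of `size` bytes starting at
# `start` straddle? (32 B lines, as in both CPU variants here.)
LINE_SIZE = 32

def lines_touched(start, size):
    first_line = start // LINE_SIZE
    last_line = (start + size - 1) // LINE_SIZE
    return last_line - first_line + 1

assert lines_touched(0, 2044) == 64   # line-aligned: fits a 64-line cache
assert lines_touched(20, 2044) == 65  # unaligned (arbitrary 20 B offset):
                                      # spills into a 65th line, so two
                                      # lines must share a set
```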

> Is there a rule of thumb we can make, something like "ensure your hot functions are 2KB-aligned, next to each other, and fit inside 2KB"?

The rule of thumb would be "ensure your hot functions are next to each other, and fit inside the instruction cache". IIRC the full+cfu variant has a 4KiB i-cache, so there's twice as much space compared to the slim+cfu variant.
