Welcome to the RISC-V Code Size Reduction Group

If you can't measure it you can't improve it

This is the current proposal for reducing code size:

Zce ISA proposal

This will be the home for the all of the code size reduction proposals, analysis, results etc.

Documentation of existing ISA extensions

Existing ISA extensions to reduce code size

Toolchain optimisation suggestions

Anders: Second GP (thread pointer - requires ABI change) to allow more data to be in easy reach without building long addresses
Jeremy: Recognising similar/same constants in the linker and simplifying them
Jim Wilson: fix -mno-strict-align
Reported by Matteo: riscvarchive/riscv-gcc#193
Use C.SWSP instead of SB/SH when zeroing stack variables, so can use 16-bit encodings and if possible group variables together to allow fewer stores to cover them all

Code size analysis suggestions

Build a histogram of instruction distribution
find common pairs of instructions to see if they can be combined

Publicly available benchmarks

Embench
- antecedents BEEBS, MiBench, Mälardalen WCET benchmarks, Hacker's Delight/A Hacker's Assistant
Hi3961 Huawei WiFi IoT platform
Code size benchmarks
Opus codec
TACLe Benchmarks
- I didn't see any code here, just details of a committee...... Tariq
softfloat and testfloat
MLPerf - specifically for AI/ML applications
- Seems to be Debian installation only which I don't have available....... Tariq
BDTI
- Need to be licensed? It's a US company so I can't talk to them.... Tariq
Zephyr includes BLE (bluetooth)
- Nordic SDK
- Examples
Need to find more device drivers....
LLVM test suite

Proprietary benchmarks

Huawei IoT code
others?

Useful papers

Peijie Li's Berkeley paper

Useful software

RISCV-NEWOP - auto detect fused instructions from a spike trace and add the instructions to spike . For details of the presenter click here

Reference Architectures

These are architectures we could compare against. The "official" comparison architectures have not yet been decided, but almost certainly need freely available ISA manuals and GCC+LLVM ports

ARMv7-M / Cortex-M3 manual is here
ARCv1 / ARC700 manual is here ARCv2 would be better but is proprietary (ISA and toolchain)
NanoMIPs manual is here see "save" instruction on page 163 and "restore/restore.js" instructions on page 155
AVR32 manual is here
J-core manual is here a reimplementation of Hitachi SH2
SuperH instruction reference is here

Reference Toolchains

GNU Arm Embedded Toolchain / LLVM? Version / download link?
ARC

Benchmark issues

Embench:
- Add bigger programs to be more representative.
- Add FP intensive benchmarks.
- Cubic benchmark uses long double type, which is interpreted as 128-bit by RISC-V and 64-bit by Arm. This brings to possibly unfair comparisons both with and without libraries. In the former case, the linker links FP-functions to deal with quad-precision FP numbers, which are way heavier than the ones for double-precision; however, even without considering the libraries, the RISC-V code size is bloated because the quad-precision FP numbers should be passed by reference, with higher memory usage.

RISC-V Known Compiler Inefficiencies

Frame pointer allocation inefficiency for > 4KiB data on the stack: riscvarchive/riscv-gcc#193

Linker script and Global Pointer optimization

The position of the Global Pointer (GP) is fundamental for the code size problem since memory operations exploit it as a base register for the addressing without the need to pre-load an immediate (the base for the addressing) in a register. Given the 12-bit signed-immediate encoding of the memory operations, the GP is effective when the memory address falls within the interval [GP-2KiB, GP+2KiB), so frequently accessed data should be placed in this range to maximize the code-size/performance benefit. This operation can be done inside the linker script.

ABI modification proposal

A second GP would be especially beneficial for applications with scattered memory maps, in which some important data can be put far away from the GP. A possible approach can be a modification of the ABI to free-up the Thread Pointer (TP) register, which is often unused in many applications. This special register cannot be allocated by the compiler, and making it general-purpose or modifying its actual role (e.g., a second GP) could benefit both code-size and performance of the embedded applications. As already mentioned, the TP can also be used as a general-purpose register. This would be important also for the RV32E extension, as the TP is one of the 16 available registers (x4).

Library support

Optimized library functions can make the difference for two reasons:

Code tailored for embedded application
Smaller + Faster code, assembly-optimized in the critical parts
No intricate inter-dependencies, which call a lot of different functions even if they are not fundamental

Different libraries to optimize:

Low-level runtime library (e.g. libgcc)
Optimize the FP functions, which bloat the code especially when linked to small programs
Other external libraries, like newlib
picolibc by Keith Packard (https://keithp.com/blogs/picolibc/, https://github.com/picolibc/picolibc) is the evolution of newlib-nano, a shrunk version of newlib. Citing the GitHub description: "Picolibc is library offering standard C library APIs that targets small embedded systems with limited RAM. Picolibc was formed by blending code from Newlib and AVR Libc."

Code size reduction ideas

Need a lot more detail for these, they're just placeholders at the moment

runtime library optimisation
link time optimisation including dead code elimination
function prologue/epilogue optimisation in software, to close the gap with the PUSH/POP ISA extension proposal
smaller instruction sequences to jump to distant addresses
- including jump-chaining, where a tree of jumps is built to allow access to distant functions using mainly shorter encodings
smaller instruction sequences to load/store to distant addresses
smaller instruction sequences to load 32-bit constants
load/store multiple
- specified as a register list, or as base register and register count?
- what's best for the compiler? base + count could probably fit in a 16-bit encoding
ADD instruction(s) with a carry out for chaining into longer ADDs (Matteo)
- add sl, xl, yl -> implicitly save the carry bit
- addH sh, xh, yh <- implicitly add the carry bit
- But I don't know if outside the soft-float FP functions we would have significant gains.

From Anders Lindgren:

Better support for 8 and 16 bit data
- Today, most RISC-V instructions work on the full registers. This makes the generated code more efficient to handle 32 bit data than 8 and 16 bit data. Effectively, the compiler must ensure that 8 and 16 bit data are properly extended before it can perform things like compares on them. To make things worse, RISC-V doesn't provide instructions to perform extensions so typically two instructions are needed to perform extensions (with the exception of 8 bit zero extension which can be done using "ANDI Rd, Rs, 0xFF"). Instructions to perform sign and zero extend (preferably with compact variants) are obvious candidates. In addition, we could consider 8 and 16 bit variants (and for RV64 32 bit variants) for various instructions like compare, right shift, division, and modulo. One thing that makes the situation worse is that the ABI requires arguments and return values to be correctly extended. Hence a small function like "short f(short x, short y) { return x + y; }" require 4 instructions (add, shift left, signed shift right, ret). I would like to see if the overall code size would shrink if the ABI didn't require this, and, if so, recommend that the EABI (which isn't ratified) is changed to that fewer extension instructions are needed.
Insert and extract parts of registers
- If it would be easier to insert and extract parts of registers, we could avoid storing things on the stack. Concretely, a RV32 processor register could be used to store four bytes or two halfwords.
Improved compare with constants
- Today, when comparing a value against a non-zero constant, at least two instructions are needed. Instructions that compare a register against commonly used constants (imm5?) could reduce code size. We need to see which constants and which comparisons are most effective.
- See this proposal for combined compare-immediate-branch
Address calculations with scaling
- In C, when doing address calculations, the index value is scaled with the object size to produce the end address. Today, this is done using an explicit shift (when the size of the object is a power of two) or a multiplication. We should look into loads, stores, and load-effective-address with this scaling builtin. Since most arrays use elements of size 2, 4, and 8 we could restrict ourselves to this.

Experiments

enable B-extension, maybe a subset could become part of a future code-size reduction ISA extension

Outputs from the group

Improved open source compiler technology (GCC and LLVM)
- code size optimised compilers with and without Zce (see below)
- for example function prologue/epilogue should be smaller than -msave-restore is now in GCC.
Code size reduction ISA extensions

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Welcome to the RISC-V Code Size Reduction Group

Reference Architectures

Reference Toolchains

Benchmark issues

RISC-V Known Compiler Inefficiencies

Linker script and Global Pointer optimization

ABI modification proposal

Library support

Code size reduction ideas

Experiments

Outputs from the group

Files

README.md

Latest commit

History

README.md

File metadata and controls

Welcome to the RISC-V Code Size Reduction Group

Reference Architectures

Reference Toolchains

Benchmark issues

RISC-V Known Compiler Inefficiencies

Linker script and Global Pointer optimization

ABI modification proposal

Library support

Code size reduction ideas

Experiments

Outputs from the group