hw: Add native bootrom #168

fischeti · 2024-07-19T12:28:53Z

Adds a native bootrom to the cluster instead of fetching the boot code from an external one. Also adds scratch registers to the peripherals, which can be used to write the entry point of the binary.

The currently implemented bootrom is shown below. The cores enable the interrupts of the cluster-internal clint as well as the software interrupts. After an interrupt, the cores load the address stored in the scratch1 register and jump to it, so scratch1 needs to be written with the entry address of the binary beforehand.

_snitch_park:
    # Set trap vector
    la      t0, _snitch_resume
    csrw    mtvec, t0
    # Enable software and cluster interrupts
    csrsi   mstatus, MSTATUS_MIE # CSR set (uimm)
    lui     t0, 0x80  # (1 << 19) cluster interrupts
    addi    t0, t0, 8 # (1 << 3) software interrupts
    csrw    mie, t0
    wfi

_snitch_resume:
    auipc   t0, 0
    # We need to know the address of the scratch1 register in
    # the peripherals, which is a constant offset of our current PC,
    # independent of the cluster configuration.
    # This offset can be calculated as follows:
    # - 0x20 (offset of this auipc from the start of the bootrom)
    # + 0x1000 (bootrom size 4kB)
    # + 0x188 (offset of the scratch1 register)
    li      t1, 0x1168
    add     t0, t0, t1
    lw      t0, 0(t0)
    jalr    ra, 0(t0)
    j       _snitch_park

To make the jump to the scratch register independent of any configuration, we decided to fix the bootrom size to 4 kB and to place the bootrom after the TCDM and before the peripherals in the address map.
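
As a sanity check on the constants used in the listing, the following Python sketch redoes the arithmetic from the code comments. The three constants are taken from those comments; the function names and any base address passed in are made up for illustration and do not come from the repository.

```python
# Constants taken from the comments in the bootrom listing.
RESUME_OFFSET = 0x20     # offset of `_snitch_resume` (the auipc) from the bootrom base
BOOTROM_SIZE = 0x1000    # fixed 4 kB bootrom
SCRATCH1_OFFSET = 0x188  # offset of scratch1 within the peripherals


def scratch1_pc_offset() -> int:
    """Distance from the PC read by `auipc t0, 0` to the scratch1 register."""
    return BOOTROM_SIZE + SCRATCH1_OFFSET - RESUME_OFFSET


def scratch1_address(bootrom_base: int) -> int:
    """Absolute scratch1 address for a given (hypothetical) bootrom base."""
    pc = bootrom_base + RESUME_OFFSET  # value produced by `auipc t0, 0`
    return pc + scratch1_pc_offset()


def mie_mask() -> int:
    """Value written to mie by `lui t0, 0x80` followed by `addi t0, t0, 8`."""
    return (0x80 << 12) + 8
```

Because the peripherals directly follow the fixed-size bootrom, the offset depends only on these three constants and not on where the cluster sits in the address map.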

By default, the internal bootrom is now enabled, but it can be disabled in the configuration with the int_bootrom_enable flag. The native bootrom can be enabled with or without the alias feature. If both AliasRegionEnable and IntBootromEnable are set, the cores will start fetching from BootRomAliasStart. Otherwise, the boot address needs to be provided with BootAddr as before.
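
The selection rule in the paragraph above can be sketched as a small Python function. This is only a model of the described behaviour, not the RTL: the `ClusterCfg` class and its snake_case field names are hypothetical stand-ins for the parameters named in the text.

```python
from dataclasses import dataclass


@dataclass
class ClusterCfg:
    # Hypothetical container; fields mirror the parameters named in the description.
    alias_region_enable: bool  # AliasRegionEnable
    int_bootrom_enable: bool   # IntBootromEnable
    bootrom_alias_start: int   # BootRomAliasStart
    boot_addr: int             # BootAddr


def reset_fetch_address(cfg: ClusterCfg) -> int:
    """Address the cores start fetching from, per the rule described above."""
    if cfg.alias_region_enable and cfg.int_bootrom_enable:
        return cfg.bootrom_alias_start
    # In every other combination the integrator supplies the boot address.
    return cfg.boot_addr
```

Note that, under this reading, enabling the internal bootrom without the alias region still requires BootAddr to be set by the integrator.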

To support writing the entry address of the binary to the scratch register, as well as triggering an interrupt, I created a VIP module (inspired by Cheshire), which combines all tasks necessary to write to the cluster from outside.
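
The sequence such an external agent has to follow can be illustrated with a toy Python model. It is not the VIP module itself: the class, the function names and the entry address used in any example are hypothetical, and the model clears the interrupt automatically, which real software would have to do itself.

```python
class ParkedCore:
    """Toy model of one core executing the bootrom from the description."""

    def __init__(self):
        self.scratch1 = 0         # stands in for the scratch1 peripheral register
        self.irq_pending = False  # stands in for a pending cluster/software interrupt
        self.trace = []

    def step(self, programs):
        """Run one park/resume round; `programs` maps entry address -> callable."""
        if not self.irq_pending:
            self.trace.append("wfi")  # _snitch_park: stay parked
            return
        self.irq_pending = False      # simplification: cleared by the model
        entry = self.scratch1         # lw t0, 0(t0)
        self.trace.append(f"jalr {entry:#x}")
        programs[entry]()             # jalr ra, 0(t0)
        self.trace.append("park")     # j _snitch_park


def host_boot(core, entry):
    """What an external agent does: write the entry point, then wake the core."""
    core.scratch1 = entry
    core.irq_pending = True
```

The order in `host_boot` matters: the entry address must be in scratch1 before the interrupt is raised, otherwise the core would jump to a stale value.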

TODO

Add bootrom to rtl target and prerequisites
Evaluate area overhead with a 512-bit wide interconnect.
Adapt configuration for static 4kB bootrom
Change bootrom, fix testbench
Install newest verilator version (v5.032) on IIS systems
Credit initial authors before merging

Co-authored-by: Milos Hirsl <[email protected]>
Co-authored-by: Thierry Dubochet <[email protected]>

colluca

If I understand correctly, the motivation for this PR is to increase the probability that the Snitch cores boot successfully at test time, as booting from an internal boot ROM decouples the boot process from the reliability of external components, e.g. the system-level interconnect.

While in Occamy the boot ROM was unfortunately located in a different clock domain, past many interconnect adapters, other defenses could be put in place. For instance, in a system such as FlooOccamy, the boot ROM could occupy just one of the many tiles in the system-level NoC. Unreliability in the components on this path would most likely kill the deployment of any application, independent of the reliability of the boot process. Thus, I still struggle to see the usefulness of this PR.

I think before proceeding with this PR, we need to clearly sort out and weigh its actual advantages and disadvantages. I still don't see any advantage, while I see the following disadvantages:

Hardware cost. What is the real cost of adding another port to the wide interconnect? In the thesis, the cost was estimated with a 64-bit wide interconnect configuration. The reported +60 GE sounds unrealistic to me even considering this interconnect, together with the axi_to_mem adapter, (small) boot ROM and additional scratch registers.
Increased code complexity and maintenance effort.
Increased configuration complexity.

If we then choose to go on with this PR, I suggest making the internal boot feature parametrizable so that the associated hardware cost can be fully removed at configuration time. More comments follow in the review. In any case, we can merge the scratch registers.

hw/snitch_cluster/src/snitch_cluster.sv

hw/snitch_cluster/src/snitch_cluster_peripheral/snitch_cluster_peripheral_reg.hjson

target/snitch_cluster/cfg/default.hjson

target/snitch_cluster/util/gen_bootrom.py

Bender.yml

target/snitch_cluster/Makefile

hw/snitch_cluster/src/snitch_cluster_peripheral/Makefile

fischeti · 2025-01-17T15:08:03Z

> If I understand correctly, the motivation for this PR is to increase the probability that the Snitch cores boot successfully at test time, as booting from an internal boot ROM decouples the boot process from the reliability of external components, e.g. the system-level interconnect.

> While in Occamy the boot ROM was unfortunately located in a different clock domain, past many interconnect adapters, other defenses could be put in place. For instance, in a system such as FlooOccamy, the boot ROM could occupy just one of the many tiles in the system-level NoC. Unreliability in the components on this path would most likely kill the deployment of any application, independent of the reliability of the boot process. Thus, I still struggle to see the usefulness of this PR.

The main reason to have a native cluster bootrom is that you no longer have a single point of failure that is far away (potentially in a different clock domain). High contention from multiple clusters on a single bootrom is also not really desirable. Finally, it is much easier to verify the bootrom at the cluster level (e.g. with post-layout simulation), which is almost impossible at the top level.

> I think before proceeding with this PR, we need to clearly sort out and weigh its actual advantages and disadvantages. I still don't see any advantage, while I see the following disadvantages:

For me, the advantage is that the bootrom might actually work 😉

> Hardware cost. What is the real cost of adding another port to the wide interconnect? In the thesis, the cost was estimated with a 64-bit wide interconnect configuration. The reported +60 GE sounds unrealistic to me even considering this interconnect, together with the axi_to_mem adapter, (small) boot ROM and additional scratch registers.

We can evaluate this again for a more realistic 512-bit configuration. I think the reported 60 GE covers only the bootrom itself and does not account for the increase in Xbar size and the additional scratch registers. But my feeling is that the overhead will not be too significant after tool optimization.

> Increased code complexity and maintenance effort.

I don't think the code is that complex and I don't really see a big maintenance overhead. The bootrom in this repository should not really change, since it anyway needs to be adapted/overwritten for systems that integrate the snitch cluster.

> Increased configuration complexity.

The current way it is configured has room for improvement, yes.

> If we then choose to go on with this PR, I suggest making the internal boot feature parametrizable so that the associated hardware cost can be fully removed at configuration time. More comments follow in the review. In any case, we can merge the scratch registers.

This makes sense to me, and could be done.

Maybe @paulsc96 and @thommythomaso can also give their two cents on it, since they have the most experience with malfunctioning bootroms and initially proposed this thesis.

paulsc96 · 2025-01-17T15:21:54Z

I think @fischeti summarized the advantages quite well. Adding this internal boot ROM turns the cluster into a much more autonomous IP, greatly simplifying integration and testing and avoiding fatal integration mistakes.

It also speeds up initialization in multi-cluster systems dramatically (which may be critical for useful large-scale simulation) and saves integrators lots of valuable time in integration and system-level testing.

Finally, it provides a standard interface for interacting with clusters, and the ability to freely repoint execution is a versatile escape hatch in case integration mistakes are made or other problems arise. We actually relied on such an escape hatch through an incidental TLB in an extremely critical chip very recently, so it is a wise choice to integrate such a mechanism straight into the cluster.

EDIT: Regarding the area cost: while this may add some area (at most 1%, realistically?), there are numerous inefficiencies in the cluster I would tackle to reduce area long before this would become a concern.

colluca · 2025-01-17T16:01:07Z

> It also speeds up initialization in multi-cluster systems dramatically (which may be critical for useful large-scale simulation) and saves integrators lots of valuable time in integration and system-level testing.

I thought about this, but you're gonna run into the same slowdown right after you terminate the first few boot ROM instructions, so I don't think the speedup will be dramatic.

> Finally, it provides a standard interface for interacting with clusters, and the ability to freely repoint execution is a versatile escape hatch in case integration mistakes are made or other problems arise. We actually relied on such an escape hatch through an incidental TLB in an extremely critical chip very recently, so it is a wise choice to integrate such a mechanism straight into the cluster.

Well, this is only true if the boot ROM is actually reused, and the integrating systems don't just implement their own (in contrast with what @fischeti said). But in that case the same can be obtained even if we don't instantiate it within each cluster. Couldn't we just provide the boot ROM code and utilities, and test it by integrating it within the testbench, while still not instantiating it in every cluster?

fischeti · 2025-01-17T20:05:29Z

> > Finally, it provides a standard interface for interacting with clusters, and the ability to freely repoint execution is a versatile escape hatch in case integration mistakes are made or other problems arise. We actually relied on such an escape hatch through an incidental TLB in an extremely critical chip very recently, so it is a wise choice to integrate such a mechanism straight into the cluster.

> Well, this is only true if the boot ROM is actually reused, and the integrating systems don't just implement their own (in contrast with what @fischeti said). But in that case the same can be obtained even if we don't instantiate it within each cluster. Couldn't we just provide the boot ROM code and utilities, and test it by integrating it within the testbench, while still not instantiating it in every cluster?

I changed my mind regarding this. As Paul said, I think it would make more sense to provide the snitch cluster as a standalone IP with a proper default bootrom that makes it easy to integrate into an actual system without the need to implement your own custom one (which would still be possible if needed). So the default bootrom should already implement the trampoline functionality with the scratch register.

I don't really see how integrating the generated bootrom into the testbench would improve the situation. The goal in the end is that we can verify the cluster boot procedure standalone, which is only possible if you have a block that combines the cluster and bootrom. Of course, you could write a wrapper around the snitch cluster that includes a bootrom, but this only shifts the complexity and makes integration unnecessarily complex.

I can do a synthesis run to compare the increase in complexity in the Xbar, but some increase is justified in my opinion given that this increases the flexibility and safety of the boot process.

fischeti changed the title ~~hw: Add native bootrom to the cluster~~ hw: Add native bootrom Jul 22, 2024

fischeti marked this pull request as ready for review July 29, 2024 07:23

fischeti requested review from paulsc96, lucabertaccini and colluca as code owners July 29, 2024 07:23

fischeti force-pushed the bootrom branch 2 times, most recently from 7026e09 to 85232d4 Compare August 5, 2024 07:36

paulsc96 marked this pull request as draft August 6, 2024 15:17

fischeti force-pushed the bootrom branch from 85232d4 to d3513f9 Compare August 29, 2024 17:11

fischeti marked this pull request as ready for review August 30, 2024 13:11

colluca requested changes Jan 17, 2025

fischeti marked this pull request as draft January 20, 2025 14:21

Thierry Dubochet and others added 14 commits January 22, 2025 14:02

Chagned makro names

0a087b3

Setup Bootrom Generation.

fc90063

Setup toolchain for generating bootrom. Created scratch registers. Begun setup for instantiating bootrom.

Bootable Bootrom instantiated.

f9a4937

Bootrom fully instantiated. Fetches Bootrom instructions in simulation.

Setup Bootrom Generation.

809282d

Setup toolchain for generating bootrom. Created scratch registers. Begun setup for instantiating bootrom.

Bootable Bootrom instantiated.

1187157

Bootrom fully instantiated. Fetches Bootrom instructions in simulation.

Linked Bootrom Scratch Register to correct address.

86a13d7

Fixed Scratch Register linking address. Modified regfile for synthesis.

06b0dda

Modified bootrom to avoid synthesis issues.

bd87217

Moved Bootrom Space into Cluster Peripheral.

79c2fc7

Changed Scratch Registers Software Permissions to allow Read/Write an…

e3bbbb7

…d set Performance Counters to track Retired Instructions during Boot Procedure.

Adjusted assignment size from generated boot data to actual boot size.

67b808c

Fixed assign statement.

b246b63

Modified bootrom after debugging to fetch address at scratch register 1.

5733e0a

git: Remove unused config file

32293ed

container: Update verilator version

61a2b5a

fischeti force-pushed the bootrom branch from 1a52bd7 to 61a2b5a Compare January 22, 2025 13:04

fischeti added 20 commits January 22, 2025 14:14

hw: Fix non-bootrom configurations

1f6ee92

`Bootrom` is not a legal index if `IntBootromEnable` is unset

sw: Fix base address of peripherals

4bb5d83

hw: Try to fix elaboration

aed6c8a

env: Bump VCS and Questa versions

a5f5326

sw: Fix perf_cnt test

a63105a

Starting the performance counter takes very long for some reason, and DMA transfer starts before counter starts tracking

Revert "sw: Fix perf_cnt test"

f8813bd

This reverts commit a63105a.

sw: Fix address map

f41cebf

bootrom: Remove l3 base

a86fcbb

banshee: Fix cl_clint address in config

6ee3b63

hw: Clarify parameter comments

c45cfb8

vlt: Use verilator from oseda environment

464f6a5

ci: Use oseda environment for vlt compilation

e3d68fe

vlt: Clean up verilator build

1f9777b

Revert "ci: Use oseda environment for vlt compilation"

5e27beb

This reverts commit e3d68fe.

vlt: Wrap simulation binary in script

f60f8a7

This is required to run it on IIS machines since verilator currently can only be run in the oseda environment

vlt: Use absolute path of verilator binary in the wrapper

c633547

vlt: Pass additional variables for IPC

4aaf2ca

ipc: Allocate enough size for nullcharacter

3e0db3a

test: Implement get_bin_entry for verilator without using bootdata

c673d4e

util: Move bootrom generation script

5469bca

fischeti marked this pull request as ready for review January 27, 2025 11:28

fischeti requested a review from viv-eth as a code owner January 27, 2025 11:28

treewide: Final revision, minor formatting changes and software bug fix

81693c9

colluca approved these changes Jan 27, 2025

colluca merged commit c7eb9c2 into main Jan 28, 2025
27 checks passed

colluca deleted the bootrom branch January 28, 2025 08:51

This was referenced Feb 18, 2025

treewide: Fix regressions from #168 #204

Merged

target: Fix testbench clock period regression from #168 #205

Open
