hw: Add native bootrom #168

Merged
merged 90 commits into from
Jan 28, 2025

Conversation

fischeti
Contributor

@fischeti fischeti commented Jul 19, 2024

Adds a native bootrom to the cluster, so the cores no longer fetch their boot code from an external bootrom. Also adds additional scratch registers to the peripherals, which can be used to hold the entry point of the binary.

The currently implemented bootrom is shown below. It enables the interrupts of the cluster-internal CLINT as well as software interrupts, and then parks the cores with wfi. After an interrupt, the cores load the address stored in the scratch1 register and jump to it, so scratch1 must be written with the entry address of the binary beforehand.

_snitch_park:
    # Set trap vector
    la      t0, _snitch_resume
    csrw    mtvec, t0
    # Enable software and cluster interrupts
    csrsi   mstatus, MSTATUS_MIE # CSR set (uimm)
    lui     t0, 0x80  # (1 << 19) cluster interrupts
    addi    t0, t0, 8 # (1 << 3) software interrupts
    csrw    mie, t0
    wfi

_snitch_resume:
    auipc   t0, 0
    # We need the address of the scratch1 register in the
    # peripherals, which sits at a constant offset from our current PC,
    # independent of the cluster configuration.
    # The offset is calculated as follows:
    #   - 0x20   (back to the start of this bootrom)
    #   + 0x1000 (bootrom size, 4kB)
    #   + 0x188  (offset of the scratch1 register)
    #   = 0x1168
    li      t1, 0x1168
    add     t0, t0, t1
    lw      t0, 0(t0)
    jalr    ra, 0(t0)
    j       _snitch_park
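
As a quick sanity check of the constants in `_snitch_park`, the following Python sketch models what the `lui`/`addi` pair leaves in `t0` before it is written to `mie`. The bit positions (19 for cluster interrupts, 3 for software interrupts) are taken from the comments in the listing; the helper functions `lui` and `addi` are made up for this illustration and only model the cases used here.

```python
def lui(imm20: int) -> int:
    """Model RV32 `lui`: place a 20-bit immediate in bits 31..12."""
    assert 0 <= imm20 < (1 << 20)
    return (imm20 << 12) & 0xFFFFFFFF


def addi(value: int, imm12: int) -> int:
    """Model RV32 `addi` for a small non-negative immediate only."""
    assert 0 <= imm12 < 0x800  # negative immediates are not modelled here
    return (value + imm12) & 0xFFFFFFFF


CLUSTER_IRQ_BIT = 19   # from the listing's comment: (1 << 19)
SOFTWARE_IRQ_BIT = 3   # from the listing's comment: (1 << 3)

# Value that `lui t0, 0x80` followed by `addi t0, t0, 8` produces.
mie_value = addi(lui(0x80), 8)

# Value the comments say should end up in `mie`.
expected_mask = (1 << CLUSTER_IRQ_BIT) | (1 << SOFTWARE_IRQ_BIT)

print(hex(mie_value), hex(expected_mask))
```

Both expressions evaluate to the same mask, so the two instructions do set exactly the two interrupt-enable bits named in the comments.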

To make the jump to the scratch register independent of any configuration, we decided to fix the bootrom size to 4kB, which is placed after the TCDM and before the peripherals in the address map.
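
The PC-relative offset used in `_snitch_resume` can be reproduced with a small Python sketch. It assumes that every instruction in `_snitch_park` is encoded uncompressed (4 bytes) and that `la` expands to two instructions, which places `_snitch_resume` at offset 0x20 in the bootrom. The base address in the usage example is purely hypothetical and only serves to show that the result does not depend on it.

```python
BOOTROM_SIZE = 0x1000     # fixed 4kB bootrom, placed right before the peripherals
SCRATCH1_OFFSET = 0x188   # offset of scratch1 within the peripherals (from the listing)

# _snitch_park: la (2 instructions) + csrw + csrsi + lui + addi + csrw + wfi
PARK_INSTRUCTIONS = 8
RESUME_OFFSET = PARK_INSTRUCTIONS * 4  # assumption: no compressed encodings


def scratch1_addr(resume_pc: int) -> int:
    """Address of scratch1 as seen from the PC captured by `auipc t0, 0`."""
    bootrom_base = resume_pc - RESUME_OFFSET
    periph_base = bootrom_base + BOOTROM_SIZE
    return periph_base + SCRATCH1_OFFSET


# Constant that the bootrom loads with `li t1, ...`.
pc_relative_offset = -RESUME_OFFSET + BOOTROM_SIZE + SCRATCH1_OFFSET

# Hypothetical bootrom base address, for illustration only.
example_bootrom_base = 0x1002_0000
example_pc = example_bootrom_base + RESUME_OFFSET
print(hex(pc_relative_offset), hex(scratch1_addr(example_pc)))
```

Because the bootrom size is fixed, `pc_relative_offset` is the same for every cluster configuration; only the absolute addresses move with the address map.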

By default, the internal bootrom is now enabled, but it can be disabled in the configuration with the int_bootrom_enable flag. The native bootrom can be enabled with or without the alias feature. If both AliasRegionEnable and IntBootromEnable are set, the cores will start fetching from BootRomAliasStart. Otherwise, the boot address needs to be provided with BootAddr as before.
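
The selection rule described above can be written down as a small Python sketch. The field names mirror the parameters mentioned in this description, but the `ClusterBootCfg` class, the `reset_fetch_addr` function and all address values are hypothetical and not part of the RTL.

```python
from dataclasses import dataclass


@dataclass
class ClusterBootCfg:
    alias_region_enable: bool   # mirrors AliasRegionEnable
    int_bootrom_enable: bool    # mirrors IntBootromEnable
    boot_rom_alias_start: int   # mirrors BootRomAliasStart
    boot_addr: int              # mirrors BootAddr


def reset_fetch_addr(cfg: ClusterBootCfg) -> int:
    """Address the cores start fetching from, per the rule in the description."""
    if cfg.alias_region_enable and cfg.int_bootrom_enable:
        return cfg.boot_rom_alias_start
    # In every other case the integrator provides the boot address.
    return cfg.boot_addr


# Hypothetical addresses, for illustration only.
with_alias = ClusterBootCfg(True, True, 0x3002_0000, 0x0100_0000)
without_alias = ClusterBootCfg(False, True, 0x3002_0000, 0x1002_0000)
print(hex(reset_fetch_addr(with_alias)), hex(reset_fetch_addr(without_alias)))
```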

To support writing the entry address of the binary to the scratch register, as well as triggering an interrupt, I created a VIP module (inspired by Cheshire), which combines all tasks necessary to write to the cluster from outside.
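
The two steps the VIP has to perform from outside the cluster can be sketched as follows. This is a behavioral Python model under stated assumptions, not the VIP itself: `RecordingBus`, `launch_binary` and all addresses are hypothetical, and the address of the interrupt-set register is passed in by the caller because it is not given in this description. Only the scratch1 offset (0x188) comes from the bootrom listing.

```python
SCRATCH1_OFFSET = 0x188  # offset of scratch1 within the peripherals (from the listing)


class RecordingBus:
    """Stand-in for an external bus master; it just records 32-bit writes."""

    def __init__(self) -> None:
        self.writes: list[tuple[int, int]] = []

    def write32(self, addr: int, data: int) -> None:
        self.writes.append((addr, data & 0xFFFFFFFF))


def launch_binary(bus: RecordingBus, periph_base: int, irq_set_addr: int,
                  entry_addr: int, core_mask: int) -> None:
    """Write the entry point, then wake the parked cores.

    The order matters: scratch1 must hold the entry address before the
    interrupt fires, because the bootrom reads it right after waking up.
    """
    bus.write32(periph_base + SCRATCH1_OFFSET, entry_addr)
    bus.write32(irq_set_addr, core_mask)


# Hypothetical addresses and core mask, for illustration only.
bus = RecordingBus()
launch_binary(bus, periph_base=0x1002_1000, irq_set_addr=0x1002_1030,
              entry_addr=0x8000_0000, core_mask=0x1FF)
for addr, data in bus.writes:
    print(hex(addr), hex(data))
```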

TODO

  • Add bootrom to rtl target and prerequisites
  • Evaluate area overhead with a 512-bit wide interconnect.
  • Adapt configuration for static 4kB bootrom
  • Change bootrom, fix testbench
  • Install newest verilator version (v5.032) on IIS systems
  • Credit initial authors before merging
Co-authored-by: Milos Hirsl <[email protected]>
Co-authored-by: Thierry Dubochet <[email protected]>

@fischeti fischeti changed the title hw: Add native bootrom to the cluster hw: Add native bootrom Jul 22, 2024
@fischeti fischeti marked this pull request as ready for review July 29, 2024 07:23
@fischeti fischeti force-pushed the bootrom branch 2 times, most recently from 7026e09 to 85232d4 Compare August 5, 2024 07:36
@paulsc96 paulsc96 marked this pull request as draft August 6, 2024 15:17
@fischeti fischeti marked this pull request as ready for review August 30, 2024 13:11
Collaborator

@colluca colluca left a comment

If I understand correctly, the motivation for this PR is to increase the probability that the Snitch cores boot successfully at test time, as booting from an internal boot ROM decouples the boot process from the reliability of external components, e.g. the system-level interconnect.

While in Occamy the boot ROM was unfortunately located in a different clock domain, past many interconnect adapters, other defenses could be put in place. For instance, in a system such as FlooOccamy, the boot ROM could occupy just one of the many tiles in the system-level NoC. Unreliability in the components on this path would most likely kill the deployment of any application, independent of the reliability of the boot process. Thus, I still struggle to see the usefulness of this PR.

I think before proceeding with this PR, we need to clearly sort out and weigh its actual advantages and disadvantages. I still don't see any advantage, while I see the following disadvantages:

  1. Hardware cost. What is the real cost of adding another port to the wide interconnect? In the thesis, the cost was estimated with a 64-bit wide interconnect configuration. The reported +60 GE sounds unrealistic to me even considering this interconnect, together with the axi_to_mem adapter, (small) boot ROM and additional scratch registers.
  2. Increased code complexity and maintenance effort.
  3. Increased configuration complexity.

If we then choose to go on with this PR, I suggest making the internal boot feature parametrizable so that the associated hardware cost can be fully removed at configuration time. More comments follow in the review. In any case, we can merge the scratch registers.

@fischeti
Contributor Author

> If I understand correctly, the motivation for this PR is to increase the probability that the Snitch cores boot successfully at test time, as booting from an internal boot ROM decouples the boot process from the reliability of external components, e.g. the system-level interconnect.
>
> While in Occamy the boot ROM was unfortunately located in a different clock domain, past many interconnect adapters, other defenses could be put in place. For instance, in a system such as FlooOccamy, the boot ROM could occupy just one of the many tiles in the system-level NoC. Unreliability in the components on this path would most likely kill the deployment of any application, independent of the reliability of the boot process. Thus, I still struggle to see the usefulness of this PR.

The main reason to have a native cluster bootrom is that you no longer have a single point of failure that is far away (potentially in a different clock domain). High contention from multiple clusters on a single bootrom is not really desirable either. It is also much easier to verify the bootrom at the cluster level (e.g. with post-layout simulation), which is almost impossible at the top level.

> I think before proceeding with this PR, we need to clearly sort out and weigh its actual advantages and disadvantages. I still don't see any advantage, while I see the following disadvantages:

For me, the advantage is that the bootrom might actually work 😉

> 1. Hardware cost. What is the real cost of adding another port to the wide interconnect? In the thesis, the cost was estimated with a 64-bit wide interconnect configuration. The reported +60 GE sounds unrealistic to me even considering this interconnect, together with the axi_to_mem adapter, (small) boot ROM and additional scratch registers.

We can evaluate this again for a more realistic 512-bit configuration. I think the reported 60 GE cover only the bootrom itself and do not account for the increase in Xbar size or the additional scratch registers. But my feeling is that, with tool optimization, the overhead will not be too significant.

> 2. Increased code complexity and maintenance effort.

I don't think the code is that complex and I don't really see a big maintenance overhead. The bootrom in this repository should not really change, since it needs to be adapted/overwritten anyway for systems that integrate the snitch cluster.

> 3. Increased configuration complexity.

The current way it is configured has room for improvement, yes.

> If we then choose to go on with this PR, I suggest making the internal boot feature parametrizable so that the associated hardware cost can be fully removed at configuration time. More comments follow in the review. In any case, we can merge the scratch registers.

This makes sense to me, and could be done.

Maybe @paulsc96 and @thommythomaso can also give their two cents on it, since they have the most experience with malfunctioning bootroms and initially proposed this thesis.

@paulsc96
Member

paulsc96 commented Jan 17, 2025

I think @fischeti summarized the advantages quite well. Adding this internal boot ROM turns the cluster into a much more autonomous IP, greatly simplifying integration and testing and avoiding fatal integration mistakes.

It also speeds up initialization in multi-cluster systems dramatically (which may be critical for useful large-scale simulation) and saves integrators lots of valuable time in integration and system-level testing.

Finally, it provides a standard interface for interacting with clusters, and the ability to freely repoint execution is a versatile escape hatch in case integration mistakes are made or other problems arise. We actually relied on such an escape hatch through an incidental TLB in an extremely critical chip very recently, so it is a wise choice to integrate such a mechanism straight into the cluster.

EDIT: Regarding the area cost: while this may add some area (at most 1%, realistically?), there are numerous inefficiencies in the cluster I would tackle to reduce area long before this would become a concern.

@colluca
Collaborator

colluca commented Jan 17, 2025

> It also speeds up initialization in multi-cluster systems dramatically (which may be critical for useful large-scale simulation) and saves integrators lots of valuable time in integration and system-level testing.

I thought about this, but you're going to run into the same slowdown right after the first few boot ROM instructions have executed, so I don't think the speedup will be dramatic.

> Finally, it provides a standard interface for interacting with clusters, and the ability to freely repoint execution is a versatile escape hatch in case integration mistakes are made or other problems arise. We actually relied on such an escape hatch through an incidental TLB in an extremely critical chip very recently, so it is a wise choice to integrate such a mechanism straight into the cluster.

Well, this is only true if the boot ROM is actually reused, and the integrating systems don't just implement their own (in contrast with what @fischeti said). But in that case the same can be obtained even if we don't instantiate it within each cluster. Couldn't we just provide the boot ROM code and utilities, and test it by integrating it within the testbench, while still not instantiating it in every cluster?

@fischeti
Contributor Author

> Finally, it provides a standard interface for interacting with clusters, and the ability to freely repoint execution is a versatile escape hatch in case integration mistakes are made or other problems arise. We actually relied on such an escape hatch through an incidental TLB in an extremely critical chip very recently, so it is a wise choice to integrate such a mechanism straight into the cluster.

> Well, this is only true if the boot ROM is actually reused, and the integrating systems don't just implement their own (in contrast with what @fischeti said). But in that case the same can be obtained even if we don't instantiate it within each cluster. Couldn't we just provide the boot ROM code and utilities, and test it by integrating it within the testbench, while still not instantiating it in every cluster?

I changed my mind regarding this. As Paul said, I think it would make more sense to provide the snitch cluster as a standalone IP with a proper default bootrom that makes it easy to integrate into an actual system without the need to implement a custom one (which would still be possible if needed). So the default bootrom should already implement the trampoline functionality with the scratch register.

I don't really see how integrating the generated bootrom into the testbench would improve the situation. The goal in the end is that we can verify the cluster boot procedure standalone, which is only possible if you have a block that combines the cluster and bootrom. Of course, you could write a wrapper around the snitch cluster that includes a bootrom, but this only shifts the complexity and makes integration unnecessarily complex.

I can do a synthesis run to compare the increase in complexity in the Xbar, but some increase is justified in my opinion given that this increases the flexibility and safety of the boot process.

@fischeti fischeti marked this pull request as draft January 20, 2025 14:21
Thierry Dubochet and others added 14 commits January 22, 2025 14:02
Setup toolchain for generating bootrom.
Created scratch registers.
Begun setup for instantiating bootrom.
Bootrom fully instantiated.
Fetches Bootrom instructions in simulation.
Setup toolchain for generating bootrom.
Created scratch registers.
Begun setup for instantiating bootrom.
Bootrom fully instantiated.
Fetches Bootrom instructions in simulation.
…d set Performance Counters to track Retired Instructions during Boot Procedure.
@fischeti fischeti marked this pull request as ready for review January 27, 2025 11:28
@fischeti fischeti requested a review from viv-eth as a code owner January 27, 2025 11:28
@colluca colluca merged commit c7eb9c2 into main Jan 28, 2025
27 checks passed
@colluca colluca deleted the bootrom branch January 28, 2025 08:51
4 participants