Commit
✨ add memory coherency logic (#1176)
stnolting authored Feb 4, 2025
2 parents 15e447d + ecf7cb9 commit f56cade
Showing 33 changed files with 653 additions and 1,047 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -29,6 +29,7 @@ mimpid = 0x01040312 -> Version 01.04.03.12 -> v1.4.3.12

| Date | Version | Comment | Ticket |
|:----:|:-------:|:--------|:------:|
| 03.02.2025 | 1.11.0.8 | :sparkles: add explicit memory ordering/coherence support; :warning: remove WDT "halt-on-debug" and "halt-on-sleep" options; :bug: rework cache module fixing several (minor?) design flaws | [#1176](https://github.com/stnolting/neorv32/pull/1176) |
| 03.02.2025 | 1.11.0.7 | :bug: add missing CFS clock gen enable signal | [#1177](https://github.com/stnolting/neorv32/pull/1177) |
| 01.02.2025 | 1.11.0.6 | :warning: remove XIP module | [#1175](https://github.com/stnolting/neorv32/pull/1175) |
| 01.02.2025 | 1.11.0.5 | minor rtl optimizations and cleanups; :warning: remove DMA "fence" feature | [#1174](https://github.com/stnolting/neorv32/pull/1174) |
45 changes: 21 additions & 24 deletions docs/datasheet/cpu.adoc
@@ -1,3 +1,4 @@
<<<
:sectnums:
== NEORV32 Central Processing Unit (CPU)

@@ -66,7 +67,7 @@ direction as seen from the CPU.
[options="header", grid="rows"]
|=======================
| Signal | Width/Type | Dir | Description
4+^| **Global Signals**
4+^| **Clock and reset**
| `clk_i` | 1 | in | Global clock line, all registers triggering on rising edge.
| `rstn_i` | 1 | in | Global reset, low-active.
4+^| **Interrupts (<<_traps_exceptions_and_interrupts>>)**
@@ -75,20 +76,17 @@ direction as seen from the CPU.
| `mti_i` | 1 | in | RISC-V machine timer interrupt.
| `firq_i` | 16 | in | Custom fast interrupt request signals.
| `dbi_i` | 1 | in | Request CPU to halt and enter debug mode (RISC-V <<_on_chip_debugger_ocd>>).
4+^| **<<_inter_core_communication_icc>> links**
| `icc_tx_o` | `icc_t` | out | TX link
| `icc_rx_i` | `icc_t` | in | RX link
4+^| **Instruction <<_bus_interface>>**
| `ibus_req_o` | `bus_req_t` | out | Instruction fetch bus request.
| `ibus_rsp_i` | `bus_rsp_t` | in | Instruction fetch bus response.
4+^| **Data <<_bus_interface>>**
| `dbus_req_o` | `bus_req_t` | out | Data access (load/store) bus request.
| `dbus_rsp_i` | `bus_rsp_t` | in | Data access (load/store) bus response.
4+^| **<<_inter_core_communication_icc>> TX links**
| `icc_tx_rdy_o` | 2 | out | Data available for cores `0..1`.
| `icc_tx_ack_i` | 2 | in | Read-enable from cores `0..1`.
| `icc_tx_dat_o` | 2*32 | out | Data for cores `0..1`.
4+^| **<<_inter_core_communication_icc>> RX links**
| `icc_rx_rdy_i` | 2 | in | Data available from cores `0..1`.
| `icc_rx_ack_o` | 2 | out | Read-enable for cores `0..1`.
| `icc_rx_dat_i` | 2*32 | in | Data from cores `0..1`.
4+^| **<<_memory_coherence>> status**
| `mem_sync_i` | 1 | in | Requested coherence established when set (single-shot)
|=======================

.Bus Interface Protocol
@@ -424,12 +422,11 @@ always valid when set.
| `rw` | 1 | Access direction (`0` = read, `1` = write)
| `src` | 1 | Access source (`0` = instruction fetch, `1` = load/store)
| `priv` | 1 | Set if privileged (M-mode) access
| `debug` | 1 | Set if debug mode access
| `amo` | 1 | Set if current access is an atomic memory operation (<<_atomic_memory_access>>)
| `amoop` | 4 | Type of atomic memory operation (<<_atomic_memory_access>>)
3+^| **Out-Of-Band Signals**
| `fence` | 1 | Data/instruction fence request; single-shot
| `sleep` | 1 | Set if ALL upstream devices are in <<_sleep_mode>>
| `debug` | 1 | Set if the upstream device is in debug-mode
| `fence` | 1 | Data (load/store; `fence`) or instruction (instruction-fetch; `fence.i`) fence request; single-shot; see <<_memory_coherence>>
|=======================

.Bus Interface - Response Bus (`bus_rsp_t`)
@@ -463,7 +460,7 @@ The figure below shows three exemplary bus accesses:
. A write access to address `B_addr` writing `wdata` (fastest response; `ACK` arrives right in the next cycle).
. A failing read access to address `C_addr` (slow response; `ERR` arrives after several cycles).

.Three Exemplary Bus Transactions (showing only in-band signals)
.Three Exemplary Bus Transactions (showing only in-band signals; privileged non-debug non-atomic accesses)
image::bus_interface.png[700]

.Adding Register Stages
@@ -501,8 +498,8 @@ operation:

.Cache Coherency
[IMPORTANT]
Atomic operations **always bypass** the CPU caches using direct/uncached accesses. Care must be taken
to maintain data <<_cache_coherency>>.
Atomic operations **always bypass** the (CPU) caches using direct/uncached accesses. Care must be taken
to maintain data synchronization. See section <<_memory_coherence>> for more information.


<<<
@@ -632,7 +629,7 @@ The `I` ISA extensions is the base RISC-V integer ISA that is always enabled.
| Jump/call | `jal[r]` | 6
| Load/store | `lb` `lh` `lw` `lbu` `lhu` `sb` `sh` `sw` | 5
| System | `ecall` `ebreak` | 3
| Data fence | `fence` | 5
| Data fence | `fence` | depends on the memory system
| System | `wfi` | 3
| System | `mret` | 5
| Illegal inst. | - | 3
@@ -641,10 +638,10 @@ The `I` ISA extensions is the base RISC-V integer ISA that is always enabled.
.`fence` Instruction
[NOTE]
Analogous to the `fence.i` instruction (<<_zifencei_isa_extension>>) the `fence` instruction triggers
a data cache synchronization operation. See section <<_cache_coherency>> for more information.
Furthermore, the `fence` instruction word's _predecessor_ and _successor_ bits (used for memory ordering)
are not evaluated by the hardware at all.

a load/store memory synchronization operation. The CPU will stall until the requested coherence is
established (`mem_sync_i` goes high). See section <<_memory_coherence>> for more information.
NEORV32 ignores the predecessor and successor fields and always executes a conservative fence on all
operations.
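
A minimal sketch of how software might use the data fence before handing a buffer to another bus master. The helper names `data_fence` and `publish_buffer` are hypothetical and not part of the NEORV32 software framework; on non-RISC-V hosts a generic barrier stands in for the `fence` instruction so the sketch still builds.

```c
#include <stdint.h>

/* Hypothetical helper (an assumption, not a NEORV32 library function):
 * issue a data fence. On RISC-V this emits the `fence` instruction; on any
 * other host a sequentially consistent barrier is used as a stand-in. */
static inline void data_fence(void) {
#if defined(__riscv)
  __asm__ volatile ("fence" ::: "memory");
#else
  __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Hypothetical producer: fill a buffer, then request memory synchronization
 * so that the data leaves the (write-back) data cache. Returns a checksum. */
static uint32_t publish_buffer(volatile uint32_t *buf, uint32_t words) {
  uint32_t sum = 0;
  for (uint32_t i = 0; i < words; i++) {
    buf[i] = i * 3u;
    sum += buf[i];
  }
  data_fence(); /* per the note above, the CPU stalls here until coherence is established */
  return sum;
}
```

Because the predecessor and successor fields are ignored, the plain `fence` form is sufficient here; narrower variants would behave the same way on this hardware.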

.`wfi` Instruction
[NOTE]
@@ -716,16 +713,16 @@ The instruction word's `aq` and `lr` memory ordering bits are not evaluated by t
==== `Zifencei` ISA Extension

The `Zifencei` CPU extension allows manual synchronization of the instruction stream. This extension is always enabled.

Analogous to the `fence` instruction the `fence.i` instruction triggers an instruction cache synchronization operation.
See section <<_cache_coherency>> for more information.
This instruction is the only standard mechanism to ensure that stores visible to a hart will also be visible to its
instruction fetches. The CPU will stall until the requested coherence is established (`mem_sync_i` goes high).
See section <<_memory_coherence>> for more information.

.Instructions and Timing
[cols="<2,<4,<3"]
[options="header", grid="rows"]
|=======================
| Class | Instructions | Execution cycles
| Instruction fence | `fence.i` | 5
| Instruction fence | `fence.i` | depends on the memory system
|=======================


4 changes: 3 additions & 1 deletion docs/datasheet/on_chip_debugger.adoc
@@ -667,7 +667,7 @@ Debug-mode is entered on any of the following events:
. A hardware trigger from the <<_trigger_module>> fires (`exe` and `action` in <<_tdata1>> / `mcontrol` are set).

[NOTE]
From a hardware point of view these debug-mode-entry conditions are special traps (synchronous exceptions or
From a hardware point of view these debug-mode-entry conditions are normal traps (synchronous exceptions or
asynchronous interrupts) that are handled transparently by the control logic.

**Whenever the CPU enters debug-mode it performs the following operations:**
@@ -684,6 +684,8 @@ asynchronous interrupts) that are handled transparently by the control logic.
**When the CPU is in debug-mode:**

* while in debug mode, the CPU executes the parking loop and - if requested by the DM - the program buffer
* all **caches are bypassed** when in debug-mode; hence, <<_memory_coherence>> has to be re-established when entering debug-mode
and when leaving debug-mode
* effective CPU privilege level is `machine` mode; any active physical memory protection (PMP) configuration is bypassed
* the `wfi` instruction acts as a `nop` (also during single-stepping)
* if an exception occurs while being in debug mode:
1 change: 1 addition & 0 deletions docs/datasheet/overview.adoc
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
<<<
:sectnums:
== Overview

1 change: 1 addition & 0 deletions docs/datasheet/rationale.adoc
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
<<<
:sectnums:
=== Rationale

54 changes: 30 additions & 24 deletions docs/datasheet/soc.adoc
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@

// ####################################################################################################################
<<<
:sectnums:
== NEORV32 Processor (SoC)

@@ -595,7 +594,7 @@ content of the addresses memory cell) is sent back to the requesting CPU.
.Direct Access
[IMPORTANT]
Atomic operations **always bypass** the CPU's <<_processor_internal_data_cache_dcache, data cache>>
using direct/uncached accesses. Care must be taken to maintain data <<_cache_coherency>>.
using direct/uncached accesses. Care must be taken to maintain data <<_memory_coherence>>.

.Physical Memory Attributes
[NOTE]
@@ -610,43 +609,50 @@ cannot be interrupted. Hence, they execute in an atomic way.


:sectnums:
==== Cache Coherency
==== Memory Coherence

In total the NEORV32 Processor provides up to three optional caches organized in two levels. Level-1
caches are closer to the CPU while level-2 caches are closer to main memory (however, this highly depends
on the actual cache configurations).
Depending on the configuration, the NEORV32 processor provides several _layers_ of memory consisting
of caches, buffers and storage.

* The CPU instruction prefetch buffer ("level-0")
* The <<_processor_internal_data_cache_dcache>> (level-1)
* The <<_processor_internal_instruction_cache_icache>> (level-1)
* The cache of the <<_processor_external_bus_interface_xbus>> (level-2)
* Internal and external memories

As all caches operate transparently for the software, special attention must therefore be paid to coherence.
Note that coherence and cache _synchronization_ is **not** performed by the hardware itself (there is no
snooping implemented).
All caches and buffers operate transparently for the software. Hence, special attention must be
paid to maintaining coherence. Note that coherence and cache _synchronization_ are **not** automatically performed
by the hardware itself as there is no snooping implemented.

The NEORV32 uses two instructions for manual cache synchronization (both instructions are always available
regardless of the actual CPU/ISA configuration):
NEORV32 uses two instructions for manual memory synchronization which are always available
regardless of the actual CPU/ISA configuration:

* `fence` (<<_i_isa_extension>> / <<_e_isa_extension>>)
* `fence.i` (<<_zifencei_isa_extension>>)

By executing the "data" `fence` instruction the CPU's data cache is synchronized in four steps:
By executing the "data" `fence` instruction the CPU's load/store operations are ordered
and synchronized across the entire system:

[start=1]
. The CPU data cache is flushed: all local modifications are copied to the next higher memory level;
this can be the XBUS cache or main memory.
. The CPU data cache is cleared invalidating all local entries.
. The synchronization request is sent to the next-higher memory level (for example to the XBUS cache
so it can perform the same synchronization steps).
. The CPU data cache is reloaded with up-to-date data from the next higher memory level.
. The CPU data cache (if enabled) is flushed: all local modifications are copied to
the next higher memory level (for example the internal DMEM or the XBUS-cache).
. The CPU data cache is invalidated so that the next load/store access causes a cache miss
and fetches up-to-date data from the memory system.
. The synchronization request is forwarded to the next-higher memory level. If the XBUS cache is implemented
it will also be flushed and invalidated.

By executing the "instruction" `fence.i` instruction the CPU's instruction cache is synchronized in three steps:
By executing the "instruction" `fence.i` instruction the CPU's instruction fetches are ordered
and synchronized across the entire system:

[start=1]
. The synchronization request is sent to the next-higher memory level (for example to the XBUS cache
so it can perform the same synchronization steps).
. The CPU instruction cache is cleared invalidating all local entries.
. The CPU instruction cache is reloaded with up-to-date data from the next higher memory level.
. Perform all the steps that are performed by the `fence` instruction.
. The CPU instruction cache is cleared, invalidating all local entries, so that the next instruction fetch
causes a cache miss and fetches up-to-date data from the memory system.

.CPU Stall While Synchronizing
[IMPORTANT]
Executing any fence instruction will stall the CPU until all the requested ordering/synchronization
steps are completed.
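
The flush and invalidate steps of the data `fence` can be illustrated with a small software model. This is only a behavioral sketch under stated assumptions, not the logic of `neorv32_cache.vhd`: the cache has four one-word lines, addresses are word indices, all names (`model_t`, `cpu_load`, `cpu_store`, `model_fence`) are hypothetical, and forwarding the request to the next memory level is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4u  /* illustrative number of cache lines (power of two) */
#define MEM_WORDS  16u /* illustrative size of the backing memory in words */

typedef struct {
  bool     valid, dirty;
  uint32_t tag, data;
} line_t;

typedef struct {
  line_t   line[NUM_BLOCKS];
  uint32_t mem[MEM_WORDS]; /* stands in for the "next higher memory level" */
} model_t;

/* Copy a modified line back to memory (no-op for clean or invalid lines). */
static void write_back(model_t *m, uint32_t idx) {
  line_t *l = &m->line[idx];
  if (l->valid && l->dirty) {
    m->mem[l->tag * NUM_BLOCKS + idx] = l->data;
    l->dirty = false;
  }
}

/* Return the line holding addr (addr < MEM_WORDS); refill it on a miss. */
static line_t *lookup(model_t *m, uint32_t addr) {
  uint32_t idx = addr % NUM_BLOCKS, tag = addr / NUM_BLOCKS;
  line_t *l = &m->line[idx];
  if (!l->valid || l->tag != tag) {
    write_back(m, idx); /* evict the old content first */
    l->valid = true;
    l->dirty = false;
    l->tag   = tag;
    l->data  = m->mem[addr];
  }
  return l;
}

static uint32_t cpu_load(model_t *m, uint32_t addr) {
  return lookup(m, addr)->data;
}

static void cpu_store(model_t *m, uint32_t addr, uint32_t value) {
  line_t *l = lookup(m, addr); /* write-allocate */
  l->data  = value;
  l->dirty = true;             /* write-back: memory is updated later */
}

/* Model of the data fence: flush every line, then invalidate it. */
static void model_fence(model_t *m) {
  for (uint32_t i = 0; i < NUM_BLOCKS; i++) {
    write_back(m, i);
    m->line[i].valid = false;
  }
}
```

In this model a store stays invisible to memory until `model_fence` runs, and a value written to `mem` by another agent stays invisible to `cpu_load` until the cached line has been invalidated, which is the situation the `fence` instruction is meant to resolve.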


<<<
32 changes: 12 additions & 20 deletions docs/datasheet/soc_dcache.adoc
Original file line number Diff line number Diff line change
@@ -1,16 +1,17 @@
<<<
<<<
:sectnums:
==== Processor-Internal Data Cache (dCACHE)

[cols="<3,<3,<4"]
[grid="none"]
|=======================
| Hardware source files: | neorv32_cache.vhd | Generic cache module
| Software driver files: | none | _implicitly used_
| Software driver files: | none |
| Top entity ports: | none |
| Configuration generics: | `DCACHE_EN` | implement processor-internal data cache when `true`
| | `DCACHE_NUM_BLOCKS` | number of cache blocks (pages/lines)
| | `DCACHE_BLOCK_SIZE` | size of a cache block in bytes
| | `DCACHE_NUM_BLOCKS` | number of cache blocks (pages or lines); has to be a power of two
| | `DCACHE_BLOCK_SIZE` | size of a cache block in bytes; has to be a power of two
| CPU interrupts: | none |
|=======================

@@ -21,33 +22,24 @@ The processor features an optional data cache to improve performance when using
access latency. The cache is connected directly to the CPU's data access interface and provides
full-transparent accesses. The cache is direct-mapped and uses "write-allocate" and "write-back" strategies.
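
Since the block count and block size have to be powers of two, a direct-mapped lookup can split an address with simple masks and shifts. The following sketch shows the textbook decomposition into tag, index and offset; the exact bit slicing used inside `neorv32_cache.vhd` is an assumption here, and `cache_addr_t`, `is_pow2` and `split_addr` are hypothetical names.

```c
#include <stdint.h>

typedef struct {
  uint32_t tag;    /* identifies which memory block occupies the line */
  uint32_t index;  /* selects the cache block (line) */
  uint32_t offset; /* byte offset inside the block */
} cache_addr_t;

/* True if x is a non-zero power of two. */
static int is_pow2(uint32_t x) {
  return x != 0u && (x & (x - 1u)) == 0u;
}

/* Split a byte address for a direct-mapped cache.
 * Both num_blocks and block_size (in bytes) must be powers of two. */
static cache_addr_t split_addr(uint32_t addr, uint32_t num_blocks, uint32_t block_size) {
  cache_addr_t r;
  r.offset = addr & (block_size - 1u);
  r.index  = (addr / block_size) & (num_blocks - 1u);
  r.tag    = addr / (block_size * num_blocks);
  return r;
}
```

Two addresses that share an index but differ in their tag compete for the same line, which is the typical source of conflict misses in a direct-mapped organization.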

.Cached/Uncached Accesses
.Uncached Accesses
[NOTE]
The data cache provides direct accesses (= uncached) to memory in order to access memory-mapped IO (like the
processor-internal IO/peripheral modules). All accesses that target the address range from `0xF0000000` to `0xFFFFFFFF`
will not be cached at all (see section <<_address_space>>). Direct/uncached accesses have **lower** priority than
cache block operations to allow continuous burst transfer and also to maintain logical instruction forward
progress / data coherency. Furthermore, the atomic memory operations of the <<_zaamo_isa_extension>> will
always **bypass** the cache.

.Caching Internal Memories
[NOTE]
The data cache is intended to accelerate data access to **processor-external** memories.
The CPU cache(s) should not be implemented when using only processor-internal data and instruction memories.
will not be cached at all (see section <<_address_space>>). Furthermore, the atomic memory operations
of the <<_zaamo_isa_extension>> will always **bypass** the cache.
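
The uncached-access rule from this note can be written as a simple predicate. This is a sketch of the documented behavior only; `UNCACHED_BASE` and `is_uncached` are hypothetical names and the check in hardware may be implemented differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Start of the address range that is never cached (0xF0000000 to 0xFFFFFFFF). */
#define UNCACHED_BASE 0xF0000000u

/* Returns true if an access bypasses the data cache: either it targets the
 * uncached address range or it is an atomic memory operation (Zaamo). */
static bool is_uncached(uint32_t addr, bool is_amo) {
  return is_amo || (addr >= UNCACHED_BASE);
}
```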

.Manual Cache Flush/Clear/Reload
.Manual Cache Flush/Clear/Reload and Memory Coherence
[NOTE]
By executing the `fence` instruction the data cache is flushed, cleared and reloaded.
See section <<_cache_coherency>> for more information.
See section <<_memory_coherence>> for more information.

.Retrieve Cache Configuration from Software
[TIP]
Software can retrieve the cache configuration/layout from the <<_sysinfo_cache_configuration>> register.

.Bus Access Fault Handling
[NOTE]
The cache always loads a complete cache block (aligned to the block size) every time a
cache miss is detected. Each cached word from this block provides a single status bit that indicates if the
according bus access was successful or caused a bus error. Hence, the whole cache block remains valid even
if certain addresses inside caused a bus error. If the CPU accesses any of the faulty cache words, a
data bus error exception is raised.
If the cache encounters a bus error when uploading a modified block to the next memory level or when
downloading a new block from the next memory level, the entire block is invalidated and a bus access
error exception is raised.
32 changes: 12 additions & 20 deletions docs/datasheet/soc_icache.adoc
Original file line number Diff line number Diff line change
@@ -1,16 +1,17 @@
<<<
<<<
:sectnums:
==== Processor-Internal Instruction Cache (iCACHE)

[cols="<3,<3,<4"]
[grid="none"]
|=======================
| Hardware source files: | neorv32_cache.vhd | Generic cache module
| Software driver files: | none | _implicitly used_
| Software driver files: | none |
| Top entity ports: | none |
| Configuration generics: | `ICACHE_EN` | implement processor-internal instruction cache when `true`
| | `ICACHE_NUM_BLOCKS` | number of cache blocks (pages/lines)
| | `ICACHE_BLOCK_SIZE` | size of a cache block in bytes
| | `ICACHE_NUM_BLOCKS` | number of cache blocks (pages or lines); has to be a power of two
| | `ICACHE_BLOCK_SIZE` | size of a cache block in bytes; has to be a power of two
| CPU interrupts: | none |
|=======================

@@ -21,33 +22,24 @@ The processor features an optional instruction cache to improve performance when
access latency. The cache is connected directly to the CPU's instruction fetch interface and provides
full-transparent accesses. The cache is direct-mapped and read-only.

.Cached/Uncached Accesses
.Uncached Accesses
[NOTE]
The data cache provides direct accesses (= uncached) to memory in order to access memory-mapped IO (like the
processor-internal IO/peripheral modules). All accesses that target the address range from `0xF0000000` to `0xFFFFFFFF`
will not be cached at all (see section <<_address_space>>). Direct/uncached accesses have **lower** priority than
cache block operations to allow continuous burst transfer and also to maintain logical instruction forward
progress / data coherency. Furthermore, the atomic memory operations of the <<_zaamo_isa_extension>> will
always **bypass** the cache.

.Caching Internal Memories
[NOTE]
The data cache is intended to accelerate data access to **processor-external** memories.
The CPU cache(s) should not be implemented when using only processor-internal data and instruction memories.
will not be cached at all (see section <<_address_space>>). Furthermore, the atomic memory operations
of the <<_zaamo_isa_extension>> will always **bypass** the cache.

.Manual Cache Clear/Reload
.Manual Cache Flush/Clear/Reload and Memory Coherence
[NOTE]
By executing the `fence.i` instruction the instruction cache is cleared and reloaded.
See section <<_cache_coherency>> for more information.
See section <<_memory_coherence>> for more information.

.Retrieve Cache Configuration from Software
[TIP]
Software can retrieve the cache configuration/layout from the <<_sysinfo_cache_configuration>> register.

.Bus Access Fault Handling
[NOTE]
The cache always loads a complete cache block (aligned to the block size) every time a
cache miss is detected. Each cached word from this block provides a single status bit that indicates if the
according bus access was successful or caused a bus error. Hence, the whole cache block remains valid even
if certain addresses inside caused a bus error. If the CPU accesses any of the faulty cache words, an
instruction bus error exception is raised.
If the cache encounters a bus error when uploading a modified block to the next memory level or when
downloading a new block from the next memory level, the entire block is invalidated and a bus access
error exception is raised.