✨ add memory coherency logic (#1176)

stnolting · Feb 4, 2025 · f56cade
2 parents 15e447d + ecf7cb9
commit f56cade
Showing 33 changed files with 653 additions and 1,047 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -29,6 +29,7 @@ mimpid = 0x01040312 -> Version 01.04.03.12 -> v1.4.3.12
 
 | Date | Version | Comment | Ticket |
 |:----:|:-------:|:--------|:------:|
+| 03.02.2025 | 1.11.0.8 | :sparkles: add explicit memory ordering/coherence support; :warning: remove WDT "halt-on-debug" and "halt-on-sleep" options; :bug: rework cache module fixing several (minor?) design flaws | [#1176](https://github.com/stnolting/neorv32/pull/1176) |
 | 03.02.2025 | 1.11.0.7 | :bug: add missing CFS clock gen enable signal | [#1177](https://github.com/stnolting/neorv32/pull/1177) |
 | 01.02.2025 | 1.11.0.6 | :warning: remove XIP module | [#1175](https://github.com/stnolting/neorv32/pull/1175) |
 | 01.02.2025 | 1.11.0.5 | minor rtl optimizations and cleanups; :warning: remove DMA "fence" feature | [#1174](https://github.com/stnolting/neorv32/pull/1174) |

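Reviewer note: the hunk header above maps `mimpid = 0x01040312` to `v1.4.3.12`. A minimal sketch of that decoding follows, assuming each byte holds two decimal digits (BCD), which is inferred only from that single example. The function name `decode_mimpid` is hypothetical and not part of the NEORV32 software framework.

```python
def decode_mimpid(mimpid: int) -> str:
    """Turn a 32-bit mimpid value into a dotted version string.

    Assumption: every byte is BCD-coded (two decimal digits per byte),
    as suggested by the example 0x01040312 -> v1.4.3.12.
    """
    if not 0 <= mimpid <= 0xFFFFFFFF:
        raise ValueError("mimpid must fit in 32 bits")
    fields = []
    for shift in (24, 16, 8, 0):
        byte = (mimpid >> shift) & 0xFF
        high, low = byte >> 4, byte & 0xF
        if high > 9 or low > 9:
            raise ValueError(f"byte 0x{byte:02x} is not valid BCD")
        fields.append(high * 10 + low)
    return "v" + ".".join(str(field) for field in fields)


print(decode_mimpid(0x01040312))  # prints "v1.4.3.12"
```

Under the same assumption, the version added by this commit (1.11.0.8) would correspond to a `mimpid` value of `0x01110008`.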
diff --git a/docs/datasheet/cpu.adoc b/docs/datasheet/cpu.adoc
@@ -1,3 +1,4 @@
+<<<
 :sectnums:
 == NEORV32 Central Processing Unit (CPU)
 
@@ -66,7 +67,7 @@ direction as seen from the CPU.
 [options="header", grid="rows"]
 |=======================
 | Signal | Width/Type | Dir | Description
-4+^| **Global Signals**
+4+^| **Clock and reset**
 | `clk_i`        | 1           | in  | Global clock line, all registers triggering on rising edge.
 | `rstn_i`       | 1           | in  | Global reset, low-active.
 4+^| **Interrupts (<<_traps_exceptions_and_interrupts>>)**
@@ -75,20 +76,17 @@ direction as seen from the CPU.
 | `mti_i`        | 1           | in  | RISC-V machine timer interrupt.
 | `firq_i`       | 16          | in  | Custom fast interrupt request signals.
 | `dbi_i`        | 1           | in  | Request CPU to halt and enter debug mode (RISC-V <<_on_chip_debugger_ocd>>).
+4+^| **<<_inter_core_communication_icc>> links**
+| `icc_tx_o` | `icc_t`         | out | TX link
+| `icc_rx_i` | `icc_t`         | in  | RX link
 4+^| **Instruction <<_bus_interface>>**
 | `ibus_req_o`   | `bus_req_t` | out | Instruction fetch bus request.
 | `ibus_rsp_i`   | `bus_rsp_t` | in  | Instruction fetch bus response.
 4+^| **Data <<_bus_interface>>**
 | `dbus_req_o`   | `bus_req_t` | out | Data access (load/store) bus request.
 | `dbus_rsp_i`   | `bus_rsp_t` | in  | Data access (load/store) bus response.
-4+^| **<<_inter_core_communication_icc>> TX links**
-| `icc_tx_rdy_o` | 2           | out | Data available for cores `0..1`.
-| `icc_tx_ack_i` | 2           | in  | Read-enable from cores `0..1`.
-| `icc_tx_dat_o` | 2*32        | out | Data for cores `0..1`.
-4+^| **<<_inter_core_communication_icc>> RX links**
-| `icc_rx_rdy_i` | 2           | in  | Data available from cores `0..1`.
-| `icc_rx_ack_o` | 2           | out | Read-enable for cores `0..1`.
-| `icc_rx_dat_i` | 2*32        | in  | Data from cores `0..1`.
+4+^| **<<_memory_coherence>> status**
+| `mem_sync_i`   | 1           | in  | Requested coherence established when set (single-shot)
 |=======================
 
 .Bus Interface Protocol
@@ -424,12 +422,11 @@ always valid when set.
 | `rw`    |     1 | Access direction (`0` = read, `1` = write)
 | `src`   |     1 | Access source (`0` = instruction fetch, `1` = load/store)
 | `priv`  |     1 | Set if privileged (M-mode) access
+| `debug` |     1 | Set if debug mode access
 | `amo`   |     1 | Set if current access is an atomic memory operation (<<_atomic_memory_access>>)
 | `amoop` |     4 | Type of atomic memory operation (<<_atomic_memory_access>>)
 3+^| **Out-Of-Band Signals**
-| `fence` |     1 | Data/instruction fence request; single-shot
-| `sleep` |     1 | Set if ALL upstream devices are in <<_sleep_mode>>
-| `debug` |     1 | Set if the upstream device is in debug-mode
+| `fence` |     1 | Data (load/store; `fence`) or instruction (instruction-fetch; `fence.i`) fence request; single-shot; see <<_memory_coherence>>
 |=======================
 
 .Bus Interface - Response Bus (`bus_rsp_t`)
@@ -463,7 +460,7 @@ The figure below shows three exemplary bus accesses:
 . A write access to address `B_addr` writing `wdata` (fastest response; `ACK` arrives right in the next cycle).
 . A failing read access to address `C_addr` (slow response; `ERR` arrives after several cycles).
 
-.Three Exemplary Bus Transactions (showing only in-band signals)
+.Three Exemplary Bus Transactions (showing only in-band signals; privileged non-debug non-atomic accesses)
 image::bus_interface.png[700]
 
 .Adding Register Stages
@@ -501,8 +498,8 @@ operation:
 
 .Cache Coherency
 [IMPORTANT]
-Atomic operations **always bypass** the CPU caches using direct/uncached accesses. Care must be taken
-to maintain data <<_cache_coherency>>.
+Atomic operations **always bypass** the (CPU) caches using direct/uncached accesses. Care must be taken
+to maintain data synchronization. See section <<_memory_coherence>> for more information.
 
 
 <<<
@@ -632,7 +629,7 @@ The `I` ISA extensions is the base RISC-V integer ISA that is always enabled.
 | Jump/call     | `jal[r]`                                                                  | 6
 | Load/store    | `lb` `lh` `lw` `lbu` `lhu` `sb` `sh` `sw`                                 | 5
 | System        | `ecall` `ebreak`                                                          | 3
-| Data fence    | `fence`                                                                   | 5
+| Data fence    | `fence`                                                                   | depends on the memory system
 | System        | `wfi`                                                                     | 3
 | System        | `mret`                                                                    | 5
 | Illegal inst. | -                                                                         | 3
@@ -641,10 +638,10 @@ The `I` ISA extensions is the base RISC-V integer ISA that is always enabled.
 .`fence` Instruction
 [NOTE]
 Analogous to the `fence.i` instruction (<<_zifencei_isa_extension>>) the `fence` instruction triggers
-a data cache synchronization operation. See section <<_cache_coherency>> for more information.
-Furthermore, the `fence` instruction word's _predecessor_ and _successor_ bits (used for memory ordering)
-are not evaluated by the hardware at all.
-
+a load/store memory synchronization operation. The CPU will stall until the requested coherence is
+established (`mem_sync_i` goes high). See section <<_memory_coherence>> for more information.
+NEORV32 ignores the predecessor and successor fields and always executes a conservative fence on all
+operations.
 
 .`wfi` Instruction
 [NOTE]
@@ -716,16 +713,16 @@ The instruction word's `aq` and `lr` memory ordering bits are not evaluated by t
 ==== `Zifencei` ISA Extension
 
 The `Zifencei` CPU extension allows manual synchronization of the instruction stream. This extension is always enabled.
-
-Analogous to the `fence` instruction the `fence.i` instruction triggers an instruction cache synchronization operation.
-See section <<_cache_coherency>> for more information.
+The `fence.i` instruction is the only standard mechanism to ensure that stores visible to a hart will also be visible to its
+instruction fetches. The CPU will stall until the requested coherence is established (`mem_sync_i` goes high).
+See section <<_memory_coherence>> for more information.
 
 .Instructions and Timing
 [cols="<2,<4,<3"]
 [options="header", grid="rows"]
 |=======================
 | Class | Instructions | Execution cycles
-| Instruction fence | `fence.i` | 5
+| Instruction fence | `fence.i` | depends on the memory system
 |=======================
 
 

diff --git a/docs/datasheet/on_chip_debugger.adoc b/docs/datasheet/on_chip_debugger.adoc
@@ -667,7 +667,7 @@ Debug-mode is entered on any of the following events:
 . A hardware trigger from the <<_trigger_module>> fires (`exe` and `action` in <<_tdata1>> / `mcontrol` are set).
 
 [NOTE]
-From a hardware point of view these debug-mode-entry conditions are special traps (synchronous exceptions or
+From a hardware point of view these debug-mode-entry conditions are normal traps (synchronous exceptions or
 asynchronous interrupts) that are handled transparently by the control logic.
 
 **Whenever the CPU enters debug-mode it performs the following operations:**
@@ -684,6 +684,8 @@ asynchronous interrupts) that are handled transparently by the control logic.
 **When the CPU is in debug-mode:**
 
 * while in debug mode, the CPU executes the parking loop and - if requested by the DM - the program buffer
+* all **caches are bypassed** when in debug-mode; hence, <<_memory_coherence>> has to be re-established when entering debug-mode
+and when leaving debug-mode
 * effective CPU privilege level is `machine` mode; any active physical memory protection (PMP) configuration is bypassed
 * the `wfi` instruction acts as a `nop` (also during single-stepping)
 * if an exception occurs while being in debug mode:

diff --git a/docs/datasheet/overview.adoc b/docs/datasheet/overview.adoc
@@ -1,3 +1,4 @@
+<<<
 :sectnums:
 == Overview
 

diff --git a/docs/datasheet/rationale.adoc b/docs/datasheet/rationale.adoc
@@ -1,3 +1,4 @@
+<<<
 :sectnums:
 === Rationale
 

diff --git a/docs/datasheet/soc.adoc b/docs/datasheet/soc.adoc
@@ -1,5 +1,4 @@
-
-// ####################################################################################################################
+<<<
 :sectnums:
 == NEORV32 Processor (SoC)
 
@@ -595,7 +594,7 @@ content of the addresses memory cell) is sent back to the requesting CPU.
 .Direct Access
 [IMPORTANT]
 Atomic operations **always bypass** the CPU's <<_processor_internal_data_cache_dcache, data cache>>
-using direct/uncached accesses. Care must be taken to maintain data <<_cache_coherency>>.
+using direct/uncached accesses. Care must be taken to maintain data <<_memory_coherence>>.
 
 .Physical Memory Attributes
 [NOTE]
@@ -610,43 +609,50 @@ cannot be interrupted. Hence, they execute in an atomic way.
 
 
 :sectnums:
-==== Cache Coherency
+==== Memory Coherence
 
-In total the NEORV32 Processor provides up to three optional caches organized in two levels. Level-1
-caches are closer to the CPU while level-2 caches are closer to main memory (however, this highly depends
-on the the actual cache configurations).
+Depending on the configuration, the NEORV32 processor provides several _layers_ of memory consisting
+of caches, buffers and storage.
 
+* The CPU instruction prefetch buffer ("level-0")
 * The <<_processor_internal_data_cache_dcache>> (level-1)
 * The <<_processor_internal_instruction_cache_icache>> (level-1)
 * The cache of the <<_processor_external_bus_interface_xbus>> (level-2)
+* Internal and external memories
 
-As all caches operate transparently for the software, special attention must therefore be paid to coherence.
-Note that coherence and cache _synchronization_ is **not** performed by the hardware itself (there is no
-snooping implemented).
+All caches and buffers operate transparently for the software. Hence, special attention must be
+paid to maintain coherence. Note that coherence and cache _synchronization_ are **not** automatically performed
+by the hardware itself as there is no snooping implemented.
 
-The NEORV32 uses two instructions for manual cache synchronization (both instructions are always available
-regardless of the actual CPU/ISA configuration):
+NEORV32 uses two instructions for manual memory synchronization which are always available
+regardless of the actual CPU/ISA configuration:
 
 * `fence` (<<_i_isa_extension>> / <<_e_isa_extension>>)
 * `fence.i` (<<_zifencei_isa_extension>>)
 
-By executing the "data" `fence` instruction the CPU's data cache is synchronized in four steps:
+By executing the "data" `fence` instruction the CPU's load/store operations are ordered
+and synchronized across the entire system:
 
 [start=1]
-. The CPU data cache is flushed: all local modifications are copied to the next higher memory level;
-this can be the XBUS cache or main memory.
-. The CPU data cache is cleared invalidating all local entries.
-. The synchronization request is sent to the next-higher memory level (for example to the XBUS cache
-so it can perform the same synchronization steps).
-. The CPU data cache is reloaded with up-to-date data from the next higher memory level.
+. The CPU data cache (if enabled) is flushed and invalidated: all local modifications are copied to
+the next higher memory level (for example the internal DMEM or the XBUS-cache).
+. The CPU data cache is cleared (invalidated) so the next load/store access will cause a cache miss
+that will fetch up-to-date data from the memory system.
+. The synchronization request is forwarded to the next-higher memory level. If the XBUS cache is implemented
+it will also be flushed and invalidated.
 
-By executing the "instruction" `fence.i` instruction the CPU's instruction cache is synchronized in three steps:
+By executing the "instruction" `fence.i` instruction the CPU's instruction fetches are ordered
+and synchronized across the entire system:
 
 [start=1]
-. The synchronization request is sent to the next-higher memory level (for example to the XBUS cache
-so it can perform the same synchronization steps).
-. The CPU instruction cache is cleared invalidating all local entries.
-. The CPU instruction cache is reloaded with up-to-date data from the next higher memory level.
+. Perform all the steps that are performed by the `fence` instruction.
+. The CPU instruction cache is cleared invalidating all local entries so the next instruction fetch access
+will cause a cache miss that will fetch up-to-date data from the memory system.
+
+.CPU Stall While Synchronizing
+[IMPORTANT]
+Executing any fence instruction will stall the CPU until all the requested ordering/synchronization
+steps are completed.
 
 
 <<<

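Reviewer note: the three `fence` steps described in the new "Memory Coherence" section (flush, invalidate, forward to the next level) can be sketched with a small behavioral model. This is an illustration only, not the logic of `neorv32_cache.vhd`. The class names `Memory` and `WriteBackCache` are hypothetical, and the model tracks single words instead of cache blocks.

```python
class Memory:
    """Plain storage at the end of the memory hierarchy."""

    def __init__(self):
        self.cells = {}

    def load(self, addr):
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        self.cells[addr] = value

    def fence(self):
        pass  # plain memory has nothing to synchronize


class WriteBackCache:
    """Toy write-back, write-allocate cache without snooping."""

    def __init__(self, next_level):
        self.next_level = next_level
        self.lines = {}  # addr -> (value, dirty)

    def load(self, addr):
        if addr not in self.lines:  # miss: fetch from the next level
            self.lines[addr] = (self.next_level.load(addr), False)
        return self.lines[addr][0]

    def store(self, addr, value):
        self.lines[addr] = (value, True)  # keep locally, mark dirty

    def fence(self):
        # Step 1: flush, copy all local modifications to the next level.
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.next_level.store(addr, value)
        # Step 2: invalidate, so the next access misses and refetches.
        self.lines.clear()
        # Step 3: forward the synchronization request to the next level.
        self.next_level.fence()


mem = Memory()
l1 = WriteBackCache(WriteBackCache(mem))  # level-1 cache in front of level-2

l1.store(0x100, 42)
print(mem.load(0x100))  # prints 0, the store is still held in the cache
l1.fence()
print(mem.load(0x100))  # prints 42, the store reached main memory
```

The same model also shows the stale-read case: if another bus master changes `mem` after `l1` has cached an address, `l1.load()` keeps returning the old value until `l1.fence()` is called.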
diff --git a/docs/datasheet/soc_dcache.adoc b/docs/datasheet/soc_dcache.adoc
@@ -1,16 +1,17 @@
 <<<
+<<<
 :sectnums:
 ==== Processor-Internal Data Cache (dCACHE)
 
 [cols="<3,<3,<4"]
 [grid="none"]
 |=======================
 | Hardware source files:  | neorv32_cache.vhd   | Generic cache module
-| Software driver files:  | none                | _implicitly used_
+| Software driver files:  | none                |
 | Top entity ports:       | none                |
 | Configuration generics: | `DCACHE_EN`         | implement processor-internal data cache when `true`
-|                         | `DCACHE_NUM_BLOCKS` | number of cache blocks (pages/lines)
-|                         | `DCACHE_BLOCK_SIZE` | size of a cache block in bytes
+|                         | `DCACHE_NUM_BLOCKS` | number of cache blocks (pages or lines); has to be a power of two
+|                         | `DCACHE_BLOCK_SIZE` | size of a cache block in bytes; has to be a power of two
 | CPU interrupts:         | none |
 |=======================
 
@@ -21,33 +22,24 @@ The processor features an optional data cache to improve performance when using
 access latency. The cache is connected directly to the CPU's data access interface and provides
 full-transparent accesses. The cache is direct-mapped and uses "write-allocate" and "write-back" strategies.
 
-.Cached/Uncached Accesses
+.Uncached Accesses
 [NOTE]
 The data cache provides direct accesses (= uncached) to memory in order to access memory-mapped IO (like the
 processor-internal IO/peripheral modules). All accesses that target the address range from `0xF0000000` to `0xFFFFFFFF`
-will not be cached at all (see section <<_address_space>>). Direct/uncached accesses have **lower** priority than
-cache block operations to allow continuous burst transfer and also to maintain logical instruction forward
-progress / data coherency. Furthermore, the atomic memory operations of the <<_zaamo_isa_extension>> will
-always **bypass** the cache.
-
-.Caching Internal Memories
-[NOTE]
-The data cache is intended to accelerate data access to **processor-external** memories.
-The CPU cache(s) should not be implemented when using only processor-internal data and instruction memories.
+will not be cached at all (see section <<_address_space>>). Furthermore, the atomic memory operations
+of the <<_zaamo_isa_extension>> will always **bypass** the cache.
 
-.Manual Cache Flush/Clear/Reload
+.Manual Cache Flush/Clear/Reload and Memory Coherence
 [NOTE]
 By executing the `fence` instruction the data cache is flushed, cleared and reloaded.
-See section <<_cache_coherency>> for more information.
+See section <<_memory_coherence>> for more information.
 
 .Retrieve Cache Configuration from Software
 [TIP]
 Software can retrieve the cache configuration/layout from the <<_sysinfo_cache_configuration>> register.
 
 .Bus Access Fault Handling
 [NOTE]
-The cache always loads a complete cache block (aligned to the block size) every time a
-cache miss is detected. Each cached word from this block provides a single status bit that indicates if the
-according bus access was successful or caused a bus error. Hence, the whole cache block remains valid even
-if certain addresses inside caused a bus error. If the CPU accesses any of the faulty cache words, a
-data bus error exception is raised.
+If the cache encounters a bus error when uploading a modified block to the next memory level or when
+downloading a new block from the next memory level, the entire block is invalidated and a bus access
+error exception is raised.
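Reviewer note: the power-of-two requirement on `DCACHE_NUM_BLOCKS` and `DCACHE_BLOCK_SIZE` and the uncached range `0xF0000000` to `0xFFFFFFFF` can be checked with a few lines. The sketch below uses the textbook tag/index/offset split of a direct-mapped cache. Whether `neorv32_cache.vhd` slices the address exactly this way is an assumption, and all function names as well as the example geometry (4 blocks of 64 bytes) are hypothetical.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for zero, negatives and everything else."""
    return n > 0 and (n & (n - 1)) == 0


def is_uncached(addr: int) -> bool:
    """Accesses to 0xF0000000..0xFFFFFFFF are never cached (per the datasheet text)."""
    return 0xF0000000 <= addr <= 0xFFFFFFFF


def split_address(addr: int, num_blocks: int, block_size: int):
    """Split an address into (tag, index, offset) for a direct-mapped cache.

    Assumption: textbook decomposition; power-of-two sizes make the
    split a matter of plain bit slicing.
    """
    if not (is_power_of_two(num_blocks) and is_power_of_two(block_size)):
        raise ValueError("num_blocks and block_size have to be powers of two")
    offset_bits = block_size.bit_length() - 1
    index_bits = num_blocks.bit_length() - 1
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (num_blocks - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Hypothetical geometry: 4 blocks of 64 bytes each.
tag, index, offset = split_address(0x80001234, num_blocks=4, block_size=64)
print(hex(tag), index, hex(offset))  # prints "0x800012 0 0x34"
print(is_uncached(0xFFFF0000))       # prints "True"
```

With sizes that are not powers of two the index and offset could not be taken directly from address bits, which is one plausible reason for the constraint added to the generics table.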
diff --git a/docs/datasheet/soc_icache.adoc b/docs/datasheet/soc_icache.adoc
@@ -1,16 +1,17 @@
 <<<
+<<<
 :sectnums:
 ==== Processor-Internal Instruction Cache (iCACHE)
 
 [cols="<3,<3,<4"]
 [grid="none"]
 |=======================
 | Hardware source files:  | neorv32_cache.vhd   | Generic cache module
-| Software driver files:  | none                | _implicitly used_
+| Software driver files:  | none                | 
 | Top entity ports:       | none                |
 | Configuration generics: | `ICACHE_EN`         | implement processor-internal instruction cache when `true`
-|                         | `ICACHE_NUM_BLOCKS` | number of cache blocks (pages/lines)
-|                         | `ICACHE_BLOCK_SIZE` | size of a cache block in bytes
+|                         | `ICACHE_NUM_BLOCKS` | number of cache blocks (pages or lines); has to be a power of two
+|                         | `ICACHE_BLOCK_SIZE` | size of a cache block in bytes; has to be a power of two
 | CPU interrupts:         | none |
 |=======================
 
@@ -21,33 +22,24 @@ The processor features an optional instruction cache to improve performance when
 access latency. The cache is connected directly to the CPU's instruction fetch interface and provides
 full-transparent accesses. The cache is direct-mapped and read-only.
 
-.Cached/Uncached Accesses
+.Uncached Accesses
 [NOTE]
 The data cache provides direct accesses (= uncached) to memory in order to access memory-mapped IO (like the
 processor-internal IO/peripheral modules). All accesses that target the address range from `0xF0000000` to `0xFFFFFFFF`
-will not be cached at all (see section <<_address_space>>). Direct/uncached accesses have **lower** priority than
-cache block operations to allow continuous burst transfer and also to maintain logical instruction forward
-progress / data coherency. Furthermore, the atomic memory operations of the <<_zaamo_isa_extension>> will
-always **bypass** the cache.
-
-.Caching Internal Memories
-[NOTE]
-The data cache is intended to accelerate data access to **processor-external** memories.
-The CPU cache(s) should not be implemented when using only processor-internal data and instruction memories.
+will not be cached at all (see section <<_address_space>>). Furthermore, the atomic memory operations
+of the <<_zaamo_isa_extension>> will always **bypass** the cache.
 
-.Manual Cache Clear/Reload
+.Manual Cache Flush/Clear/Reload and Memory Coherence
 [NOTE]
 By executing the `fence.i` instruction the instruction cache is cleared and reloaded.
-See section <<_cache_coherency>> for more information.
+See section <<_memory_coherence>> for more information.
 
 .Retrieve Cache Configuration from Software
 [TIP]
 Software can retrieve the cache configuration/layout from the <<_sysinfo_cache_configuration>> register.
 
 .Bus Access Fault Handling
 [NOTE]
-The cache always loads a complete cache block (aligned to the block size) every time a
-cache miss is detected. Each cached word from this block provides a single status bit that indicates if the
-according bus access was successful or caused a bus error. Hence, the whole cache block remains valid even
-if certain addresses inside caused a bus error. If the CPU accesses any of the faulty cache words, an
-instruction bus error exception is raised.
+If the cache encounters a bus error when uploading a modified block to the next memory level or when
+downloading a new block from the next memory level, the entire block is invalidated and a bus access
+error exception is raised.