Skip to content

Latest commit

 

History

History
1025 lines (780 loc) · 48.6 KB

TCB.md

File metadata and controls

1025 lines (780 loc) · 48.6 KB

Tightly Coupled Bus

Tightly Coupled Bus is a simple general purpose system bus based on FPGA/ASIC SRAM memory interfaces.

Introduction

The idea and name comes from tightly coupled memories, which require a simple interface to avoid complexity impact on timing and area, and at the same time the full memory interface throughput must be achievable.

This are few alternative naming for tightly coupled memories:

A processor native system bus is usually custom designed to support exactly the features that are present in the processor itself. This also means there are differences between the protocols used by instruction fetch and load/store unit.

The TCB protocol is designed to fulfill the shared needs of simple CPU/SoC designs and can be used for:

  • CPU instruction fetch interface,
  • CPU/DMA load/store interface,
  • simple cache hierarchies,
  • SoC interconnect (crossbar),
  • SoC peripheral interface.

The design is based on the following principles:

  • Intended for closely coupled memories and caches, and therefore based on synchronous/static memory (SRAM) interfaces.
  • Minimize latency and maximize throughput.
  • Support pipelining for both writes and reads to minimize stalling overhead. Meaning the handshake is done during the arbitration phase (explained later).
  • Handshake based on the AMBA AXI family of protocols (VALID/READY).
  • Low power consumption should be considered by reducing signal toggling and other means.
  • Choice of common (half-duplex) or independent (full-duplex) read/write channels, allowing additional trade-off between throughput and complexity.

What it is not intended for:

  • It is not optimized for clock domain crossing (CDC), which has a large delay (latency) between the start of a request and the response, and the delay has some unpredictability.
  • Does not provide out of order access functionality.
  • It is not a good fit for managers with a variable pipeline delay in the load/store unit.

Terminology

TCB terminology and syntax is mostly based on:

  • AMBA AXI family of protocols,
  • Verilog/SystemVerilog HDL language.

Interconnect terms

module short description
manager man Managers are modules driving requests toward a subordinate and receiving a response from it. This term is equivalent to master.
subordinate sub Subordinates are module receiving requests from a manager and responding to it. This term is equivalent to slave.
monitor mon Monitors do not drive any protocol signals, they only observe them for error checking, statistics and logging.

Transfer level terms (protocol timing)

term description
clock period The term clock period is preferred over clock cycle to avoid confusion with access cycle which can be multiple clock periods long.
handshake Exchange of valid and ready signals during between manager and subordinate.
cycle A request cycle is one or more clock periods long exchange between a master and a subordinate governed by a valid/ready handshake, and it ends with a transfer. A response cycle starts DLY clock periods after the request cycle and has the same length.
transfer Each access cycle ends in a single clock period long transfer when valid and ready handshake signals are both active.
request The collective value of signals (address, write enable, byte enable, write data) driven by a manager, while valid is active during an access cycle. And sampled by a subordinate during a transfer.
response In the current protocol version, a response is a single clock period delayed by a fixed number of clock periods from each transfer, in it read data and error status are driven by a subordinate and sampled by a manager. Future versions of the protocol might have responses encompassing multiple clock periods.
backpressure A subordinate can delay the transfer by driving the ready signal low.
back-to-back Performing transfers continuously in each clock period, without idling the bus by waiting the for a response before issuing e new request.

Transaction level terms

A transaction is the atomic exchange of a desired data length requiring one or more transfers. The following words can be used to describe a transaction. TODO: check for a TLM definition.

descriptor description
write Used for CPU store operations.
read Used for CPU load operations.
aligned Address and transaction size or byte enable signals follow CPU ISA alignment rules.
misaligned A misaligned transaction can be a single misaligned transfer or be split into multiple aligned transfers.
split A transaction with a size exceeding the data bus width can be split into multiple transfers.
atomic In addition to CPU ISA atomic instructions, atomicity is desired in split transactions.
burst Bursts are intended for communication between cache levels and high latency memories.

While the bus width and transaction sizes are not limited to a finite set, the the following transaction sizes also have names.

size description
byte 8-bit wide data.
half 16-bit wide data.
word 32-bit wide data.
double 64-bit wide data.
long 128-bit wide data.

Peripheral driver terms

Parameters are used in HDL code. Terms quasi-static, dynamic and volatile are used to describe properties of configuration, control and status registers of a peripheral.

term Description
parameter Static (compile time) configuration of a HDL/RTL module, parameter in Verilog or generic in VHDL.
quasi-static Can be driven at runtime during initialization, but is static (not changing) during system operation.
dynamic Can be driven at runtime during system operation, is expected to change.
volatile Can change at runtime during system operation.
configuration Peripheral register/field containing configuration information, they are usually quasi-static, never volatile.
control Peripheral register/field used to control system operation at runtime, they are dynamic signals.
status Peripheral register/field used to monitor system operation at runtime, they are volatile signals.

Acronyms

acronym definition
TCB Tightly Coupled Bus
CRW Common Read/Write channel
IRW Independent Read/Write channels
RDC ReaD Channel
WRC WRite Channel
BFM Bus Functional Model
TLM Transaction-level modeling
LSB Least Significant Bit/Byte
MSB Most Significant Bit/Byte

Naming conventions

Mostly for aesthetic reasons (vertical alignment) all signal and names are three-letter abbreviations (TLA).

Suffixes specifying the direction of module ports as input/output (in/out, i/o) can be avoided. Instead signals can be organized into sets with a prefix or are grouped into a SystemVerilog interface. Set names shall use specifiers like manager/subordinate (man/sub) or request/response (req/rsp).

Base protocol

The TCB protocol base is comprised of a valid/ready handshake for the request and a parameterized fixed delay (integer number of clock periods) for the response. Special considerations should be made for signal values during reset and reset release and assertion.

Manager, subordinate and monitor

System signals clock and reset

System signals are propagated globally from a system controller to managers and subordinates. Implementations with separate clock/reset/power domains can have multiple independent system signal sets.

signal description
clk Clock (active on rising edge).
rst Reset (active high) can be synchronous or asynchronous depending on implementation.

TODO: define power domain functionality.

Handshake amd request/response signal sets

The manager initiates a request with the handshake signal vld (valid). Backpressure from the subordinate is supported by the handshake signal rdy (ready).

If no backpressure conditions are possible, the rdy signal can be omitted, and the manager shall interpret it as always being active (rdy==1'b1).

NOTE: The handshake signals intentionally use names from the AMBA AXI family of protocols, since the handshake is governed by compatible (equivalent) rules. Otherwise the TCB protocol bears no relation to AMBA.

signal direction description
vld man -> sub Handshake valid.
rdy sub -> man Handshake ready (can be omitted if there is no backpressure).

Signals going from manager to subordinate are part of the request set, signals going in the opposite direction are part of the response set. This signal sets are used to provide transaction type details, addressing and data.

signal direction description
req man -> sub Request set.
rsp sub -> man Response set.

While the handshake defines the request transfer, the response is always provided DLY clock periods after the handshake transfer.

parameter type description
DLY int unsigned Response delay.

Handshake rules

Handshake signals shall follow the same basic principles as defined for the AMBA AXI family of protocols:

  • vld shell be inactive during reset.
  • While valid is not active all other signals shall be ignored (X in timing/waveform diagrams).
  • Once the manager asserts vld, it must not remove it till the cycle is completed by an active rdy signal.
  • The manager must not wait for rdy to be asserted before starting a new cycle by asserting vld.
  • The subordinate can assert/remove the rdy signal without restrictions.
  • There is no inherent timeout mechanism.
  • TODO: clarify rdy behavior if only part of the system is under reset.

This means once a request cycle is initiated, it must be completed with a transfer. Since rdy can be asserted during reset (rdy can be a constant value), vld must not be asserted, since this would indicate transfers while in reset state. Since the subordinate is allowed to wait for vld before asserting rdy (no restrictions), the manager shall not wait for rdy before asserting vld, since this could result in a lockup or a combinational loop.

There is no integrated timeout abort mechanism, although it would be possible to place such functionality into a module placed between a manager and a subordinate. The required additional complexity is not discussed in this document.

Transfer and request/response sequence

The manager shall drive a valid request signal set req while the vld handshake signal is active. The subordinate shell sample the request signal set req at the rising clock edge while both vld and rdy handshake signals are active indicating a transfer trn (local signal).

When the delay parameter is zero (DLY=0), the subordinate shall provide the response rsp combinationally in the same clock period as the transfer trn is active. When the delay parameter is greater then zero (DLY>0), the subordinate shall provide the response rsp sequentially in DLY clock periods after the transfer trn is active.

Handshake transfer and request/response

Reset release and assertion sequences

A global system reset rst can be asserted at any moment, as long as it applies to the entire interconnect and all managers/subordinates connected to it.

TODO: A correct reset assertion sequence for just part of the system separated into multiple clock/reset/power domains is explained separately in the reference interconnect library documentation.

The handshake valid vld must be inactive during reset. After the reset signal rst is released there must be at least one clock period before vld can be asserted. The handshake ready signal can be active or inactive during reset, but it is not allowed to toggle. After the reset signal rst is released there must be at least one clock period before rdy can toggle.

This timing is based on the assumption that reset is not used as a normal combinational signal. In this case the vld signal depends on a register toggling after reset is released, and this can only happen with the described timing. The same explanation stands for rdy if it is not a constant value.

Reset sequence

Reset sequence length

Ideally all devices would require the reset to be active for only a single clock period. Long (multiple clock periods) reset sequences are sometimes required so that reset values can propagate through flipflops without reset. If a device requires a longer active reset, this must be documented. A global reset shall be applied for the longest sequence required by eny devices in the same domain. Requiring long active reset sequences just in case should be avoided, the exact required reset sequence length shall be derived from the RTL.

Sequential logic without reset

It is allowed to use reset capable flipflops only for control signals (handshake signals in TCB), while address, data and other signals use flipflops without a reset for example to reduce ASIC area. While this approach does not affect functionality, it affects reproducibility of power consumption tests. It might also have some effect on the viability of side channel attacks.

Memory mapped access protocol

For the protocol to support memories and memory mapped peripherals, the request and response signal sets must be further defined to contain the read/write control signal, the address, byte enable, read/write data busses, and various optional extensions.

Parameters

All TCB interfaces are parameterized. In addition to the base protocol parameter DLY there are parameters for:

  • defining the address/data/... signal widths,
  • defining how data bytes are packed into the data bus.

Signal width parameters

parameter default type description
PHY.SLW 8 int unsigned Selection width (in most cases it should be 8, the size of a byte).
PHY.ABW 32 int unsigned Address bus width.
PHY.DBW 32 int unsigned Data bus width.
PHY_BEW DBW/SLW int unsigned Byte enable width is the number of selection width units fitting into the data width.

The selection width parameter SLW defines the number of bits in a byte, for all standard use cases this defaults to 8. TODO: research use cases where SLW is not the default.

There are few restrictions on the address bus width ABW. Sometimes the size of the RISC-V load/store immediate (12-bit) is relevant. Similarly ARM defines a 12-bit memory management page size.

Since TCB was designed with 32-bit CPU/SoC/peripherals in mind, 32-bit is the default data bus width DBW and 4-bit is the default byte enable width BEW. Byte enable width BEW is a calculated local parameter, it should not passed across module hierarchy.

Data packing parameters

Data packing parameters are listed and described in the data packing section of the document.

Custom extension signal type parameters

Data types for custom extension signals are listed here without details. Further in the document there are definitions for some standard configurations.

parameter description
tcb_req_cmd_t Custom request command signal cmd type.
tcb_rsp_sts_t Custom response status signal sts type.

Signals

Most signals are designed to directly interface with ASIC/FPGA SRAM memories:

  • address adr,
  • write enable wen and byte enable ben,
  • write data wdt and read data rdt.
signal width description
req.cmd custom Custom request command protocol extensions.
req.ndn 1 Read/write data endianness. Only used in reference mode.
req.wen 1 Write enable.
req.ren 1 Read enable (only used in full-duplex channel configuration).
req.adr ABW Address.
req.siz SIZ* Transfer size. Only used in reference mode.
req.ben BEW Byte enable/select. Only used in memory mode.
req.wdt DBW Write data.
rsp.rdt DBW Read data.
rsp.sts custom Custom response status protocol extensions.

The custom protocol extension signals, request command cmd and response status sts, do not directly affect the content of the data transfer. They are described in the next section.

Since bi-endianness support is an important part of the TCB protocol, the endianness selection signal ndn is listed prominently. It is a dynamic property of each data transfer and therefore a signal and not a parameter.

For an interface modelled over a memory interface, read enable ren is not accessible by the used, internally it is assigned the negated value of write enable wen.

The transfer size signal siz is an alternative signal to byte enable ben. The choice between the two alternatives is described in the data packing section.

(*) How exactly the width of the siz signal depends on the SIZ parameter is described in the data packing section.

Custom protocol extension signals

TODO: Custom protocol extension signals are still in the draft stage.

The request command signals cmd are used to:

  • extend the protocol into multi transfer transactions and
  • to provide performance (latency, power, ...) optimizations.
signal width description
req.cmd.lck 1 Arbitration lock.
req.cmd.rpt 1 Repeat address access.
req.cmd.inc 1 Incrementing address access.

The arbitration lock lck is used to implement atomic accesses by combining multiple transfers into a single transaction:

  • split transaction misaligned access,
  • transactions larger than data bus/transfer size,
  • uninterruptible burst transactions,
  • ...

NOTE: The lock signal lck has a similar functionality to AXI-Stream LAST signal, but with an inverted active state (lck = ~LAST). While with AXI-Stream the common case are long packets ending with a LAST pulse, for a system bus single transfer transactions are more common than large transactions. The lck signal polarity is selected to be inactive by default.

The repeat address access rpt is used to reduce power consumption on repeated read accesses to the same address. The incrementing address access inc is used to tell prefetch mechanisms whether the address is the expected one.

The response status error signal err is used for handling error conditions:

  • access to inactive subsystem with clock/power gating support,
  • address decoder errors while accessing undefined regions,
  • unsupported transfer size/alignment.
signal width usage description
rsp.sts.err 1 optional Error response (can be omitted if there are no error conditions).

Various implementations can add custom (user defined) signals to either the request or response, some examples of custom signals would be:

  • cache related signals,
  • burst support,
  • quality of service signals,
  • multiple types of error responses,
  • ...

Optional signal subsets and defaults

Custom implementations can use a subset of the full signal list. Some rules are provided for handling the missing signals.

ROM would be an example of a device which only requires the read data bus. When constructing subsets, please consider other protocols (AXI-Stream, ...) which might be more appropriate.

To connect a managers and a subordinates with a differing set of optional signals, an adapter is needed which would provide:

  • a default for outputs and
  • a handler for inputs. The output default shall be chosen to match the protocol subset. (wen=1'b0 and wdt=DBW'bx for ROM). The input handler can either ignore the signal or cause an error condition. Default output values can always be ignored by an input handler, or simply no handler is needed.

The following table defines some defaults and handlers.

usecase signal default handler
interconnect req.cmd 'b0 Subordinates can ignore it.
ROM req.wen 1'b0 Respond with error on write access to subordinate without write support.
ROM req.wdt DBW'bx Can be ignored, wen requires handling.
peripheral req.ben BEW'b1 Access with less than the full width shall trigger an error.
interconnect rsp.sts 'b0 Can be ignored, if no error conditions are possible, otherwise requires and external handler (watchdog, ...).

The custom request command also has sensible defaults.

signal default handler
req.cmd.lck 1'b0 If another manager can access the same segment, respond with error, otherwise ignore.
req.cmd.rpt 1'b0 Subordinates can ignore it.
req.cmd.inc 1'b0 Subordinates can ignore it.

Data packing

A combination of parameters and runtime signals defines how bytes (smallest data units) are organized inside the read/write data bus, and across transfers for multi transfer transactions.

To a degree data packing rules are a generalization of endianness rules.

This section will first document the parameters and then provide examples of packing with some parameter configurations.

Data packing parameters

The following parameters affect data packing.

parameter default type (enumeration) description
PHY.ALW clog2(DBW/SLW) int unsigned Alignment width, number of least significant address bits which are zero.
PHY.SIZ LOGARITHMIC tcb_par_size_t Transfer size encoding, logarithmic or linear.
PHY.MOD REFERENCE tcb_par_mode_t Data position mode.
PHY.ORD DESCENDING tcb_par_order_t Byte order, ascending or descending.

Only a subset of 4 configurations from all parameter combinations results in practical and useful data packing rule. The rest are reserved with no intention to be documented and implemented.

MOD ORD ALW ndn description
REFERENCE DESCENDING any ignored Packing used by CPU registers.
REFERENCE ASCENDING any ignored Reserved, not used.
MEMORY DESCENDING 0 both RISC-V memory model with misaligned access support.
MEMORY DESCENDING clog2(BEW) both RISC-V memory model with only aligned accesses supported.
MEMORY ASCENDING 0 both Reserved, not used.
MEMORY ASCENDING clog2(BEW) both OpenPOWER storage operands.

The reference mode is a new concept added to TCB.

For contemporary useful configurations the RISC-V ISA is used as a reference.

The OpenPOWER specific configuration is included for historic compatibility, and completeness.

Alignment width

Alignment width ALW defines what kind of data alignments are supported. The values can be between 0 (no alignment requirements) and clog2(BEW) (full alignment is required). Only this two values are documented, other values in between can be used for custom implementations.

Transfer size encoding

The SIZ parameter encoding defines the following options.

  • LOGARITHMIC,
  • LINEAR. The parameter defines the encoding of the transfer size signal siz, and it defines restrictions on the valid encodings of the byte enable signal ben.

The default and backward compatible option is LOGARITHMIC. With this option, the transfer size is limited to a power of two of bytes.

The translation between the transfer size signal siz and the number of bytes being transferred is num=2**siz.

siz num. size description
'd0 1 byte 8-bit wide data.
'd1 2 half 16-bit wide data.
'd2 4 word 32-bit wide data.
'd3 8 double 64-bit wide data.
'd4 16 long 128-bit wide data.

The width of the transfer size signal siz in the logarithmic case is clog2(clog2(DBW/SLW))==clog2(clog2(BEW)).

NOTE: The linear option is an experimental proposal and is not yet compatible with any other standard or implementation.

A more generic option LINEAR allows for any number of bytes up to the data bus width to be transferred. In this case the number of transferred bytes is num=siz+1.

The width of the transfer size signal siz in the linear case is clog2(DBW/SLW)==clog2(BEW).

The byte enable signal ben is restricted in both the logarithmic and linear cases to only number long sequences of adjacent active bits (representing data bytes), with all other bits inactive. The same restriction for the number applies as for the transfer size signal siz.

How active bytes are aligned inside the entire data bus is further defined by other parameters ALW/MOD/ORD, the address adr and the endianness signal ndn.

Data position mode

The MOD parameter encoding defines the following options.

  • MEMORY,
  • REFERENCE. The name reference is based on the idea, that if a monitor was placed on multiple points of a mixed configuration interconnect, all data would be translated to a common reference before being compared.

The MEMORY mode defines the same data packing scheme as memories.

In memory mode the the byte enable signal ben provides the information about the transfer size, which is the number of active bits in the ben vector.

The REFERENCE mode is based on how ISAs define the placement of byte/half/word/double into its general purpose registers. In registers data of any size is always stored aligned to the right. In reference mode data is always aligned to the right, regardless of the address, address alignment, endianness, ...

In reference mode the transfer size signal siz provides the information about the transfer size.

The main purpose of this mode is to connect peripherals to the CPU or DMA. without the need for byte reordering logic between the two.

Another use case would be a RISV-V instruction fetch interface with C extension support, where the instruction is always aligned the same way, regardless on whether the instruction is 32-bit or 16-bit aligned in the memory. In this case a multiplexer for aligning the instruction would still be needed, but it would be placed in the interconnect instead of the CPU.

Byte order

The ORD parameter encoding defines the following options.

  • DESCENDING,
  • ASCENDING.

Almost all modern standards and HDL/schematic implementations use the DESCENDING order. Here indexing starts with 0 on the right side and increments to the left side of the vector. When writing bit vectors and equivalent packed byte arrays in SystemVerilog:

logic     [31:0] data_bit_vector;
logic [3:0][7:0] data_byte_array;

Byte addressing follows the same rules so it increments from the right to the left.

The ASCENDING order was prominently used in the OpenPOWER specification and its big endian predecessors. Here indexing starts with 0 on the left side and increments to the right side of the vector. When writing bit vectors and equivalent packed byte arrays in SystemVerilog:

logic     [0:31] data_bit_vector;
logic [0:3][0:7] data_byte_array;

Byte addressing follows the same rules so it increments from the left to the right.

Due to the current prevalence of descending indexing order and little-endian ISAs, it can be difficult and confusing to understand big endian (bi-endian) compatibility. A few reasons that aggravate the confusion:

  • while OpenPOWER defines all 64-bit registers with ascending order [0:63], a load/store byte operation would place the byte in the register aligned to the right [56:63],
  • on OpenPOWER the least significant bit of the program counter or address pointer is [63],
  • not all native big-endian ISAs use the ascending order,
  • early bi-endian approaches differ from moderns ones.

Modern OpenPOWER implementations use ascending order in the core to match the specification, but use descending order on the system bus, which is usually AMBA AXI based. The only practical use case for ascending order would probably be while interfacing with historic hardware.

Channel configuration

The CHN parameter is used to configure channel read/write capabilities.

parameter default type (enumeration) description
PHY.CHN COMMON_HALF_DUPLEX tcb_par_channel_t Channel configuration.

The following configurations are defined,

value wen ren wdt rdt description
COMMON_HALF_DUPLEX wen ~wen used used Each transfer can only enable either read or write data.
COMMON_FULL_DUPLEX wen ren used used The address is common read/write data can be controlled independently.
INDEPENDENT_WRITE 1'b1 1'b0 used unused Write data is always enabled, read data is unused.
INDEPENDENT_READ 1'b0 1'b1 unused used Write data is unused, read data is always enabled.

Common half duplex

This is the common approach based on a SRAM memory interface. The address is shared between read/write operations and each transfer can only be either a read or a write, controlled by the write enable wen signal.

wen description
1'b0 Write request.
1'b1 Read request.

In this channel configuration the read enable signal ren is not used, internally can be assigned the negated value of write enable ren.

assign ren = ~wen;

Common full duplex

Use cases:

  • data swap,
  • control request returning status before request,
  • control request returning instant same clock period feedback.

The data swap operation can be used between a CPU GPR and a memory mapped register.

The control request use cases are similar to what RISC-V ISA Zicsr instructions do.

The case where read and write enable signals are both inactive during a transfer is reserved.

Independent read/write

The read and write operations are separated into independent channels. The main purpose is to provide full-duplex access to independent addresses. One advantage this approach provides is reduced power consumption in peripherals, since during a write access the read data decoder and multiplexer are not active and during a read access the write data enable decoder is not active.

Examples

Data packing examples

All provided examples are configured for:

  • logarithmic size SIZ=LOGARITHMIC,
  • descending order ORD=DESCENDING. Examples are given for the next data packing configurations:
  • reference mode, fixed of variable size transfers with and without misaligned access support,
  • memory mode, with and without misaligned access support, for both little and big endianness.

The examples list all supported read/write transfers in a table. Unsupported transfers can be handles by ignoring the request and responding with an error. Alternatively unsupported transfers can just cause undefined behavior.

Reference mode

Examples for the following reference mode configurations are provided:

  • data bus width sized transfers with size aligned address,
  • any size transfers with size aligned address,
  • any size transfers with no alignment restrictions address,
  • instruction fetch for RISC-V with C extension.
Fixed data width

It is common to only allow full data bus width and aligned transfers when accessing peripherals. This case would specify the following parameter values and signal restrictions:

  • reference mode MOD=REFERENCE,
  • full alignment required ALW=$clog2(DBW/SLW)=clog2(BEW)
  • transfer size equal to data bus width siz==$clog2(ALW),
  • aligned address to data bus width adr[ALW-1:0]=='0,
  • the transfer endianness ndn is ignored.

The following table lists such transfers for a 32-bit data bus.

size adr[1:0] siz[1:0] wdt[31:00]/rdt[31:00]
word 2'd0 2'd2 {[31:24], [23:16], [15:08], [07:00]}
Variable data width, aligned

If transfer size restrictions are relaxed down to a single byte, small registers can be arranged into a more compact structure, thus reducing the address space. This case would specify the following parameter values and signal restrictions:

  • reference mode MOD=REFERENCE,
  • full alignment required ALW=$clog2(DBW/SLW)=clog2(BEW)
  • transfer size from byte to data bus width 0<=siz<=$clog2(ALW),
  • address aligned to transfer size adr[siz-1:0]=='0,
  • the transfer endianness ndn is ignored.

The following table lists such transfers for a 32-bit data bus.

size adr[1:0] siz[1:0] wdt[31:00]/rdt[31:00]
byte 2'd0 2'd0 { 8'bXX, 8'bXX, 8'bXX, [07:00]}
byte 2'd1 2'd0 { 8'bXX, 8'bXX, 8'bXX, [07:00]}
byte 2'd2 2'd0 { 8'bXX, 8'bXX, 8'bXX, [07:00]}
byte 2'd3 2'd0 { 8'bXX, 8'bXX, 8'bXX, [07:00]}
half 2'd0 2'd1 { 8'bXX, 8'bXX, [15:08], [07:00]}
half 2'd2 2'd1 { 8'bXX, 8'bXX, [15:08], [07:00]}
word 2'd0 2'd2 {[31:24], [23:16], [15:08], [07:00]}

Such a configuration is also appropriate for a load/store CPU interface, since it covers all aligned memory accesses. An actual connection to a memory would require a conversion module from REFERENCE to MEMORY mode, such a conversion module would have to also handle the endianness signal ndn.

A further generalization would entirely remove the alignment restriction to enable access to memories which support unaligned accesses.

Variable data width, misalignment support

This case would specify the following parameter values and signal restrictions:

  • reference mode MOD=REFERENCE,
  • relaxed alignment ALW=0
  • transfer size from byte to data bus width 0<=siz<=$clog2(ALW),
  • address aligned to transfer size adr[siz-1:0]=='0,
  • the transfer endianness ndn is ignored.

The following table lists such transfers for a 32-bit data bus.

size alignment adr[1:0] siz[1:0] wdt[31:00]/rdt[31:00]
byte aligned 2'd0 2'd0 { 8'bXX, 8'bXX, 8'bXX, [07:00]}
byte aligned 2'd1 2'd0 { 8'bXX, 8'bXX, 8'bXX, [07:00]}
byte aligned 2'd2 2'd0 { 8'bXX, 8'bXX, 8'bXX, [07:00]}
byte aligned 2'd3 2'd0 { 8'bXX, 8'bXX, 8'bXX, [07:00]}
half aligned 2'd0 2'd1 { 8'bXX, 8'bXX, [15:08], [07:00]}
half misaligned 2'd1 2'd1 { 8'bXX, 8'bXX, [15:08], [07:00]}
half aligned 2'd2 2'd1 { 8'bXX, 8'bXX, [15:08], [07:00]}
half misaligned 2'd3 2'd1 { 8'bXX, 8'bXX, [15:08], [07:00]}
word aligned 2'd0 2'd2 {[31:24], [23:16], [15:08], [07:00]}
word misaligned 2'd1 2'd2 {[31:24], [23:16], [15:08], [07:00]}
word misaligned 2'd2 2'd2 {[31:24], [23:16], [15:08], [07:00]}
word misaligned 2'd3 2'd2 {[31:24], [23:16], [15:08], [07:00]}
RISC-V with C extension instruction fetch

This case would specify the following parameter values and signal restrictions:

  • reference mode MOD=REFERENCE,
  • relaxed alignment ALW=1
  • always attempt to fetch a 32-bit instruction siz=2'd2,
  • address aligned to transfer size adr[0]==1'b0,
  • only little endian support ndn=1'b0.

The following table lists such transfers.

size alignment adr[1:0] siz[1:0] wdt[31:00]/rdt[31:00]
word aligned 2'd0 2'd2 {[31:24], [23:16], [15:08], [07:00]}
word misaligned 2'd2 2'd2 {[31:24], [23:16], [15:08], [07:00]}

Memory mode

Examples for the following memory mode configurations are provided:

  • any size transfers with size aligned address,
  • any size transfers with no alignment restrictions address. Both configurations are documented for big and little endianness.

The configuration with data bus width sized transfers with size aligned address, is functionally identical to the reference mode with the same configuration.

Endianness and data alignment

The following table defines when an access is aligned depending on data transfer size and byte address LSB bits.

transfer size condition
byte (8-bit) none
half (16-bit) $clog2(adr[0:0]) == 0
word (32-bit) $clog2(adr[1:0]) == 0
dble (64-bit) $clog2(adr[2:0]) == 0
quad (128-bit) $clog2(adr[2:0]) == 0

The protocol endianness can be either:

  • endianness agnostic, only supporting aligned transfers,
  • little endian,
  • big endian,
  • a special case is defined for RISC-V instruction fetch of compressed instructions.

Endianness agnostic (aligned)

The TCB protocol can be endianness agnostic, as long as the address is aligned to the data width.

In this mode, address LSB bits adr[$clog2(BEW)-1:0] are zero while driven by a manager and ignored while sampled by a subordinate. For consistency they should still be part of the address vector.

The manager encodes the address of data transfers smaller than the full data bus width (DBW) using only byte enable (BEN). The mapping of aligned accesses for little/big endian managers is shown in the following chapters.

Little endian (any alignment)

size alignment adr[1:0] ben[3:0] wdt[31:00]/rdt[31:00]
byte aligned 2'd0 4'b0001 { 8'bXX, 8'bXX, 8'bXX, [07:00]}
byte aligned 2'd1 4'b0010 { 8'bXX, 8'bXX, [07:00], 8'bXX}
byte aligned 2'd2 4'b0100 { 8'bXX, [07:00], 8'bXX, 8'bXX}
byte aligned 2'd3 4'b1000 {[07:00], 8'bXX, 8'bXX, 8'bXX}
half aligned 2'd0 4'b0011 { 8'bXX, 8'bXX, [15:08], [07:00]}
half misaligned 2'd1 4'b0110 { 8'bXX, [15:08], [07:00], 8'bXX}
half aligned 2'd2 4'b1100 {[15:08], [07:00], 8'bXX, 8'bXX}
half misaligned 2'd3 4'b1001 {[07:00], 8'bXX, 8'bXX, [15:08]}
word aligned 2'd0 4'b1111 {[31:24], [23:16], [15:08], [07:00]}
word misaligned 2'd1 4'b1111 {[23:16], [15:08], [07:00], [31:24]}
word misaligned 2'd2 4'b1111 {[15:08], [07:00], [31:24], [23:16]}
word misaligned 2'd3 4'b1111 {[07:00], [31:24], [23:16], [15:08]}

Big endian (any alignment)

size alignment adr[1:0] ben[0:3] wdt[00:31]/rdt[00:31]
byte aligned 2'd0 4'b1000 {[00:07], 8'bXX, 8'bXX, 8'bXX}
byte aligned 2'd1 4'b0100 { 8'bXX, [00:07], 8'bXX, 8'bXX}
byte aligned 2'd2 4'b0010 { 8'bXX, 8'bXX, [00:07], 8'bXX}
byte aligned 2'd3 4'b0001 { 8'bXX, 8'bXX, 8'bXX, [00:07]}
half aligned 2'd0 4'b1100 {[00:07], [08:15], 8'bXX, 8'bXX}
half misaligned 2'd1 4'b0110 { 8'bXX, [00:07], [08:15], 8'bXX}
half aligned 2'd2 4'b0011 { 8'bXX, 8'bXX, [00:07], [08:15]}
half misaligned 2'd3 4'b1001 {[08:15], 8'bXX, 8'bXX, [00:07]}
word aligned 2'd0 4'b1111 {[00:07], [08:15], [16:23], [24:31]}
word misaligned 2'd1 4'b1111 {[24:31], [00:07], [08:15], [16:23]}
word misaligned 2'd2 4'b1111 {[16:23], [24:31], [00:07], [08:15]}
word misaligned 2'd3 4'b1111 {[08:15], [16:23], [24:31], [00:07]}

Misalignment handler

Two different implementations

  1. Performs 2 accesses and stitches them together, optionally caches one or more unused parts of previous accesses.
  2. Splits the bus into narrower busses, and increments the address.

Access cycles

Read/write transfer cycles are shown with common response delays (parameter DLY) of 0, 1 and 2 clock periods.

  • DLY=0 is the case with a combinational response to a request. This can be used in case multiple simple subordinates are combined into an interconnect segment. Such a segment can then be combined with a TCB register slice tcb_register_slice to break long timing paths at either the request path, response path or both to improve timing. Such collections can be used to achieve better area timing compromises, compared to using subordinates with integrated registers.
  • DLY=1 is the most common delay for subordinates with SRAM as an example, this is also the HDL default.
  • DLY=2 is the case where a single subordinate or a segment of the interconnect with DLY=1 would have an extra register added to the request path (address decoder) or response path (read data multiplexer) to improve timing.

Write transfer

A write transfer is performed when both handshake signals vld and rdy are simultaneously active and the write enable signal wen is also active.

Only bytes with an active corresponding byte enable bit in ben are written. The other bytes can be optimized to unchanged value, zeros or just undefined, depending what brings the preferred optimization for area timing, power consumption, ... The same optimization principle can be applied to all signals when valid is not active.

There are no special pipelining considerations for write transfers, all signals shall be propagated through a pipeline, similar to a single direction data stream

The base protocol does not have a mechanism for confirming write transfers reached their destination and were successfully applied.

Write transfer

Read transfer

A read transfer is performed when both handshake signals vld and rdy are simultaneously active and the write enable signal wen is not active.

The handshake is done during the arbitration phase, it is primarily about whether the address adr from the manager can reach the subordinate.

Read data is available on rdt after a fixed delay of 1 clock cycle from the transfer.

NOTE: in contrast to most interconnect standards, TCB specifies the use of byte enable signals ben to enable or disable read from each byte.

Read transfer

Repeat access transfer

TODO: think this through.

The basic idea behind the repeat access transfer is to avoid repeated reads from the same SRAM address. During a pipeline stall the CPU instruction fetch interface must remember the instruction by keeping it in a fetch register. A fetch register affects area and timing (admittedly not very much).

The fetch register can be avoided by repeating the instruction read from the SRAM. This redundant read can be avoided by taking advantage of SRAM functionality, where the last data read remains available on the read data port till the next read or a power cycle.

The repeat access signal rpt is intended to tell the SRAM to not perform another read from the same address. The interconnect would propagate the rpt as active only in case

Arbitration locking mechanism

Arbitration locking is used in the TCB reference implementation library to:

  • Keep atomicity in data bus width conversion from a wider manager to a narrower subordinate. For example an atomic 64-bit read/write access over a 32-bit interconnect.
  • Keep atomicity while converting a misaligned access into multiple aligned accesses.

It can also be used for read modify write, and similar operations and for QoS control.

Signal timing

While timing is not strictly part of the protocol, following recommendations across the entire design allows for optimizing the compromise between high clock speed and low latency.

It is important to note the recommended timing is somehow opposite to what is usually recommended for RTL modules. The common recommendation is to place registers at module (hierarchical boundary) outputs and optionally at module inputs.

For TCB it is recommended to place registers on the request signal path and keep the response path combinational.

The recommendation is intended to match the timing of SRAM memories common in FPGA and ASIC designs. SRAM memories usually have registers on all input signals (TCB request), giving inputs a low setup time. The read data output path is a mixed signal (analog+digital) combinational logic with a high clock to output delay.

This are a few SRAM examples:

As an example when a TCB peripheral is placed in the same address space as a SRAM block. Placing a register at the peripheral request inputs matches the low setup time of SRAM. On the peripheral response output combinational logic can add as much clock to output delay as specified for SRAM, without affecting overall interconnect timing.

Request/response signal timing

Limitations and undefined features

There are some generalizations and additional features that can be implemented, but were not researched well enough to be fully defined.

Data output hold

SRAM usually holds the data output from the last read request, till a new request is processed. In a similar fashion, the entire bus could hold the last read value, this means read data multiplexers in decoder modules have to hold. The held data can be lost if a subordinate is accessed by another manager.

Read data hold can be useful during CPU stalls. Either there is no need to repeat a read or a temporary buffer for read data can be avoided.

Out of order transfers

Out of order reads are not supported.

Generalized read delay

The delay of 0 would be an asynchronous read, a delay of 1 is equal to a common SRAM read cycle, longer delays can be caused by registers in the system bus interconnect.

Integration with standard system busses

It is possible to translate between the processor native system bus and standard system busses like APB, AHB, AXI4-Lite, Wishbone, ...

Such translation could compromise the performance, so it might make sense to implement a standard bus interface unit (BIU) separately inside the processor core, instead of attaching translators to the optimized native bus.

Write confirmation

Write confirmation is returned with the same timing as read data.

In case the native system bus is only used for the intend purpose of connecting tightly coupled memories, writes can be assumed to always succeed.

Write through cache access was not yet researched.

Atomic access

TODO, on some implementations it might be possible to simultaneously perform both read and write.