From 2f9f0111edf955a9263308a95b49912b71f742c4 Mon Sep 17 00:00:00 2001 From: Jerry Zhao Date: Mon, 18 Nov 2024 07:26:03 -0800 Subject: [PATCH] Update docs --- docs/background.adoc | 15 ++++++++++----- docs/index.html | 33 +++++++++++++++++++++------------ docs/memory.adoc | 2 +- docs/programming.adoc | 2 -- docs/system.adoc | 8 +++++++- docs/tex/.gitignore | 6 ++++++ docs/tex/background.tex | 12 ++++++++---- docs/tex/memory.tex | 2 +- docs/tex/programming.tex | 2 -- docs/tex/system.tex | 9 ++++++++- 10 files changed, 62 insertions(+), 29 deletions(-) create mode 100644 docs/tex/.gitignore diff --git a/docs/background.adoc b/docs/background.adoc index d4a8467..de82df3 100644 --- a/docs/background.adoc +++ b/docs/background.adoc @@ -109,7 +109,7 @@ Furthermore, DSP applications often require more regularly behaved memory system Applications and microarchitectures which prefer statically predictable memory systems are especially well-suited for VLIW ISAs. However, VLIW-based ISAs are notoriously difficult to program compared to general-purpose ISAs or vector ISAs. -Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelned loops. +Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelined loops. Nonetheless, specialized VLIW ISAs provide a microarchitecturally simple and efficient programmer-visible mechanism for maintaining high instruction throughput into SIMD functional units. Cadence, Synopsys, CEVA, and Qualcomm all ship commercial VLIW DSPs with SIMD extensions. @@ -185,6 +185,8 @@ The `LMUL` (length multiplier) register grouping field of `vtype` enables groupi In addition to enabling mixed-precision operations, this feature allows kernels that do not induce vector register pressure to access an effectively longer hardware vector length. 
Generally, performance programmers for RISC-V will use this feature to reduce the dynamic instruction count of their loops and potentially improve the utilization of hardware compute resources. For example, vector `memcpy` induces no register pressure and can trivially set a high `LMUL` to reduce dynamic instruction count. +Since higher `LMUL` settings will unroll instructions in hardware, `LMUL` also reduces static code size by reducing the need for unrolling loops in software. + Thus, implementations should not penalize code which uses high `LMUL` to reduce instruction fetch pressure. The general intuition around vector code should be to use the highest `LMUL` setting while avoiding register spills. @@ -200,10 +202,13 @@ Segmented memory instructions enable a "transpose" of an "array-of-structs" data representation in memory into a "struct-of-arrays" in consecutive vector registers. Such instructions, while very complex behaviorally, are fundamental to many algorithms and datatypes. For instance, complex numbers and image pixel data are conventionally stored in memory as "arrays-of-structs". -//Segmented memory access instructions can also be used to perform on-the-fly reformatting into vector registers. -These instructions can significantly reduce programmer burden, and thus performant RVV implementations should not impose an excess performance overhead from their execution. -Vector code which uses these memory operations to reduce dynamic instruction count should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions. + +These instructions are critical for repacking data in memory into element-wise format for vector instructions. 
+Compared to other vector or SIMD ISAs, RVV provides few facilities for register-register repacking, instead relying on segmented memory instructions to perform "on-the-fly" repacking between memory and registers. + +Given the importance of these instructions, performant RVV implementations should not impose an excess performance overhead from their execution. +Vector codes which use these memory operations should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions. === Short-Vector Execution @@ -291,7 +296,7 @@ image::diag/ooo-simd.png[OOO SIMD Pipeline,width=40%,align=center,title-align=ce Notably, as these machines are typically designed with single-chime instruction execution, high instruction throughput is necessary to maintain high utilization of multiple datapaths. Furthermore, register renaming is required to enable execution past the WAW and WAR hazards in this example loop. -Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, or speculative execution, or register-renaming. +Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, speculative execution, or register-renaming. Efficient scheduling of short-chime vector instructions with a limited capability for out-of-order execution is sufficient for maintaining datapath utilization on memory workloads, even with a minimal in-order scalar core. //Efficient and precise vector operation scheduling, rather than high instruction throughput, is key to maintaining SIMD datapath utilization. diff --git a/docs/index.html b/docs/index.html index b548264..6a07da1 100644 --- a/docs/index.html +++ b/docs/index.html @@ -4,7 +4,7 @@ - + The Saturn Microarchitecture Manual @@ -451,7 +451,7 @@

The Saturn Microarchitecture Manual

Authors: Jerry Zhao, Daniel Grubb, Miles Rusch, Tianrui Wei, Kevin Anderson, Borivoje Nikolic, Krste Asanovic
version v1.0.0, -2024-11-12 +2024-11-17
Release
@@ -730,7 +730,7 @@

1.2.4. VLIW ISAs with SIMD

However, VLIW-based ISAs are notoriously difficult to program compared to general-purpose ISAs or vector ISAs. -Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelned loops. +Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelined loops. Nonetheless, specialized VLIW ISAs provide a microarchitecturally simple and efficient programmer-visible mechanism for maintaining high instruction throughput into SIMD functional units.

@@ -817,7 +817,8 @@

1.3.4. Vector Register Grouping

The LMUL (length multiplier) register grouping field of vtype enables grouping consecutive vector registers into a single longer vector register. In addition to enabling mixed-precision operations, this feature allows kernels that do not induce vector register pressure to access an effectively longer hardware vector length. Generally, performance programmers for RISC-V will use this feature to reduce the dynamic instruction count of their loops and potentially improve the utilization of hardware compute resources. -For example, vector memcpy induces no register pressure and can trivially set a high LMUL to reduce dynamic instruction count.

+For example, vector memcpy induces no register pressure and can trivially set a high LMUL to reduce dynamic instruction count. +Since higher LMUL settings will unroll instructions in hardware, LMUL also reduces static code size by reducing the need for unrolling loops in software.
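The `LMUL` trade-off described here can be illustrated with a small model (an illustrative sketch, not code from this repository): a strip-mined RVV loop processes at most `VLMAX = LMUL * VLEN / SEW` elements per iteration, so raising `LMUL` proportionally cuts the number of strip-mine iterations and hence the dynamic instruction count.

```python
import math

def elements_per_iter(vlen_bits, sew_bits, lmul):
    # VLMAX per vsetvli: LMUL * VLEN / SEW elements
    return (vlen_bits // sew_bits) * lmul

def stripmine_iters(n, vlen_bits, sew_bits, lmul):
    # Number of strip-mined loop iterations for an n-element workload
    return math.ceil(n / elements_per_iter(vlen_bits, sew_bits, lmul))

# A 1024-element memcpy of 32-bit elements on a VLEN=128 machine:
# LMUL=1 moves 4 elements/iteration (256 iterations), while
# LMUL=8 moves 32 elements/iteration (32 iterations).
```

The same 8x reduction applies to static code size when the loop body would otherwise be unrolled in software.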

Thus, implementations should not penalize code which uses high LMUL to reduce instruction fetch pressure. @@ -838,8 +839,12 @@

1.3.5. Segmented Memory Instructions

-

These instructions can significantly reduce programmer burden, and thus performant RVV implementations should not impose an excess performance overhead from their execution. -Vector code which uses these memory operations to reduce dynamic instruction count should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions.

+

These instructions are critical for repacking data in memory into element-wise format for vector instructions. +Compared to other vector or SIMD ISAs, RVV provides few facilities for register-register repacking, instead relying on segmented memory instructions to perform "on-the-fly" repacking between memory and registers.

+
+
+

Given the importance of these instructions, performant RVV implementations should not impose an excess performance overhead from their execution. +Vector codes which use these memory operations should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions.
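The "transpose" behavior of these instructions can be sketched with a small model (illustrative only; register and field counts are hypothetical): a unit-stride segmented load such as `vlseg3e8.v` reads an array-of-structs and deposits one field per destination register group, yielding a struct-of-arrays layout.

```python
def segmented_load(mem, base, nfields, vl):
    """Model of an RVV unit-stride segmented load (e.g. vlseg3e8.v):
    reads vl segments of nfields consecutive elements starting at base,
    returning one destination register (list) per field."""
    regs = [[] for _ in range(nfields)]
    for seg in range(vl):
        for f in range(nfields):
            regs[f].append(mem[base + seg * nfields + f])
    return regs

# Interleaved RGB pixel data in memory: one register per channel after the load.
mem = [10, 20, 30, 11, 21, 31, 12, 22, 32]  # R,G,B for 3 pixels
r, g, b = segmented_load(mem, 0, 3, 3)
```

Performing this repacking with ordinary vector instructions instead would take many slides, gathers, or strided accesses, which is why the text asks implementations not to penalize the segmented forms.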

@@ -955,7 +960,7 @@

1.4.2. Compared to General-purp Furthermore, register renaming is required to enable execution past the WAW and WAR hazards in this example loop.

-

Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, or speculative execution, or register-renaming. +

Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, speculative execution, or register-renaming. Efficient scheduling of short-chime vector instructions with a limited capability for out-of-order execution is sufficient for maintaining datapath utilization on memory workloads, even with a minimal in-order scalar core.

@@ -1015,7 +1020,7 @@

2.1. Organization

The load/store paths within the VLSU execute independently and communicate with the VU through load-response and store-data ports.

-

The Vector Datapath (VU) contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), and the vector register file (VRF), and the SIMD arithmetic functional units. +

The Vector Datapath (VU) contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), the vector register file (VRF), and the SIMD arithmetic functional units. The functional units (VFUs) are arranged in execution unit clusters (VEUs), where each VEU is fed by one sequencer. The sequencers schedule register read/write and issue operations into the VEUs, while interlocking on structural and data hazards. The VU is organized as a unified structure with a SIMD datapath, instead of distributing the VRF and VEUs across vector lanes. @@ -1048,6 +1053,13 @@

2.2. Key Ideas

This approach can tolerate high memory latencies with minimal hardware cost.

+

Saturn is designed around two key parameters: VLEN and DLEN. +VLEN is the vector length of each vector register, as defined in the architecture specification. +DLEN is a microarchitectural detail that describes the datapath width for each of the SIMD-style datapaths in Saturn. +Specifically, the load pipe, store pipe, and SIMD arithmetic pipes are all designed to sustain DLEN bits per cycle, regardless of element width. +Future versions of Saturn may allow a narrower memory interface width (MLEN) than DLEN.
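The relationship between these two parameters determines instruction occupancy. As a rough illustrative model (a sketch, not a description of Saturn's exact pipeline timing), a DLEN-wide pipe needs about LMUL * VLEN / DLEN cycles to drain the work of one vector instruction:

```python
def chime_cycles(vlen_bits, dlen_bits, lmul=1):
    """Minimum cycles a DLEN-bits-per-cycle pipe needs to process one
    vector instruction's LMUL * VLEN bits of work. Element-width
    independent, since the pipes sustain DLEN bits/cycle regardless."""
    total_bits = lmul * vlen_bits
    return max(1, total_bits // dlen_bits)

# With VLEN=256 and DLEN=128, an LMUL=1 instruction occupies a pipe
# for 2 cycles; at LMUL=8 it occupies the pipe for 16 cycles.
```

Under this model a VLEN/DLEN ratio above 1 gives each instruction a multi-cycle chime, which is what lets a modest-issue-rate scalar core keep the vector datapaths busy.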

+
+

Saturn still supports a limited, but sufficient capability for out-of-order execution. The load, store, and execute paths in the VU execute independently, dynamically stalling for structural and data hazards without requiring full in-order execution. Allowing dynamic "slip" between these paths naturally implies out-of-order execution. @@ -1366,7 +1378,7 @@

4.1. Memory System

Saturn configurations with high DLEN would generally require higher memory bandwidth. However, scaling up the system-level interconnect to meet Saturn’s bandwidth demands may be prohibitively costly. -Instead, the preferred approach for high-DLEN Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled-memory), which software should treat as a software-managed cache for vector accesses. +Instead, the preferred approach for high-DLEN Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled memory), which software should treat as a software-managed cache for vector accesses. This TCM should be tile-local and globally addressable, but not necessarily cacheable. Figure 17 depicts a Saturn configuration with a high-bandwidth TCM, but a reduced-bandwidth system interconnect.

@@ -2002,9 +2014,6 @@

6.4. Optimizing Around Pipeline L

To saturate the FMA units in this scenario, either a longer LMUL should be used, or independent FMAs must be scheduled back-to-back. Generally, performant code should use the highest LMUL possible that avoids vector register spilling.
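This saturation condition can be captured in a toy model (hypothetical parameters; the real pipeline depths are machine-specific): a dependent FMA chain stalls for the unit's latency unless each instruction carries enough cycles of work, or enough independent chains are interleaved.

```python
def fma_utilization(vlen_bits, dlen_bits, lmul, latency, independent_chains=1):
    """Fraction of FMA-unit cycles kept busy by dependent FMA chains.
    Each instruction occupies the pipe for lmul * VLEN / DLEN cycles;
    a chain can only issue its next FMA after `latency` cycles."""
    busy_cycles = independent_chains * (lmul * vlen_bits // dlen_bits)
    return min(1.0, busy_cycles / latency)

# VLEN=DLEN=128, 4-cycle FMA: a single LMUL=1 dependent chain reaches
# only 25% utilization, but LMUL=8 (or 4 interleaved chains) saturates.
```

This mirrors the guidance above: raise LMUL first, and fall back to software interleaving of independent FMAs only when register pressure forbids a longer LMUL.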

-
-

Refer to Chapter 5 for details on each of the vector functional units and their default pipeline depths.

-

6.5. Optimizing Segmented Memory Accesses

diff --git a/docs/memory.adoc b/docs/memory.adoc index 4babd9c..d776637 100644 --- a/docs/memory.adoc +++ b/docs/memory.adoc @@ -63,7 +63,7 @@ image::diag/memtcm.png[TCM memory system,width=55%,align=center,title-align=cent Saturn configurations with high `DLEN` would generally require higher memory bandwidth. However, scaling up the system-level interconnect to meet Saturn's bandwidth demands may be prohibitively costly. -Instead, the preferred approach for high-`DLEN` Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled-memory), which software should treat as a software-managed cache for vector accesses. +Instead, the preferred approach for high-`DLEN` Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled memory), which software should treat as a software-managed cache for vector accesses. This TCM should be tile-local and globally addressable, but not necessarily cacheable. <> depicts a Saturn configuration with a high-bandwidth TCM, but a reduced-bandwidth system interconnect. diff --git a/docs/programming.adoc b/docs/programming.adoc index 4886ba5..9a3595a 100644 --- a/docs/programming.adoc +++ b/docs/programming.adoc @@ -67,8 +67,6 @@ This situation is rare due to the support for chaining, but might still appear i To saturate the FMA units in this scenario, either a longer `LMUL` should be used, or independent FMAs must be scheduled back-to-back. Generally, performant code should use the highest `LMUL` possible that avoids vector register spilling. -Refer to <> for details on each of the vector functional units and their default pipeline depths. 
- === Optimizing Segmented Memory Accesses diff --git a/docs/system.adoc b/docs/system.adoc index ae9e6f1..2cda7ac 100644 --- a/docs/system.adoc +++ b/docs/system.adoc @@ -22,7 +22,7 @@ The *Vector Load-Store Unit (VLSU)* performs vector address generation and memor Inflight vector memory instructions are tracked in the vector load-instruction-queue (VLIQ) and store-instruction-queue (VSIQ). The load/store paths within the VLSU execute independently and communicate with the VU through load-response and store-data ports. -The *Vector Datapath (VU)* contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), and the vector register file (VRF), and the SIMD arithmetic functional units. +The *Vector Datapath (VU)* contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), the vector register file (VRF), and the SIMD arithmetic functional units. The functional units (VFUs) are arranged in execution unit clusters (VEUs), where each VEU is fed by one sequencer. The sequencers schedule register read/write and issue operations into the VEUs, while interlocking on structural and data hazards. The VU is organized as a unified structure with a SIMD datapath, instead of distributing the VRF and VEUs across vector lanes. @@ -49,6 +49,12 @@ Shallow instruction queues in the VU act as "decoupling" queues, enabling the VL Similarly, the VLSU's store path can run many cycles behind the VU through the decoupling enabled by the VSIQ. This approach can tolerate high memory latencies with minimal hardware cost. +Saturn is designed around two key parameters: *`VLEN` and `DLEN`*. +`VLEN` is the vector length of each vector register, as defined in the architecture specification. +`DLEN` is a microarchitectural detail that describes the datapath width for each of the SIMD-style datapaths in Saturn. +Specifically, the load pipe, store pipe, and SIMD arithmetic pipes are all designed to sustain `DLEN` bits per cycle, regardless of element width. 
+Future versions of Saturn may allow a narrower memory interface width (`MLEN`) than `DLEN`. + Saturn still supports a limited, but sufficient capability for *out-of-order execution*. The load, store, and execute paths in the VU execute independently, dynamically stalling for structural and data hazards without requiring full in-order execution. Allowing dynamic "slip" between these paths naturally implies out-of-order execution. diff --git a/docs/tex/.gitignore b/docs/tex/.gitignore new file mode 100644 index 0000000..f1ae721 --- /dev/null +++ b/docs/tex/.gitignore @@ -0,0 +1,6 @@ +*.aux +*.bbl +*.log +*.pdf +*.blg +*.toc \ No newline at end of file diff --git a/docs/tex/background.tex b/docs/tex/background.tex index 68eae93..94b466e 100644 --- a/docs/tex/background.tex +++ b/docs/tex/background.tex @@ -95,7 +95,7 @@ \subsubsection{VLIW ISAs with SIMD} Applications and microarchitectures which prefer statically predictable memory systems are especially well-suited for VLIW ISAs. However, VLIW-based ISAs are notoriously difficult to program compared to general-purpose ISAs or vector ISAs. -Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelned loops. +Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelined loops. Nonetheless, specialized VLIW ISAs provide a microarchitecturally simple and efficient programmer-visible mechanism for maintaining high instruction throughput into SIMD functional units. Cadence, Synopsys, CEVA, and Qualcomm all ship commercial VLIW DSPs with SIMD extensions. @@ -166,6 +166,7 @@ \subsubsection{Vector Register Grouping} In addition to enabling mixed-precision operations, this feature allows kernels that do not induce vector register pressure to access an effectively longer hardware vector length. 
Generally, performance programmers for RISC-V will use this feature to reduce the dynamic instruction count of their loops and potentially improve the utilization of hardware compute resources. For example, vector \texttt{memcpy} induces no register pressure and can trivially set a high \texttt{LMUL} to reduce dynamic instruction count. +Since higher \texttt{LMUL} settings will unroll instructions in hardware, \texttt{LMUL} also reduces static code size by reducing the need for unrolling loops in software. Thus, implementations should not penalize code which uses high \texttt{LMUL} to reduce instruction fetch pressure. The general intuition around vector code should be to use the highest \texttt{LMUL} setting while avoiding register spills. @@ -182,8 +183,11 @@ \subsubsection{Segmented Memory Instructions} Such instructions, while very complex behaviorally, are fundamental to many algorithms and datatypes. For instance, complex numbers and image pixel data are conventionally stored in memory as "arrays-of-structs". -These instructions can significantly reduce programmer burden, and thus performant RVV implementations should not impose an excess performance overhead from their execution. -Vector code which uses these memory operations to reduce dynamic instruction count should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions. +These instructions are critical for repacking data in memory into element-wise format for vector instructions. +Compared to other vector or SIMD ISAs, RVV provides few facilities for register-register repacking, instead relying on segmented memory instructions to perform ``on-the-fly'' repacking between memory and registers. + +Given the importance of these instructions, performant RVV implementations should not impose an excess performance overhead from their execution. 
+Vector codes which use these memory operations should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions. \newpage \subsection{Short-Vector Execution} @@ -287,7 +291,7 @@ \subsubsection{Compared to General-purpose SIMD Cores} Notably, as these machines are typically designed with single-chime instruction execution, high instruction throughput is necessary to maintain high utilization of multiple datapaths. Furthermore, register renaming is required to enable execution past the WAW and WAR hazards in this example loop. -Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, or speculative execution, or register-renaming. +Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, speculative execution, or register-renaming. Efficient scheduling of short-chime vector instructions with a limited capability for out-of-order execution is sufficient for maintaining datapath utilization on memory workloads, even with a minimal in-order scalar core. \subsubsection{Compared to VLIW + SIMD DSP Cores} diff --git a/docs/tex/memory.tex b/docs/tex/memory.tex index b456693..87d4500 100644 --- a/docs/tex/memory.tex +++ b/docs/tex/memory.tex @@ -72,7 +72,7 @@ \subsection{Memory System} Saturn configurations with high \texttt{DLEN} would generally require higher memory bandwidth. However, scaling up the system-level interconnect to meet Saturn's bandwidth demands may be prohibitively costly. -Instead, the preferred approach for high-\texttt{DLEN} Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled-memory), which software should treat as a software-managed cache for vector accesses. 
+Instead, the preferred approach for high-\texttt{DLEN} Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled memory), which software should treat as a software-managed cache for vector accesses. This TCM should be tile-local and globally addressable, but not necessarily cacheable. Figure \ref{fig:mem-tcm} depicts a Saturn configuration with a high-bandwidth TCM, but a reduced-bandwidth system interconnect. diff --git a/docs/tex/programming.tex b/docs/tex/programming.tex index d597c2c..7771f38 100644 --- a/docs/tex/programming.tex +++ b/docs/tex/programming.tex @@ -68,8 +68,6 @@ \subsection{Optimizing Around Pipeline Latencies} To saturate the FMA units in this scenario, either a longer \texttt{LMUL} should be used, or independent FMAs must be scheduled back-to-back. Generally, performant code should use the highest \texttt{LMUL} possible that avoids vector register spilling. -Refer to <> for details on each of the vector functional units and their default pipeline depths. - \subsection{Optimizing Segmented Memory Accesses} diff --git a/docs/tex/system.tex b/docs/tex/system.tex index c7f4fce..5d8ef84 100644 --- a/docs/tex/system.tex +++ b/docs/tex/system.tex @@ -27,7 +27,7 @@ \subsection{Organization} Inflight vector memory instructions are tracked in the vector load-instruction-queue (VLIQ) and store-instruction-queue (VSIQ). The load/store paths within the VLSU execute independently and communicate with the VU through load-response and store-data ports. -The \textbf{Vector Datapath (VU)} contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), and the vector register file (VRF), and the SIMD arithmetic functional units. +The \textbf{Vector Datapath (VU)} contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), the vector register file (VRF), and the SIMD arithmetic functional units. The functional units (VFUs) are arranged in execution unit clusters (VEUs), where each VEU is fed by one sequencer. 
The sequencers schedule register read/write and issue operations into the VEUs, while interlocking on structural and data hazards. The VU is organized as a unified structure with a SIMD datapath, instead of distributing the VRF and VEUs across vector lanes. @@ -61,6 +61,13 @@ \subsection{Key Principles} To track data hazards, all vector instructions in the VU and VLSU are tagged with a "vector age tag (VAT)". The VATs are eagerly allocated and freed, and referenced in the machine wherever the relative age of two instructions is ambiguous. +Saturn is designed around two key parameters: \textbf{\texttt{VLEN} and \texttt{DLEN}}. +\texttt{VLEN} is the vector length of each vector register, as defined in the architecture specification. +\texttt{DLEN} is a microarchitectural detail that describes the datapath width for each of the SIMD-style datapaths in Saturn. +Specifically, the load pipe, store pipe, and SIMD arithmetic pipes are all designed to sustain \texttt{DLEN} bits per cycle, regardless of element width. +Future versions of Saturn may allow a narrower memory interface width (\texttt{MLEN}) than \texttt{DLEN}. + + To decode vector instructions, the Saturn generator implements a \textbf{decode-database}-driven methodology for vector decode. The Saturn generator tabularly describes a concise list of all vector control signals for all vector instructions. Within the generator of the VU, control signals are extracted from the pipeline stages using a generator-time query into the instruction listings.