
Commit

Update docs
jerryz123 committed Nov 18, 2024
1 parent d4135bc commit 2f9f011
Showing 10 changed files with 62 additions and 29 deletions.
15 changes: 10 additions & 5 deletions docs/background.adoc
@@ -109,7 +109,7 @@ Furthermore, DSP applications often require more regularly behaved memory system
Applications and microarchitectures which prefer statically predictable memory systems are especially well-suited for VLIW ISAs.

However, VLIW-based ISAs are notoriously difficult to program compared to general-purpose ISAs or vector ISAs.
Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelned loops.
Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelined loops.
Nonetheless, specialized VLIW ISAs provide a microarchitecturally simple and efficient programmer-visible mechanism for maintaining high instruction throughput into SIMD functional units.

Cadence, Synopsys, CEVA, and Qualcomm all ship commercial VLIW DSPs with SIMD extensions.
@@ -185,6 +185,8 @@ The `LMUL` (length multiplier) register grouping field of `vtype` enables groupi
In addition to enabling mixed-precision operations, this feature allows kernels that do not induce vector register pressure to access an effectively longer hardware vector length.
Generally, performance programmers for RISC-V will use this feature to reduce the dynamic instruction count of their loops and potentially improve the utilization of hardware compute resources.
For example, vector `memcpy` induces no register pressure and can trivially set a high `LMUL` to reduce dynamic instruction count.
Since higher `LMUL` settings unroll instructions in hardware, `LMUL` also reduces static code size by reducing the need to unroll loops in software.


Thus, implementations should not penalize code which uses high `LMUL` to reduce instruction fetch pressure.
The general intuition around vector code should be to use the highest `LMUL` setting while avoiding register spills.
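
For illustration, a byte-wise `memcpy` at `LMUL=8` can look like the following sketch (not taken from the Saturn sources; the register choices are arbitrary):

[source,asm]
----
# a0 = dst, a1 = src, a2 = remaining bytes
copy_loop:
    vsetvli t0, a2, e8, m8, ta, ma   # vl = bytes handled this iteration (up to 8 registers' worth)
    vle8.v  v0, (a1)                 # load vl bytes into the v0-v7 register group
    vse8.v  v0, (a0)                 # store them to the destination
    add     a1, a1, t0
    add     a0, a0, t0
    sub     a2, a2, t0
    bnez    a2, copy_loop
----

Each iteration moves up to `8*VLEN` bits, cutting the dynamic instruction count roughly eightfold relative to `LMUL=1` without the static code growth of an eight-way software-unrolled loop.
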
@@ -200,10 +202,13 @@ Alternatively, the addition of queueing resources to reduce this pressure would
Segmented memory instructions enable a "transpose" of an "array-of-structs" data representation in memory into a "struct-of-arrays" in consecutive vector registers.
Such instructions, while very complex behaviorally, are fundamental to many algorithms and datatypes.
For instance, complex numbers and image pixel data are conventionally stored in memory as "arrays-of-structs".
//Segmented memory access instructions can also be used to perform on-the-fly reformatting into vector registers.

These instructions can significantly reduce programmer burden, and thus performant RVV implementations should not impose an excess performance overhead from their execution.
Vector code which uses these memory operations to reduce dynamic instruction count should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions.

These instructions are critical for repacking data stored in memory into an element-wise format for vector processing.
Compared to other vector or SIMD ISAs, RVV provides few facilities for register-register repacking, instead relying on segmented memory instructions to perform "on-the-fly" repacking between memory and registers.

Given the importance of these instructions, performant RVV implementations should not impose an excess performance overhead from their execution.
Vector codes which use these memory operations should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions.
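
As an illustrative sketch (not drawn from the Saturn sources), a two-field segmented load splits an interleaved single-precision complex array into separate real and imaginary vectors:

[source,asm]
----
# a0 -> interleaved complex data in memory: re0, im0, re1, im1, ...
# a1  = number of complex elements to process
vsetvli     t0, a1, e32, m1, ta, ma
vlseg2e32.v v0, (a0)        # field 0 (re) -> v0, field 1 (im) -> v1
# v0 = re0..re[vl-1], v1 = im0..im[vl-1], ready for element-wise arithmetic
----

Performing the same repacking explicitly would require strided loads, or unit-stride loads followed by register-register shuffles, typically at a higher dynamic instruction count or lower memory throughput.
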

=== Short-Vector Execution

@@ -291,7 +296,7 @@ image::diag/ooo-simd.png[OOO SIMD Pipeline,width=40%,align=center,title-align=ce
Notably, as these machines are typically designed with single-chime instruction execution, high instruction throughput is necessary to maintain high utilization of multiple datapaths.
Furthermore, register renaming is required to enable execution past the WAW and WAR hazards in this example loop.

Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, or speculative execution, or register-renaming.
Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, speculative execution, or register renaming.
Efficient scheduling of short-chime vector instructions with a limited capability for out-of-order execution is sufficient for maintaining datapath utilization on memory workloads, even with a minimal in-order scalar core.
//Efficient and precise vector operation scheduling, rather than high instruction throughput, is key to maintaining SIMD datapath utilization.

33 changes: 21 additions & 12 deletions docs/index.html
@@ -4,7 +4,7 @@
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="generator" content="Asciidoctor 2.0.23">
<meta name="generator" content="Asciidoctor 2.0.17">
<meta name="description" content="Micro-architecture of the Saturn Vector Unit">
<meta name="author" content="Authors: Jerry Zhao, Daniel Grubb, Miles Rusch, Tianrui Wei, Kevin Anderson, Borivoje Nikolic, Krste Asanovic">
<title>The Saturn Microarchitecture Manual</title>
@@ -451,7 +451,7 @@ <h1>The Saturn Microarchitecture Manual</h1>
<div class="details">
<span id="author" class="author">Authors: Jerry Zhao, Daniel Grubb, Miles Rusch, Tianrui Wei, Kevin Anderson, Borivoje Nikolic, Krste Asanovic</span><br>
<span id="revnumber">version v1.0.0,</span>
<span id="revdate">2024-11-12</span>
<span id="revdate">2024-11-17</span>
<br><span id="revremark">Release</span>
</div>
<div id="toc" class="toc2">
@@ -730,7 +730,7 @@ <h4 id="_vliw_isas_with_simd">1.2.4. VLIW ISAs with SIMD</h4>
</div>
<div class="paragraph">
<p>However, VLIW-based ISAs are notoriously difficult to program compared to general-purpose ISAs or vector ISAs.
Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelned loops.
Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelined loops.
Nonetheless, specialized VLIW ISAs provide a microarchitecturally simple and efficient programmer-visible mechanism for maintaining high instruction throughput into SIMD functional units.</p>
</div>
<div class="paragraph">
@@ -817,7 +817,8 @@ <h4 id="_vector_register_grouping">1.3.4. Vector Register Grouping</h4>
<p>The <code>LMUL</code> (length multiplier) register grouping field of <code>vtype</code> enables grouping consecutive vector registers into a single longer vector register.
In addition to enabling mixed-precision operations, this feature allows kernels that do not induce vector register pressure to access an effectively longer hardware vector length.
Generally, performance programmers for RISC-V will use this feature to reduce the dynamic instruction count of their loops and potentially improve the utilization of hardware compute resources.
For example, vector <code>memcpy</code> induces no register pressure and can trivially set a high <code>LMUL</code> to reduce dynamic instruction count.</p>
For example, vector <code>memcpy</code> induces no register pressure and can trivially set a high <code>LMUL</code> to reduce dynamic instruction count.
Since higher <code>LMUL</code> settings unroll instructions in hardware, <code>LMUL</code> also reduces static code size by reducing the need to unroll loops in software.</p>
</div>
<div class="paragraph">
<p>Thus, implementations should not penalize code which uses high <code>LMUL</code> to reduce instruction fetch pressure.
@@ -838,8 +839,12 @@ <h4 id="_segmented_memory_instructions">1.3.5. Segmented Memory Instructions</h4
For instance, complex numbers and image pixel data are conventionally stored in memory as "arrays-of-structs".</p>
</div>
<div class="paragraph">
<p>These instructions can significantly reduce programmer burden, and thus performant RVV implementations should not impose an excess performance overhead from their execution.
Vector code which uses these memory operations to reduce dynamic instruction count should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions.</p>
<p>These instructions are critical for repacking data stored in memory into an element-wise format for vector processing.
Compared to other vector or SIMD ISAs, RVV provides few facilities for register-register repacking, instead relying on segmented memory instructions to perform "on-the-fly" repacking between memory and registers.</p>
</div>
<div class="paragraph">
<p>Given the importance of these instructions, performant RVV implementations should not impose an excess performance overhead from their execution.
Vector codes which use these memory operations should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions.</p>
</div>
</div>
</div>
@@ -955,7 +960,7 @@ <h4 id="_compared_to_general_purpose_simd_cores">1.4.2. Compared to General-purp
Furthermore, register renaming is required to enable execution past the WAW and WAR hazards in this example loop.</p>
</div>
<div class="paragraph">
<p>Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, or speculative execution, or register-renaming.
<p>Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, speculative execution, or register renaming.
Efficient scheduling of short-chime vector instructions with a limited capability for out-of-order execution is sufficient for maintaining datapath utilization on memory workloads, even with a minimal in-order scalar core.</p>
</div>
</div>
@@ -1015,7 +1020,7 @@ <h3 id="_organization">2.1. Organization</h3>
The load/store paths within the VLSU execute independently and communicate with the VU through load-response and store-data ports.</p>
</div>
<div class="paragraph">
<p>The <strong>Vector Datapath (VU)</strong> contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), and the vector register file (VRF), and the SIMD arithmetic functional units.
<p>The <strong>Vector Datapath (VU)</strong> contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), the vector register file (VRF), and the SIMD arithmetic functional units.
The functional units (VFUs) are arranged in execution unit clusters (VEUs), where each VEU is fed by one sequencer.
The sequencers schedule register read/write and issue operations into the VEUs, while interlocking on structural and data hazards.
The VU is organized as a unified structure with a SIMD datapath, instead of distributing the VRF and VEUs across vector lanes.
@@ -1048,6 +1053,13 @@ <h3 id="_key_ideas">2.2. Key Ideas</h3>
This approach can tolerate high memory latencies with minimal hardware cost.</p>
</div>
<div class="paragraph">
<p>Saturn is designed around two key parameters: <strong><code>VLEN</code> and <code>DLEN</code></strong>.
<code>VLEN</code> is the length in bits of each vector register, as defined in the architecture specification.
<code>DLEN</code> is a micro-architectural parameter that describes the datapath width of each of the SIMD-style datapaths in Saturn.
Specifically, the load pipe, store pipe, and SIMD arithmetic pipes are all designed to process <code>DLEN</code> bits per cycle, regardless of element width.
Future versions of Saturn may allow a narrower memory interface width (<code>MLEN</code>) than <code>DLEN</code>.</p>
</div>
<div class="paragraph">
<p>Saturn still supports a limited, but sufficient capability for <strong>out-of-order execution</strong>.
The load, store, and execute paths in the VU execute independently, dynamically stalling for structural and data hazards without requiring full in-order execution.
Allowing dynamic "slip" between these paths naturally implies out-of-order execution.
@@ -1366,7 +1378,7 @@ <h3 id="_memory_system">4.1. Memory System</h3>
<div class="paragraph">
<p>Saturn configurations with high <code>DLEN</code> would generally require higher memory bandwidth.
However, scaling up the system-level interconnect to meet Saturn&#8217;s bandwidth demands may be prohibitively costly.
Instead, the preferred approach for high-<code>DLEN</code> Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled-memory), which software should treat as a software-managed cache for vector accesses.
Instead, the preferred approach for high-<code>DLEN</code> Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled memory), which software should treat as a software-managed cache for vector accesses.
This TCM should be tile-local and globally addressable, but not necessarily cacheable.
<a href="#mem-tcm">Figure 17</a> depicts a Saturn configuration with a high-bandwidth TCM, but a reduced-bandwidth system interconnect.</p>
</div>
@@ -2002,9 +2014,6 @@ <h3 id="_optimizing_around_pipeline_latencies">6.4. Optimizing Around Pipeline L
<p>To saturate the FMA units in this scenario, either a longer <code>LMUL</code> should be used, or independent FMAs must be scheduled back-to-back.
Generally, performant code should use the highest <code>LMUL</code> possible that avoids vector register spilling.</p>
</div>
<div class="paragraph">
<p>Refer to <a href="#execute">Chapter 5</a> for details on each of the vector functional units and their default pipeline depths.</p>
</div>
</div>
<div class="sect2">
<h3 id="_optimizing_segmented_memory_accesses">6.5. Optimizing Segmented Memory Accesses</h3>
2 changes: 1 addition & 1 deletion docs/memory.adoc
@@ -63,7 +63,7 @@ image::diag/memtcm.png[TCM memory system,width=55%,align=center,title-align=cent

Saturn configurations with high `DLEN` would generally require higher memory bandwidth.
However, scaling up the system-level interconnect to meet Saturn's bandwidth demands may be prohibitively costly.
Instead, the preferred approach for high-`DLEN` Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled-memory), which software should treat as a software-managed cache for vector accesses.
Instead, the preferred approach for high-`DLEN` Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled memory), which software should treat as a software-managed cache for vector accesses.
This TCM should be tile-local and globally addressable, but not necessarily cacheable.
<<mem-tcm>> depicts a Saturn configuration with a high-bandwidth TCM, but a reduced-bandwidth system interconnect.

2 changes: 0 additions & 2 deletions docs/programming.adoc
@@ -67,8 +67,6 @@ This situation is rare due to the support for chaining, but might still appear i
To saturate the FMA units in this scenario, either a longer `LMUL` should be used, or independent FMAs must be scheduled back-to-back.
Generally, performant code should use the highest `LMUL` possible that avoids vector register spilling.
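
As a sketch of the second option (illustrative only; the register allocation is arbitrary), interleaving FMAs that write different accumulators keeps independent work flowing into the FMA pipeline:

[source,asm]
----
# Two independent accumulators hide the FMA pipeline latency; a single
# accumulator would serialize each vfmacc on the previous one's result.
vfmacc.vv v8, v0, v4    # acc0 += a0 * b0
vfmacc.vv v9, v1, v5    # acc1 += a1 * b1  (independent of the FMA above)
vfmacc.vv v8, v2, v6    # acc0 += a2 * b2  (depends only on the first FMA)
vfmacc.vv v9, v3, v7    # acc1 += a3 * b3
----

The partial sums in `v8` and `v9` are then combined once after the loop.
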

Refer to <<execute>> for details on each of the vector functional units and their default pipeline depths.


=== Optimizing Segmented Memory Accesses

8 changes: 7 additions & 1 deletion docs/system.adoc
@@ -22,7 +22,7 @@ The *Vector Load-Store Unit (VLSU)* performs vector address generation and memor
Inflight vector memory instructions are tracked in the vector load-instruction-queue (VLIQ) and store-instruction-queue (VSIQ).
The load/store paths within the VLSU execute independently and communicate with the VU through load-response and store-data ports.

The *Vector Datapath (VU)* contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), and the vector register file (VRF), and the SIMD arithmetic functional units.
The *Vector Datapath (VU)* contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), the vector register file (VRF), and the SIMD arithmetic functional units.
The functional units (VFUs) are arranged in execution unit clusters (VEUs), where each VEU is fed by one sequencer.
The sequencers schedule register read/write and issue operations into the VEUs, while interlocking on structural and data hazards.
The VU is organized as a unified structure with a SIMD datapath, instead of distributing the VRF and VEUs across vector lanes.
@@ -49,6 +49,12 @@ Shallow instruction queues in the VU act as "decoupling" queues, enabling the VL
Similarly, the VLSU's store path can run many cycles behind the VU through the decoupling enabled by the VSIQ.
This approach can tolerate high memory latencies with minimal hardware cost.

Saturn is designed around two key parameters: *`VLEN` and `DLEN`*.
`VLEN` is the length in bits of each vector register, as defined in the architecture specification.
`DLEN` is a micro-architectural parameter that describes the datapath width of each of the SIMD-style datapaths in Saturn.
Specifically, the load pipe, store pipe, and SIMD arithmetic pipes are all designed to process `DLEN` bits per cycle, regardless of element width.
Future versions of Saturn may allow a narrower memory interface width (`MLEN`) than `DLEN`.
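
As a rough worked example (the numbers are illustrative, not a statement about any particular Saturn configuration): with `VLEN=256` and `DLEN=128`, a vector instruction at `LMUL=1` carries `VLEN*LMUL = 256` bits of work per pipe and so occupies that pipe for `VLEN*LMUL/DLEN = 2` cycles, while the same instruction at `LMUL=8` occupies it for 16 cycles.
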

Saturn still supports a limited, but sufficient capability for *out-of-order execution*.
The load, store, and execute paths in the VU execute independently, dynamically stalling for structural and data hazards without requiring full in-order execution.
Allowing dynamic "slip" between these paths naturally implies out-of-order execution.
6 changes: 6 additions & 0 deletions docs/tex/.gitignore
@@ -0,0 +1,6 @@
*.aux
*.bbl
*.log
*.pdf
*.blg
*.toc