
Commit

Update docs
jerryz123 committed Nov 18, 2024
1 parent d4135bc commit 2f9f011
Showing 10 changed files with 62 additions and 29 deletions.
15 changes: 10 additions & 5 deletions docs/background.adoc
@@ -109,7 +109,7 @@ Furthermore, DSP applications often require more regularly behaved memory system
Applications and microarchitectures which prefer statically predictable memory systems are especially well-suited for VLIW ISAs.

However, VLIW-based ISAs are notoriously difficult to program compared to general-purpose ISAs or vector ISAs.
Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelned loops.
Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelined loops.
Nonetheless, specialized VLIW ISAs provide a microarchitecturally simple and efficient programmer-visible mechanism for maintaining high instruction throughput into SIMD functional units.

Cadence, Synopsys, CEVA, and Qualcomm all ship commercial VLIW DSPs with SIMD extensions.
@@ -185,6 +185,8 @@ The `LMUL` (length multiplier) register grouping field of `vtype` enables groupi
In addition to enabling mixed-precision operations, this feature allows kernels that do not induce vector register pressure to access an effectively longer hardware vector length.
Generally, performance programmers for RISC-V will use this feature to reduce the dynamic instruction count of their loops and potentially improve the utilization of hardware compute resources.
For example, vector `memcpy` induces no register pressure and can trivially set a high `LMUL` to reduce dynamic instruction count.
Since higher `LMUL` settings unroll instructions in hardware, `LMUL` also reduces static code size by reducing the need to unroll loops in software.


Thus, implementations should not penalize code which uses high `LMUL` to reduce instruction fetch pressure.
The general intuition around vector code should be to use the highest `LMUL` setting while avoiding register spills.
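
For illustration, a byte-wise `memcpy` at `LMUL=8` can look like the following sketch (not taken from the Saturn sources; the register choices are arbitrary):

[source,asm]
----
# a0 = dst, a1 = src, a2 = remaining bytes
copy_loop:
    vsetvli t0, a2, e8, m8, ta, ma   # vl = bytes handled this iteration (up to 8 registers' worth)
    vle8.v  v0, (a1)                 # load vl bytes into the v0-v7 register group
    vse8.v  v0, (a0)                 # store them to the destination
    add     a1, a1, t0
    add     a0, a0, t0
    sub     a2, a2, t0
    bnez    a2, copy_loop
----

Each iteration moves up to `8*VLEN` bits, cutting the dynamic instruction count roughly eightfold relative to `LMUL=1` without the static code growth of an eight-way software-unrolled loop.
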
@@ -200,10 +202,13 @@ Alternatively, the addition of queueing resources to reduce this pressure would
Segmented memory instructions enable a "transpose" of an "array-of-structs" data representation in memory into a "struct-of-arrays" in consecutive vector registers.
Such instructions, while very complex behaviorally, are fundamental to many algorithms and datatypes.
For instance, complex numbers and image pixel data are conventionally stored in memory as "arrays-of-structs".
//Segmented memory access instructions can also be used to perform on-the-fly reformatting into vector registers.

These instructions can significantly reduce programmer burden, and thus performant RVV implementations should not impose an excess performance overhead from their execution.
Vector code which uses these memory operations to reduce dynamic instruction count should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions.

These instructions are critical for repacking data stored in memory into an element-wise format for vector processing.
Compared to other vector or SIMD ISAs, RVV provides few facilities for register-register repacking, instead relying on segmented memory instructions to perform "on-the-fly" repacking between memory and registers.

Given the importance of these instructions, performant RVV implementations should not impose an excess performance overhead from their execution.
Vector codes which use these memory operations should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions.
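
As an illustrative sketch (not drawn from the Saturn sources), a two-field segmented load splits an interleaved single-precision complex array into separate real and imaginary vectors:

[source,asm]
----
# a0 -> interleaved complex data in memory: re0, im0, re1, im1, ...
# a1  = number of complex elements to process
vsetvli     t0, a1, e32, m1, ta, ma
vlseg2e32.v v0, (a0)        # field 0 (re) -> v0, field 1 (im) -> v1
# v0 = re0..re[vl-1], v1 = im0..im[vl-1], ready for element-wise arithmetic
----

Performing the same repacking explicitly would require strided loads, or unit-stride loads followed by register-register shuffles, typically at a higher dynamic instruction count or lower memory throughput.
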

=== Short-Vector Execution

@@ -291,7 +296,7 @@ image::diag/ooo-simd.png[OOO SIMD Pipeline,width=40%,align=center,title-align=ce
Notably, as these machines are typically designed with single-chime instruction execution, high instruction throughput is necessary to maintain high utilization of multiple datapaths.
Furthermore, register renaming is required to enable execution past the WAW and WAR hazards in this example loop.

Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, or speculative execution, or register-renaming.
Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, speculative execution, or register renaming.
Efficient scheduling of short-chime vector instructions with a limited capability for out-of-order execution is sufficient for maintaining datapath utilization on memory workloads, even with a minimal in-order scalar core.
//Efficient and precise vector operation scheduling, rather than high instruction throughput, is key to maintaining SIMD datapath utilization.

33 changes: 21 additions & 12 deletions docs/index.html
@@ -4,7 +4,7 @@
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="generator" content="Asciidoctor 2.0.23">
<meta name="generator" content="Asciidoctor 2.0.17">
<meta name="description" content="Micro-architecture of the Saturn Vector Unit">
<meta name="author" content="Authors: Jerry Zhao, Daniel Grubb, Miles Rusch, Tianrui Wei, Kevin Anderson, Borivoje Nikolic, Krste Asanovic">
<title>The Saturn Microarchitecture Manual</title>
@@ -451,7 +451,7 @@ <h1>The Saturn Microarchitecture Manual</h1>
<div class="details">
<span id="author" class="author">Authors: Jerry Zhao, Daniel Grubb, Miles Rusch, Tianrui Wei, Kevin Anderson, Borivoje Nikolic, Krste Asanovic</span><br>
<span id="revnumber">version v1.0.0,</span>
<span id="revdate">2024-11-12</span>
<span id="revdate">2024-11-17</span>
<br><span id="revremark">Release</span>
</div>
<div id="toc" class="toc2">
@@ -730,7 +730,7 @@ <h4 id="_vliw_isas_with_simd">1.2.4. VLIW ISAs with SIMD</h4>
</div>
<div class="paragraph">
<p>However, VLIW-based ISAs are notoriously difficult to program compared to general-purpose ISAs or vector ISAs.
Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelned loops.
Performant VLIW code can also suffer from issues such as large static code size due to the need for extensive static scheduling and software-pipelined loops.
Nonetheless, specialized VLIW ISAs provide a microarchitecturally simple and efficient programmer-visible mechanism for maintaining high instruction throughput into SIMD functional units.</p>
</div>
<div class="paragraph">
@@ -817,7 +817,8 @@ <h4 id="_vector_register_grouping">1.3.4. Vector Register Grouping</h4>
<p>The <code>LMUL</code> (length multiplier) register grouping field of <code>vtype</code> enables grouping consecutive vector registers into a single longer vector register.
In addition to enabling mixed-precision operations, this feature allows kernels that do not induce vector register pressure to access an effectively longer hardware vector length.
Generally, performance programmers for RISC-V will use this feature to reduce the dynamic instruction count of their loops and potentially improve the utilization of hardware compute resources.
For example, vector <code>memcpy</code> induces no register pressure and can trivially set a high <code>LMUL</code> to reduce dynamic instruction count.</p>
For example, vector <code>memcpy</code> induces no register pressure and can trivially set a high <code>LMUL</code> to reduce dynamic instruction count.
Since higher <code>LMUL</code> settings unroll instructions in hardware, <code>LMUL</code> also reduces static code size by reducing the need to unroll loops in software.</p>
</div>
<div class="paragraph">
<p>Thus, implementations should not penalize code which uses high <code>LMUL</code> to reduce instruction fetch pressure.
@@ -838,8 +839,12 @@ <h4 id="_segmented_memory_instructions">1.3.5. Segmented Memory Instructions</h4
For instance, complex numbers and image pixel data are conventionally stored in memory as "arrays-of-structs".</p>
</div>
<div class="paragraph">
<p>These instructions can significantly reduce programmer burden, and thus performant RVV implementations should not impose an excess performance overhead from their execution.
Vector code which uses these memory operations to reduce dynamic instruction count should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions.</p>
<p>These instructions are critical for repacking data stored in memory into an element-wise format for vector processing.
Compared to other vector or SIMD ISAs, RVV provides few facilities for register-register repacking, instead relying on segmented memory instructions to perform "on-the-fly" repacking between memory and registers.</p>
</div>
<div class="paragraph">
<p>Given the importance of these instructions, performant RVV implementations should not impose an excess performance overhead from their execution.
Vector codes which use these memory operations should perform no worse than the equivalent code which explicitly transforms the data over many vector instructions.</p>
</div>
</div>
</div>
@@ -955,7 +960,7 @@ <h4 id="_compared_to_general_purpose_simd_cores">1.4.2. Compared to General-purp
Furthermore, register renaming is required to enable execution past the WAW and WAR hazards in this example loop.</p>
</div>
<div class="paragraph">
<p>Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, or speculative execution, or register-renaming.
<p>Unlike these cores, a Saturn-like short-vector design does not rely on costly features like high-throughput instruction fetch, out-of-order execution, speculative execution, or register renaming.
Efficient scheduling of short-chime vector instructions with a limited capability for out-of-order execution is sufficient for maintaining datapath utilization on memory workloads, even with a minimal in-order scalar core.</p>
</div>
</div>
@@ -1015,7 +1020,7 @@ <h3 id="_organization">2.1. Organization</h3>
The load/store paths within the VLSU execute independently and communicate with the VU through load-response and store-data ports.</p>
</div>
<div class="paragraph">
<p>The <strong>Vector Datapath (VU)</strong> contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), and the vector register file (VRF), and the SIMD arithmetic functional units.
<p>The <strong>Vector Datapath (VU)</strong> contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), the vector register file (VRF), and the SIMD arithmetic functional units.
The functional units (VFUs) are arranged in execution unit clusters (VEUs), where each VEU is fed by one sequencer.
The sequencers schedule register read/write and issue operations into the VEUs, while interlocking on structural and data hazards.
The VU is organized as a unified structure with a SIMD datapath, instead of distributing the VRF and VEUs across vector lanes.
@@ -1048,6 +1053,13 @@ <h3 id="_key_ideas">2.2. Key Ideas</h3>
This approach can tolerate high memory latencies with minimal hardware cost.</p>
</div>
<div class="paragraph">
<p>Saturn is designed around two key parameters: <strong><code>VLEN</code> and <code>DLEN</code></strong>.
<code>VLEN</code> is the length in bits of each vector register, as defined in the architecture specification.
<code>DLEN</code> is a micro-architectural parameter that describes the datapath width of each of the SIMD-style datapaths in Saturn.
Specifically, the load pipe, store pipe, and SIMD arithmetic pipes are all designed to process <code>DLEN</code> bits per cycle, regardless of element width.
Future versions of Saturn may allow a narrower memory interface width (<code>MLEN</code>) than <code>DLEN</code>.</p>
</div>
<div class="paragraph">
<p>Saturn still supports a limited, but sufficient capability for <strong>out-of-order execution</strong>.
The load, store, and execute paths in the VU execute independently, dynamically stalling for structural and data hazards without requiring full in-order execution.
Allowing dynamic "slip" between these paths naturally implies out-of-order execution.
@@ -1366,7 +1378,7 @@ <h3 id="_memory_system">4.1. Memory System</h3>
<div class="paragraph">
<p>Saturn configurations with high <code>DLEN</code> would generally require higher memory bandwidth.
However, scaling up the system-level interconnect to meet Saturn&#8217;s bandwidth demands may be prohibitively costly.
Instead, the preferred approach for high-<code>DLEN</code> Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled-memory), which software should treat as a software-managed cache for vector accesses.
Instead, the preferred approach for high-<code>DLEN</code> Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled memory), which software should treat as a software-managed cache for vector accesses.
This TCM should be tile-local and globally addressable, but not necessarily cacheable.
<a href="#mem-tcm">Figure 17</a> depicts a Saturn configuration with a high-bandwidth TCM, but a reduced-bandwidth system interconnect.</p>
</div>
@@ -2002,9 +2014,6 @@ <h3 id="_optimizing_around_pipeline_latencies">6.4. Optimizing Around Pipeline L
<p>To saturate the FMA units in this scenario, either a longer <code>LMUL</code> should be used, or independent FMAs must be scheduled back-to-back.
Generally, performant code should use the highest <code>LMUL</code> possible that avoids vector register spilling.</p>
</div>
<div class="paragraph">
<p>Refer to <a href="#execute">Chapter 5</a> for details on each of the vector functional units and their default pipeline depths.</p>
</div>
</div>
<div class="sect2">
<h3 id="_optimizing_segmented_memory_accesses">6.5. Optimizing Segmented Memory Accesses</h3>
2 changes: 1 addition & 1 deletion docs/memory.adoc
@@ -63,7 +63,7 @@ image::diag/memtcm.png[TCM memory system,width=55%,align=center,title-align=cent

Saturn configurations with high `DLEN` would generally require higher memory bandwidth.
However, scaling up the system-level interconnect to meet Saturn's bandwidth demands may be prohibitively costly.
Instead, the preferred approach for high-`DLEN` Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled-memory), which software should treat as a software-managed cache for vector accesses.
Instead, the preferred approach for high-`DLEN` Saturn configs is to integrate a high-bandwidth local TCM (tightly-coupled memory), which software should treat as a software-managed cache for vector accesses.
This TCM should be tile-local and globally addressable, but not necessarily cacheable.
<<mem-tcm>> depicts a Saturn configuration with a high-bandwidth TCM, but a reduced-bandwidth system interconnect.

2 changes: 0 additions & 2 deletions docs/programming.adoc
@@ -67,8 +67,6 @@ This situation is rare due to the support for chaining, but might still appear i
To saturate the FMA units in this scenario, either a longer `LMUL` should be used, or independent FMAs must be scheduled back-to-back.
Generally, performant code should use the highest `LMUL` possible that avoids vector register spilling.
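
As a sketch of the second option (illustrative only; the register allocation is arbitrary), interleaving FMAs that write different accumulators keeps independent work flowing into the FMA pipeline:

[source,asm]
----
# Two independent accumulators hide the FMA pipeline latency; a single
# accumulator would serialize each vfmacc on the previous one's result.
vfmacc.vv v8, v0, v4    # acc0 += a0 * b0
vfmacc.vv v9, v1, v5    # acc1 += a1 * b1  (independent of the FMA above)
vfmacc.vv v8, v2, v6    # acc0 += a2 * b2  (depends only on the first FMA)
vfmacc.vv v9, v3, v7    # acc1 += a3 * b3
----

The partial sums in `v8` and `v9` are then combined once after the loop.
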

Refer to <<execute>> for details on each of the vector functional units and their default pipeline depths.


=== Optimizing Segmented Memory Accesses

8 changes: 7 additions & 1 deletion docs/system.adoc
@@ -22,7 +22,7 @@ The *Vector Load-Store Unit (VLSU)* performs vector address generation and memor
Inflight vector memory instructions are tracked in the vector load-instruction-queue (VLIQ) and store-instruction-queue (VSIQ).
The load/store paths within the VLSU execute independently and communicate with the VU through load-response and store-data ports.

The *Vector Datapath (VU)* contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), and the vector register file (VRF), and the SIMD arithmetic functional units.
The *Vector Datapath (VU)* contains instruction issue queues (VIQs), vector sequencers (VXS/VLS/VSS), the vector register file (VRF), and the SIMD arithmetic functional units.
The functional units (VFUs) are arranged in execution unit clusters (VEUs), where each VEU is fed by one sequencer.
The sequencers schedule register read/write and issue operations into the VEUs, while interlocking on structural and data hazards.
The VU is organized as a unified structure with a SIMD datapath, instead of distributing the VRF and VEUs across vector lanes.
@@ -49,6 +49,12 @@ Shallow instruction queues in the VU act as "decoupling" queues, enabling the VL
Similarly, the VLSU's store path can run many cycles behind the VU through the decoupling enabled by the VSIQ.
This approach can tolerate high memory latencies with minimal hardware cost.

Saturn is designed around two key parameters: *`VLEN` and `DLEN`*.
`VLEN` is the length in bits of each vector register, as defined in the architecture specification.
`DLEN` is a micro-architectural parameter that describes the datapath width of each of the SIMD-style datapaths in Saturn.
Specifically, the load pipe, store pipe, and SIMD arithmetic pipes are all designed to process `DLEN` bits per cycle, regardless of element width.
Future versions of Saturn may allow a narrower memory interface width (`MLEN`) than `DLEN`.
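
As a rough worked example (the numbers are illustrative, not a statement about any particular Saturn configuration): with `VLEN=256` and `DLEN=128`, a vector instruction at `LMUL=1` carries `VLEN*LMUL = 256` bits of work per pipe and so occupies that pipe for `VLEN*LMUL/DLEN = 2` cycles, while the same instruction at `LMUL=8` occupies it for 16 cycles.
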

Saturn still supports a limited, but sufficient capability for *out-of-order execution*.
The load, store, and execute paths in the VU execute independently, dynamically stalling for structural and data hazards without requiring full in-order execution.
Allowing dynamic "slip" between these paths naturally implies out-of-order execution.
6 changes: 6 additions & 0 deletions docs/tex/.gitignore
@@ -0,0 +1,6 @@
*.aux
*.bbl
*.log
*.pdf
*.blg
*.toc