- 1. ABSTRACT
- 2. Please install the following.
- 3. Introduction
- 3.1. Why Should One Design and Create Hardware?
- 3.2. Why Shouldn’t One Design and Create Hardware?
- 3.3. The flows/processes for developing hardware
- 3.4. What is an FPGA?
- 3.5. How/Where can I get an FPGA and FPGA/ASIC tools?
- 3.6. Plan for this tutorial
- 3.7. A note on the bsc tool and Bluespec BH and BSV code syntax
- 4. Limbering up: “Hello World!”
- 4.1. The code
- 4.1.1. STRUCTURE: executing a BH/BSV program describes a piece of hardware
- 4.1.2. BEHAVIOR: executing the hardware
- 4.1.3. Let’s do it! Bluesim compile, link and simulate
- 4.1.4. A (fanciful) visualization of Verilog hardware described by the code
- 4.1.5. Let’s do it! Verilog compile, link and simulate
- 4.2. Stepping back for some perspective: Haskell (or PLs in general) vs. BH/BSV
- 4.3. Hello World! Version (b): Packages and Modules
- 4.4. Hello World! Version (c): State and Time
- 4.4.1. Action, ActionValue and Value interface methods
- 4.4.2. Algebraic types
- 4.4.3. Registers: Instantiation, initializing, and operations
- 4.4.4. Local values, bit selection and combinational circuits
- 4.4.5. Interface method conditions and actions
- 4.4.6. Rule conditions
- 4.4.7. Let’s do it! Compile, link and simulate the example:
- 4.4.8. Visualizing the hardware for
mkTop
: rule-level view - 4.4.9. Visualizing the hardware for
mkTop
: Verilog RTL view - 4.4.10. Visualizing the hardware for
mkDeepThought
: rule-level view - 4.4.11. Visualizing the hardware for
mkDeepThought
: Verilog RTL view - 4.4.12. Waveforms: VCD files and Waveform viewers
- 4.5. Simulation and Synthesis semantics are the same in BH/BSV
- 4.6. HLHDLs vs. HLS: precise specification of micro-architecture
- 4.7. Numeric types and kinds in BH/BSV
- 5. Introducing Concurrency with Bubblesort as Example
- 5.1. Bubblesort Version (a): Sequential (a finite state machine)
- 5.1.1. mkBubblesort: interface
- 5.1.2. mkBubblesort: state elements
- 5.1.3. mkBubblesort: input
- 5.1.4. mkBubblesort: sorting
- 5.1.5. mkBubblesort: termination
- 5.1.6. mkBubblesort: output
- 5.1.7. mkTop, the driver “testbench” (in file Top.bs)
- 5.1.8. Let’s do it! Compile, link and simulate the example
- 5.1.9. Waveforms ($ gtkwave dump.vcd)
- 5.2. Bubblesort Version (b): with concurrency
- 5.3. Bubblesort Versions (c), (d) and (e): generalizations
- 5.4. An Aside on Style
- 6. Pipelining, memory access and accelerators (example: “Mergesort”)
- 6.1. Where do Accelerators Fit into a Computer System?
- 6.2. Our example: a memory-to-memory “mergesort” accelerator
- 6.3. Mergesort: source codes
- 6.4. Mergesort algorithm
- 6.5. Mergesort: visualizing the merge step
- 6.6. The Merge Engine (sorts two input segments into 1 output segment)
- 6.7. The Merge Engine: interface
- 6.8. The Merge Engine: Talking to the AXI4 interconnect (bus)
- 6.9. The Merge Engine: Flow
- 6.10. The Merge Engine: Avoiding Deadlock in the Pipeline
- 6.11. The Mergesort Module (mkAXI4_Accel in file AXI4_Accel.bs)
- 6.12. The Mergesort Module Code (mkAXI4_Accel in file AXI4_Accel.bs)
- 6.13. Let’s do it! Compile, link the example (but don’t try simulating yet)
- 6.14. For simulation: RISC-V machine-code for Flute execution
- 6.15. Let’s do it! Run the “Hello, World!” RISC-V binary.
- 6.16. C code to test the mergesort accelerator
- 6.17. Let’s do it! Run the mergesort RISC-V binary.
- 6.18. Mergesort: Possible Improvements (suggested exercise!)
- 7. Clocks, Concurrency Semantics, Performance, and Concurrent Registers
- 7.1. Concurrency and Parallelism in bsc-generated Hardware for Rules
- 7.2. Parallelism (Simultaneity) of Actions in a Rule
- 7.3. Concurrency of Rules Within a Clock
- 7.4. Method-ordering Constraints (Inducing “Conflicts”)
- 7.5. Example of Method-ordering Constraints on Primitive Registers
- 7.6. From Method Ordering to Rule Conflicts
- 7.7. For Conflicts, the Earlier Rule Disables the Later Rule
- 7.8. Summary of Rule Semantics
- 7.9. Concurrent Registers (CRegs): Motivation
- 7.10. Concurrent Registers (CRegs)
- 7.11. Concurrent Registers (CRegs): Hardware Implementation
- 8. A Word About Multiple Clocks (and Resets) and Clock Domains (and Reset Domains)
- 9. Quality metrics for hardware
- 10. Endnote: Comparison with other approaches to hardware-design
- 11. Bibliography
Tutorial at ICFP 2020 (25th ACM SIGPLAN International Conference on Functional Programming)
Online, August 23, 2020, 14:00h-17:00h EDT
Tutorial materials at: https://github.com/rsnikhil/ICFP2020_Bluespec_Tutorial
Note
|
Viewing this document
The source for this document is ICFP2020_Bluespec_Tutorial.adoc, and there is also an HTML version: ICFP2020_Bluespec_Tutorial_BH.html. The latter is produced from the former using asciidoctor, which is
available in most package installers (e.g., apt-get install asciidoctor).
GitHub and some browsers with asciidoc plug-ins will automatically
process and display the .adoc version when you click to open.
Otherwise, please have your browser display the .html version.
|
Have you ever considered designing and implementing a piece of digital hardware, such as an accelerator on an attached FPGA for a complex computation (Machine Learning, Computer Vision, …), or just for fun, but were afraid that hardware design was too far from your wheelhouse?
In this tutorial we hope to leverage your knowledge of Haskell to
demystify this, using a lecture + demonstration format which you can
follow along on your laptop during the tutorial. We will use Bluespec
BH, which is a high-level hardware design language using Haskell
syntax and types. BH and its sibling BSV are mature,
industrial-strength hardware design languages, with over 20 years of
development; they have been used to design complex components in
commercial shipping chips. The bsc
tool for compiling BH and BSV to
Verilog was open-sourced in January 2020.
For background preparation we only assume you know Haskell, and perhaps nothing about hardware design.
We will start with a simple “Hello World!” example, but rapidly climb through the gears to end with a complete, Linux-capable RISC-V CPU controlling a pipelined memory-to-memory array-sorting accelerator (all the materials for this are open-source).
At the end of the tutorial, we hope you will:
-
Understand how a Haskell-based language (Haskell syntax, Haskell types) can be used to describe complex hardware system STRUCTURE, and how Guarded Atomic Actions (rewrite rules) can be used to describe complex hardware system BEHAVIOR in a composable way (and also enabling formal verification).
-
Feel: “I can do this!”, i.e., that you can read and modify the open-source designs shown in the tutorial, or even create your own hardware designs.
To study the code and run all the examples in this tutorial, please install the following (everything here is free and open-source):
-
The
bsc
compiler from https://github.com/B-Lang-org/bsc. -
The open-source Verilog simulator
verilator
from https://www.veripool.org
-
Alternatively, you can install the open-source Verilog simulator
iverilog
, available standardly in common package installers. IVerilog is much slower; it is fine for small examples, but simulation of large designs is not so pleasant.
-
-
The open-source waveform viewer
gtkwave
(available standardly in common package installers) -
Materials for this tutorial, from https://github.com/rsnikhil/ICFP2020_Bluespec_Tutorial.
-
The Bluespec Flute open-source RISC-V CPU and WindSoC SoC, from https://github.com/bluespec/Flute.
Ubuntu and Debian Linux are the preferred platforms, but people also use MacOS.
Any computation can be coded in a programming language and executed on a general-purpose computing platform (from small single-chip embedded computers to rack-mounted servers). Indeed, much of the world’s hardware-design activity concerns general-purpose computing platforms.
But any computation can also be implemented directly in hardware, typically with orders of magnitude (often several) advantage in speed and energy efficiency. Why?
-
Removal of one or more layers of interpretation (starting with the fetch-execute loop of a general-purpose CPU, and possibly more layers above that)
-
Exploit massive, fine-grain (temporal and spatial) parallelism
-
Exploit massive memory bandwidth due to multiple, distributed, custom memories
Why Should One Design and Create Hardware? (Contd.)
Your servers, laptops and mobile devices already exploit this. Examples:
-
Virtual memory address translation and protection
-
Floating point units
-
Graphics
-
Video encoders/decoders
-
Network protocol offload engines (TCP, UDP, …)
-
Signal processing for wireless communication, GPS
Each of these computations can of course be coded in software, but that would not meet speed and/or energy targets.
More recently, as it becomes increasingly difficult to maintain Moore’s Law and Dennard scaling in silicon, we are seeing more and more “accelerators”:
-
Neural nets, Computer Vision, Crypto, Radar/Lidar, …
It takes a lot of time and effort, and possibly expense; and they’re a lot less flexible:
-
ASIC1 designs take months-to-years to develop, with multi-person teams and millions of dollars.
-
Only high volume and frequent use justify this.
-
-
FPGA1 designs can take hours-to-months to develop.
-
(Mostly because their tools are not as highly developed as software compilers and debuggers.)
-
-
Hardware designs are typically single-purpose; they’re wasted if not in (frequent) use!
1 We’ll demystify “ASIC” and “FPGA” shortly.
Imagine an FPGA as a pre-built set of components laid out like “Manhattan city blocks”.
-
Each block has a collection of AND gates, OR gates, multiplexers (“muxes”), …1
-
The “streets” and “avenues” are laid out with wires (“utility cables and pipes”)
-
By making certain connections (red dots in figure), we can connect an output from one city block to an input in another city block.
-
By choosing these connections, we can make the FPGA look like any circuit we want
FPGA tools produce a “bitfile”; each bit is a red dot, i.e., it
controls whether a connection is made or not.
1 But don’t worry: we are not going to descend to this “machine language” level; bitfiles are produced for us by FPGA tools (compilers)!
2 More generally, rather than gates, FPGAs provide “LUTs” (Lookup-Tables), which can themselves be programmed to act like specific gates or combinational circuits. Other resources include registers, SRAMs (memories), DSPs (digital signal processors), and more.
-
You can buy off-the-shelf “development boards” from FPGA vendors, which are complete circuit boards containing an FPGA and lots of surrounding circuitry (DRAM, connectivity such as USB cables, PCIe connectors, Ethernet, …)
-
Boards range from a few tens of dollars to thousands of dollars, depending on capacity of the FPGA, speed of the FPGA, resources on the FPGA, resources on the board, etc.
-
-
Nowadays cloud-computing providers (e.g., Amazon AWS, Google) allow you to select “FPGA instances” in the cloud, so that you are connected to servers that have attached FPGA boards (typically over PCIe connectors).
-
Sadly, FPGA and ASIC tools are still mostly vendor-proprietary
-
FPGA tools are often available for free for evaluation, and with FPGA purchases, and free on cloud platforms.
-
ASIC tools and libraries are expensive, but can be free/cheap for academia.
-
Symbiotic EDA is one company that is trying to change this (free and open-source tools)
-
They also provide tools and services for formal verification of hardware systems.
-
-
-
We will start with a simple "Hello World!" example. Like all Hello-World examples, the purpose is actually to familiarize you with the “look and feel” of the language, tools and flow.
-
We’ll do a hardware concurrent “Bubblesort” example. Bubblesort is not a good sorting algorithm, but the purpose here is to familiarize you with parallelism and concurrency in hardware.
-
We’ll do a hardware, concurrent, highly pipelined, memory-to-memory “Mergesort” example. Being memory-to-memory, it is capable of sorting large arrays. We’ll connect this as an “accelerator” in a small, but complete, Linux-capable, RISC-V CPU-based, “SoC” (System on a Chip). The CPU is Flute and the SoC is WindSoC.
All the tools and materials in this tutorial are free and open-source.
The central tool we use is bsc, a compiler that converts our source code into Verilog RTL. It was released as a free and open-source tool in January 2020.
The bsc compiler supports two equivalent and interchangeable source-code syntaxes:
- BH
-
Bluespec Haskell, also known as Bluespec Classic
- BSV
-
Bluespec SystemVerilog
In this tutorial we use BH, since the ICFP audience is familiar with Haskell. All the code we show can also be done in BSV.
Some of the code we use (in particular the RISC-V CPU and components) was written in BSV.
Please visit this directory
$ cd Examples/Eg020a_HelloWorld
And bring up the file src/Top.bs
in an editor, preferably in a
separate editor window. Here’s the source:
mkTop :: Module Empty
mkTop =
module
rules
"rl_print_answer": when True ==> do
$display "\n\n***** Deep Thought says: Hello, World! *****"
$display " And the answer is: %0d (or, in hex: 0x%0h)\n" 42 42
$finish
-
We define a single top-level identifier mkTop, giving it a type declaration (Module Empty) and a value binding. A value of type Module t is a generator of hardware objects (it can be “instantiated” multiple times). -
The keyword module introduces a monadic expression, similar to Haskell’s do. The monad “collects” various things like sub-modules, rules and interfaces.
-
In this case it contains a single rules expression.
-
-
The rules keyword introduces one or more rules (here, just one).
-
A rule has a condition expression (here, True), of type Bool
-
A rule has an Action body (here, two $display actions and a $finish action)
-
-
A rule is like an infinite process:
-
It can “execute” (or “fire”) whenever its Bool condition is true.1
-
A rule body is a recursive composition of Actions (and the rule body itself has type Action).
-
When a rule fires, all its Actions are performed “simultaneously” and “instantaneously” (there is no temporal ordering amongst the Actions in a rule).
-
In this example, as soon as we begin execution, the rule fires (because its condition is True), and it performs the three primitive Actions in its body, “simultaneously”.2
1 This is a first approximation; qualifications to come later.
2 These primitive actions have intuitive interpretations when we run
a simulation: $display
will output the argument string on the
terminal, and $finish
will terminate the simulation. What can they
mean in real hardware? (We’ll have a fanciful interpretation in a later
section.)
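The firing discipline above can be sketched in plain Haskell. This is our own simplification for intuition (one enabled rule fires per step, chosen by textual order; `Rule`, `step`, `run` and the counter example are all invented here) and not how bsc actually schedules rules into clocked hardware:

```haskell
import Data.List (find)

-- A rule pairs a guard with a state transformer; when the guard holds,
-- the whole body is applied as one atomic step (no internal ordering).
data Rule s = Rule { ruleName :: String
                   , ruleCond :: s -> Bool
                   , ruleBody :: s -> s }

-- One step: fire the first enabled rule, if any.
step :: [Rule s] -> s -> Maybe s
step rs s = fmap (\r -> ruleBody r s) (find (\r -> ruleCond r s) rs)

-- Run until no rule can fire.
run :: [Rule s] -> s -> s
run rs s = maybe s (run rs) (step rs s)

-- A rule that fires while the counter is below 3.
counter :: [Rule Int]
counter = [Rule "rl_count" (< 3) (+ 1)]

main :: IO ()
main = print (run counter 0)  -- prints 3, then no rule is enabled
```

Note how a rule whose condition is false simply never fires: the `rl_print_answer` rule above behaves like `counter` with an always-true guard ending in `$finish`.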
$pwd
... ICFP2020_Bluespec_Tutorial/Examples/Eg020a_HelloWorld
$ make b_compile
mkdir -p build_b_sim
Compiling for Bluesim ...
bsc -u -sim -simdir build_b_sim -bdir build_b_sim -info-dir build_b_sim -keep-fires -aggressive-conditions -no-warn-action-shadowing -check-assert -cpp +RTS -K128M -RTS -show-range-conflict -p src:../Resources:+ -g mkTop src/Top.bs
checking package dependencies
compiling src/Top.bs
code generation for mkTop starts
Elaborated module file created: build_b_sim/mkTop.ba
All packages are up to date.
Compiling for Bluesim finished
bsc
is the Bluespec compiler for BH and BSV. Its command-line flags can be listed with bsc -help
; they are described in detail in [UserGuide]. Here,
-
-sim
: compile for Bluesim (native-compiled simulation) -
src/Top.bs
: Top-level file (which, in general, may contain multiple modules) -
-g
: Name of top-level module -
-p
: Search path for source files. + is shorthand for standard bsc libraries. -
-u
: Compile imported packages as necessary (this example has none)
On compiling, bsc
produces various intermediate files with
extensions like .bo
, .ba
, .sched
, … which can be placed into
separate directories to avoid clutter.
$ make b_link
Linking for Bluesim ...
bsc -e mkTop -sim -o mkTop_b_sim -simdir build_b_sim -bdir build_b_sim -info-dir build_b_sim -keep-fires -p src:../Resources:+
Bluesim object created: build_b_sim/mkTop.{h,o}
Bluesim object created: build_b_sim/model_mkTop.{h,o}
Simulation shared library created: mkTop_b_sim.so
Simulation executable created: mkTop_b_sim
Linking for Bluesim finished
Here,
-
-e
: name of top-level module to be linked (bsc
will chase modules that it depends on) -
-o
: name of executable produced -
-sim
: link for Bluesim
Produces mkTop_b_sim.so
, a standard ELF shared object, and mkTop_b_sim
, a short shell script that loads and runs it.
(Relax, don’t worry, this Verilog detail is just to develop intuition about what’s under the covers, we won’t be coding or thinking at this level.)
-
Clocked digital circuit modules usually have “reset” (
RST_N
) and “clock” (CLK
) input signals. These are digital signals, i.e., their analogue voltages sit at one of two fixed values, variously called “low” and “high”, or “0” and “1”, or “false” and “true”, or “unasserted” and “asserted”.-
Reset signals are often inverted, i.e., asserted when low, hence the name
RST_N
(for negative or negated).
-
-
A clock oscillates between 0 and 1; the transitions are often called “posedge” and “negedge”.
-
The reset signal is usually asserted only once, at the start of execution, allowing all the circuits to enter into a known initial state. Subsequently, the circuit’s behavior is determined by the actions at the clock edges.
-
All actions are conceptually performed at a clock edge. Inputs to an action are values on wires just before the clock edge. The action may change the state of various entities, which are visible just after the clock edge.
For our example,
-
Imagine
$display
hardware modules with displays that flash a message when their ENA (enable) input signal is asserted. -
Imagine a
$finish
hardware module that switches off the system when its ENA
(enable) input signal is asserted.
$ make v_compile
mkdir -p build_v
mkdir -p verilog_RTL
Compiling for Verilog ...
bsc -u -verilog -vdir verilog_RTL -bdir build_v -info-dir build_v -keep-fires -aggressive-conditions -no-warn-action-shadowing -check-assert -cpp +RTS -K128M -RTS -show-range-conflict -p src:../Resources:+ -g mkTop src/Top.bs
checking package dependencies
compiling src/Top.bs
code generation for mkTop starts
Verilog file created: verilog_RTL/mkTop.v
All packages are up to date.
Compiling for Verilog finished
-
-verilog
instructsbsc
to generate Verilog, instead of compiling for Bluesim.
$ make v_link
Linking for Verilog sim ...
bsc -e mkTop -verilog -o ./mkTop_v_sim -vdir verilog_RTL -bdir build_v -info-dir build_v -vsim iverilog verilog_RTL/mkTop.v
Verilog binary file created: ./mkTop_v_sim
Linking for Verilog sim finished
-
-vsim iverilog
instructsbsc
to link for the IVerilog simulator.
To become a proficient Haskell programmer, in particular to write efficient programs, you have to understand its dynamic model (to various levels of abstraction).
Similarly, to become a proficient BH/BSV programmer, in particular to produce efficient hardware, you have to understand its dynamic model (to various levels of abstraction).
This difference in dynamic models is the biggest jump in reorienting yourself into a designer of good hardware.
Please visit this directory
$ cd Examples/Eg020b_HelloWorld
And bring up the files src/Top.bs
and src/DeepThought.bs
in an
editor, preferably in a separate editor window.
In the DeepThought.bs
we see this module type declaration:
mkDeepThought :: Module DeepThought_IFC
This says the module’s interface has the type DeepThought_IFC
. An
interface of a module defines the methods by which the environment
interacts with the module (similar to an object interface in an
object-oriented programming language).
Note
|
In the first version of this Hello World example, the interface
type was the predefined type Empty , which is an interface with no
methods. This is typically used in top-level modules.
|
This particular interface is defined a few lines earlier:
interface DeepThought_IFC =
getAnswer :: ActionValue (Int 32)
It has a single method, getAnswer, whose type is the monadic type ActionValue (Int 32): it is an Action that, when performed, returns a
value of type Int 32
(which is the type of 32-bit signed integers).
Returning to the module value definition:
mkDeepThought =
module
interface DeepThought_IFC
getAnswer = return 42
we see the implementation of the getAnswer
method: it is simply the
monadic expression return 42
.
Just preceding the module definition we have the following bsc
compiler directive:
{-# verilog mkDeepThought #-}
Both the source code (BH/BSV) and bsc
's target code (Verilog) have
the concept of “modules” and “module hierarchy”, and they are in
fact congruent, i.e., each of them has the same concept of module
instantiations and module hierarchy (about which we’ll see more in a
moment). The above directive instructs bsc
to preserve
this module boundary into the generated Verilog when compiling this
module. Without this directive, a BH/BSV module is in-lined wherever
it is instantiated.
Note
|
This directive is not universally applicable to any BH/BSV module. A Verilog module’s interface consists of wires (individual wires, or bundles of wires called buses), i.e., they can only carry values for which we’ve defined a concrete hardware representation in bits. BH/BSV module interfaces can carry polymorphic values, higher-order functions, etc., for which we typically do not have a concrete representation in bits; such modules cannot carry this directive. Thus, the Verilog module hierarchy produced from a BH/BSV program is typically a coarser projection of the source hierarchy. |
At the top of DeepThought.bs
we have:
package DeepThought where
and in Top.bs
we have:
import DeepThought
BH/BSV’s package
construct plays the same role as module does in Haskell.
-
Every file Foo.bs (BH) or Foo.bsv (BSV) is a package, and must be introduced by the package Foo construct
-
A file can only contain one package.
-
-
A package can import a package
Foo
by saying import Foo (we’ll see that shortly in Top.bs). The reason for the identity between the package name and the file name is that bsc uses the package name to search the file system for a file with that name, with extension .bs or .bsv. -
When package
Foo
is imported, the identifiers defined there become visible and usable in the importing package.-
The vocabulary for selective export and import is not as rich as in Haskell; please see [RefGuide] for details.
-
-
As in Haskell, if you import two packages which both define the same identifier
x
, you can disambiguate them using the package name.
Note
|
The package vs. module dissonance with Haskell is
unfortunate. In the hardware-design universe, the word module has
long been used in Verilog and SystemVerilog with the meaning you see
here; we decided to keep this convention because of familiarity for
hardware designers (who typically have no exposure to Haskell).
|
Earlier, when compiling a BH program, the command we invoked had the -u
flag:
$ bsc -u -sim ...
This flag tells bsc
to chase the “import” s transitively from the
specified top-level file/package, and compile each such file if it is
out-of-date, i.e., if the source has changed since the
compiler-intermediate files were last created. This is a bit of
Makefile
-like dependency-tracking built into bsc
.
Note
|
BH and BSV are completely interoperable at package granularity, i.e., a package written in BH can import a package written in BSV and vice versa. We will make use of this in later examples when we import, from BH, code for a RISC-V CPU that happens to be written in BSV. The Bluespec standard libraries are a mix: some packages are written in BH, some in BSV. |
In Top.bs
the module is now defined like this:
mkTop =
module
deepThought <- mkDeepThought -- (A)
rules
"rl_print_answer": when True ==> do
x <- deepThought.getAnswer
$display "\n\n***** Deep Thought says: Hello, World! *****"
$display " And the answer is: %0d (or, in hex: 0x%0h)\n" x x
$finish
In line (A), we instantiate the module mkDeepThought. As the ←
notation suggests, this is a monadic statement. The right-hand-side
is an expression of type Module DeepThought_IFC
; the action performed is
to instantiate the module and return its interface, of type
DeepThought_IFC
. Thus, deepThought::DeepThought_IFC
.
In the first line of the rule, the ←
again suggests a monadic
statement. The right-hand side is of type ActionValue (Int 32)
; it
performs the action and returns a value of type Int 32
; thus x::Int 32
.
The two major monads used in BH/BSV are the module monad and the action monad (both of which you see in the above example):
-
The module monad concerns STRUCTURE, or circuit description. It is evaluated by
bsc
to elaborate a module hierarchy, starting with a top-level module, elaborating the modules it instantiates, elaborating the modules that those modules instantiate, and so on, recursively. -
The action monad concerns BEHAVIOR: its action is performed during circuit execution.
Note
|
The module monad is actually a polymorphic IsModule typeclass,
so you can customize the module syntax to collect items other than
just sub-module instances, rules and interfaces. For example, you
can collect trace output to be captured in modules deep in the
module hierarchy and streamed out to be recorded somewhere. Or
you could collect “control and status” registers from deep
within the module hierarchy, for dynamic configuration and
monitoring of a hardware design.
|
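The elaboration behavior of the module monad can be imitated in plain Haskell as a writer-style monad that merely collects rule names while modules instantiate sub-modules. `Mod`, `addRule`, `mkTop'` and `mkDeepThought'` below are invented for this illustration; the real module monad collects much more (state instances, interfaces, etc.):

```haskell
-- A toy "module monad": elaboration just collects rule names.
newtype Mod a = Mod { runMod :: ([String], a) }

instance Functor Mod where
  fmap f (Mod (w, a)) = Mod (w, f a)

instance Applicative Mod where
  pure a = Mod ([], a)
  Mod (w1, f) <*> Mod (w2, a) = Mod (w1 ++ w2, f a)

instance Monad Mod where
  Mod (w1, a) >>= k = let Mod (w2, b) = k a in Mod (w1 ++ w2, b)

addRule :: String -> Mod ()
addRule name = Mod ([name], ())

-- "Instantiating" a sub-module contributes its rules to the parent.
mkDeepThought' :: Mod ()
mkDeepThought' = addRule "rl_think"

mkTop' :: Mod ()
mkTop' = do
  mkDeepThought'            -- sub-module instantiation
  addRule "rl_ask"
  addRule "rl_print_answer"

main :: IO ()
main = print (fst (runMod mkTop'))
-- prints ["rl_think","rl_ask","rl_print_answer"]
```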
Please visit this directory
$ cd Examples/Eg020c_HelloWorld
And bring up the files src/Top.bs
and src/DeepThought.bs
in an
editor, preferably in a separate editor window.
This version of Hello World! adds a temporal dimension. The mkTop
module will request an answer from the mkDeepThought
module, which
now has a state machine so that it “thinks” for “7.5 Million
Years” before yielding its answer of 42. Meanwhile, mkTop
waits.1
1 We are of course paying homage to the book The Hitchhiker’s Guide to the Galaxy by Douglas Adams (1979). In the book, a supercomputer named Deep Thought is asked to calculate the Answer to the Ultimate Question of Life, the Universe, and Everything. After 7.5 million years, it answers: “42”.
In DeepThought.bs
we see a new interface definition:
interface DeepThought_IFC =
whatIsTheAnswer :: Action
getAnswer :: ActionValue (Int 32)
The new method whatIsTheAnswer
has type Action
, which is a pure
side-effect. You can think of it as equivalent to ActionValue Void
,
i.e., the return value carries no interesting information. mkTop
will use this method to request an answer.
The getAnswer
method has the same type as before, but we will see
that it won’t yield an answer immediately (not for “7.5 million
years”).
In general, all methods will have one of three return-types:
-
Action
: a pure side effect (an action) -
ActionValue
t : a side effect (an action) plus a returned value of type t -
t : a pure value with no side effect (no action)
Note
|
Since a rule condition’s type is Bool, it is guaranteed by the type system to have no actions/effects.
|
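As a rough analogy (ours, not the tutorial's), the three return-types correspond to `IO ()` for `Action`, `IO t` for `ActionValue t`, and an ordinary pure value for `t`. The hypothetical `bump`/`takeValue`/`peek` methods below illustrate the mapping, with an `IORef` standing in for module state:

```haskell
import Data.IORef

-- Action        ~ IO ()   (side effect only)
-- ActionValue t ~ IO t    (side effect plus a returned value)
-- t             ~ t       (pure value, no side effect)

bump :: IORef Int -> IO ()        -- like an Action method
bump r = modifyIORef r (+ 1)

takeValue :: IORef Int -> IO Int  -- like an ActionValue Int method
takeValue r = do v <- readIORef r
                 writeIORef r 0
                 return v

peek :: Int -> Int                -- like a pure value method
peek = id

main :: IO ()
main = do
  r <- newIORef 41
  bump r
  v <- takeValue r
  print (peek v)  -- prints 42
```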
In DeepThought.bs
we see a new type definition:
data State_DT = IDLE | THINKING | ANSWER_READY
deriving (Eq, Bits, FShow)
-
deriving Eq
is just like Haskell, i.e., the compiler defines the Eq typeclass operators == and /= for this new State_DT type. -
The
Bits t n
typeclass has two functions to convert a type to bits and back: pack :: t -> Bit n and unpack :: Bit n -> t
For each constructor of an algebraic type, this is canonically a concatenation of the bits for the component type with just enough bits to represent the constructor uniquely, possibly with some padding of “don’t care” bits so that each disjunct has exactly the same number of bits. Our example will pack into a
Bit 2
type with codings 0, 1 and 2 for the three disjuncts. -
The
FShow
typeclass is similar toShow
in Haskell, and provides one function:fshow :: t -> Fmt
A value of
Fmt
type is a pre-formatted object that can be used as an argument to $display and $write. For example, $display "Current state %0d, next state" IDLE (fshow THINKING) will print “Current state 0, next state THINKING”. I.e., without fshow it just prints the value of the bit representation, but with fshow it prints the symbolic value.
Note
|
Instead of $display, which actually prints things, you can use
$format, which is a pure function with the same kinds of arguments as
$display and which returns a Fmt object. Further, values of Fmt
type can be concatenated with + to create new Fmt values. Thus,
you can write functions that nicely format complex types.
|
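The canonical `Bits` encoding and the effect of `fshow` can be imitated in plain Haskell. `packDT`/`unpackDT` below are hand-written stand-ins (not bsc's derived instances), using `Int` in place of `Bit 2`:

```haskell
-- Hand-written stand-ins for the derived pack/unpack and fshow,
-- using Int in place of Bit 2 (codes 0, 1, 2; code 3 is "don't care").
data StateDT = IDLE | THINKING | ANSWER_READY
  deriving (Eq, Show, Enum, Bounded)

packDT :: StateDT -> Int
packDT = fromEnum

unpackDT :: Int -> StateDT   -- partial: only codes 0..2 are valid
unpackDT = toEnum

main :: IO ()
main = do
  print (packDT ANSWER_READY)   -- prints 2 (the bit representation)
  print (unpackDT 1)            -- prints THINKING (symbolic, like fshow)
```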
In DeepThought.bs
, in module mkDeepThought
we first see the instantiation of a “register”:
rg_state_dt :: Reg State_DT <- mkReg IDLE
This is a module instantiation, just like the one we saw earlier
(instantiation of mkDeepThought
in mkTop
).
Note
|
In BH/BSV, registers are not special, they are just pre-defined modules that are found at the leaves of the module hierarchy. |
The right-hand side has type Module (Reg State_DT)
; the monadic ←
instantiates the module and returns the interface of type Reg
State_DT
which is bound to the identifier rg_state_dt
. mkReg
has
type:
mkReg :: t -> Module (Reg t)
The argument is the “initial value” of the register, i.e., the value
it has when execution begins. Reg t
is just an interface type with two methods:
_read :: Reg t -> t
_write :: Reg t -> t -> Action
Normally invoking a method meth in interface ifc is written with
the traditional “dot” notation: ifc.meth . Writing ._read
and
._write
for registers can be quite tedious, so there are some
syntactic shorthands. For example,
rg_half_millenia._write (rg_half_millenia._read + 1)
can be written with this syntactic shorthand:
rg_half_millenia := rg_half_millenia + 1
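The `Reg` interface and the `:=` shorthand can be modeled in Haskell with an `IORef`. This `Reg`, `mkReg` and `incr` are illustrative stand-ins, not the BH library definitions:

```haskell
import Data.IORef

-- An illustrative Reg interface with _read / _write methods,
-- modeled on an IORef (not the BH library definition).
data Reg t = Reg { _read :: IO t, _write :: t -> IO () }

mkReg :: t -> IO (Reg t)   -- the argument is the initial value
mkReg v0 = do
  r <- newIORef v0
  return (Reg (readIORef r) (writeIORef r))

-- The ":=" shorthand, spelled out as an explicit read-then-write:
incr :: Reg Int -> IO ()
incr rg = do v <- _read rg
             _write rg (v + 1)

main :: IO ()
main = do
  rg <- mkReg (0 :: Int)
  incr rg
  incr rg
  _read rg >>= print   -- prints 2
```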
In DeepThought.bs
, we next see instantiation of another register
which we use to count “half-millennium” steps, initialized to zero.
It is 4 bits wide, so it can count from 0 to 15 (= 7.5 millennia).
rg_half_millenia :: Reg (Bit 4) <- mkReg 0
Next we see some local definitions:
let millenia = rg_half_millenia [3:1]
let half_millenium = rg_half_millenia [0:0]
The notation e [3:1]
selects bits 1 through 3 of the value of e,
which must have Bit n
type, and has type Bit 3
. Bits are numbered
in “little-endian” order, with 0 being the least-significant bit.
The notation e [0:0]
can be used to select a single bit, and has type Bit 1
.
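For intuition, the same selection can be sketched in Haskell with shifts and masks. `bitSlice` is our own model of `e [hi:lo]`, with `Int` standing in for `Bit n`:

```haskell
import Data.Bits (shiftL, shiftR, (.&.))

-- Our sketch of BH's e[hi:lo] selection on a little-endian bit
-- vector, with Int standing in for Bit n.
bitSlice :: Int -> Int -> Int -> Int
bitSlice hi lo e = (e `shiftR` lo) .&. ((1 `shiftL` (hi - lo + 1)) - 1)

main :: IO ()
main = do
  -- 13 is 1101 in binary, a value rg_half_millenia might hold:
  print (bitSlice 3 1 13)  -- bits [3:1] = 110, prints 6
  print (bitSlice 0 0 13)  -- bit  [0:0] = 1,   prints 1
```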
Note
|
Any pure expression (that is not an Action or ActionValue
type) represents a so-called “combinational circuit”, i.e., it
can be compiled to an acyclic circuit of AND, OR and NOT gates.
|
In DeepThought.bs
, the first interface method is implemented like this:
whatIsTheAnswer = rg_state_dt := THINKING
when (rg_state_dt == IDLE)
The second line is the “method condition”, which states when the method is eligible to be invoked by the environment. When it is invoked, it performs the action described in the first line (a register assignment).
Important
|
BH/BSV methods differ from methods in object-oriented languages precisely because of method conditions, which act as “guards” on the methods. As we shall soon see, every method is invoked by a rule (either directly, or indirectly via other methods); rules are the fundamental “processes” of BH/BSV. When a method condition is false, it has the effect of “stalling the process”, i.e., the rule from which it is invoked cannot fire. |
In DeepThought.bs
, the module still has only one rule, called
rl_think
. It has a non-constant rule condition
(rg_state_dt == THINKING)
which limits when the rule is allowed to fire. It cannot fire when
execution begins, since rg_state_dt
is initialized to IDLE
.
However, when method whatIsTheAnswer
is invoked, the method sets the
state to THINKING
, which enables the rule condition and allows it to
fire.
When the rule does fire, it issues this message:
$write " DeepThought: ... thinking ... (%0d" millenia
if (half_millenium == 1) then $write ".5" else noAction
$display " million years)"
The only difference between $write
and $display
is that the former
does not emit an end-of-line while the latter does (these are
long-standing in Verilog).
This conditional expression (both arms are of type Action) either
increments our counter or changes the state to ANSWER_READY
:
if (rg_half_millenia == 15) then
rg_state_dt := ANSWER_READY
else
rg_half_millenia := rg_half_millenia + 1
Thus, the rule will fire repeatedly, each time incrementing
rg_half_millenia
, until it finally assigns ANSWER_READY
to
rg_state_dt
. At this point the rule condition is false and so it
cannot fire any more.
At this point, the method condition of the second method becomes true:
getAnswer = do
rg_state_dt := IDLE
rg_half_millenia := 0
return 42
when (rg_state_dt == ANSWER_READY)
When the environment invokes this method, it performs its actions, namely restoring the registers to their initial values, and it returns the value 42.
Turning to Top.bs
, we see that it has two rules, the latter being
the same as in the previous version:
"rl_ask": when True ==> do
$display "Asking the Ultimate Question of Life, The Universe and Everything"
deepThought.whatIsTheAnswer
"rl_print_answer": when True ==> do
x <- deepThought.getAnswer
$display "Deep Thought says: Hello, World! The answer is %0d." x
$finish
Let us follow the “flow” through the program execution. The program has a total of three rules, two in mkTop and one in mkDeepThought.
-
At start of execution, only one rule can fire:
  - rl_ask can fire, because its own rule condition is true and the condition on the method it invokes, deepThought.whatIsTheAnswer, is also true.
  - rl_print_answer cannot fire because, even though its rule condition is true, the method condition on deepThought.getAnswer is not.
  - rl_think cannot fire because its rule condition is false.
-
rl_ask fires, and invokes the deepThought.whatIsTheAnswer method. At this point, again, only one rule can fire:
  - rl_think can fire because its rule condition is true.
  - The other two rules cannot fire, because the method conditions of the methods they invoke are false.
-
rl_think fires, and it increments the counter. This repeats 15 times until, again, only one rule can fire:
  - rl_ask cannot fire (method condition false)
  - rl_think cannot fire (rule condition false)
  - rl_print_answer can fire (rule condition and method condition true)
-
rl_print_answer fires, prints the answer, and the execution terminates.
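This firing sequence can be mimicked with a toy Python simulator in which each rule is a (guard, action) pair and one enabled rule fires per step. This is a sketch of the rule semantics only, not of the generated hardware:

```python
# Shared state: rg_state_dt, whether rl_ask has fired, rg_half_millenia, and the answer.
state = {"dt": "IDLE", "asked": False, "half_millenia": 0, "answer": None}

def think_action():
    # rl_think body: increment the counter, or move to ANSWER_READY on the last count
    if state["half_millenia"] == 15:
        state["dt"] = "ANSWER_READY"
    else:
        state["half_millenia"] += 1

def answer_action():
    # rl_print_answer body, including getAnswer's actions (reset registers, return 42)
    state["answer"] = 42
    state["dt"] = "IDLE"
    state["half_millenia"] = 0

rules = [
    # rl_ask: its own condition plus whatIsTheAnswer's method condition (dt == IDLE)
    (lambda: not state["asked"] and state["dt"] == "IDLE",
     lambda: state.update(asked=True, dt="THINKING")),
    # rl_think: fires only while dt == THINKING
    (lambda: state["dt"] == "THINKING", think_action),
    # rl_print_answer: gated by getAnswer's method condition (dt == ANSWER_READY)
    (lambda: state["dt"] == "ANSWER_READY", answer_action),
]

steps = 0
while state["answer"] is None:
    fire = next(action for guard, action in rules if guard())
    fire()
    steps += 1
```

rl_ask fires once, rl_think fires 16 times (15 increments plus the transition to ANSWER_READY), and rl_print_answer fires once.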
Using Bluesim or Verilog sim:
$ pwd
... ICFP2020_Bluespec_Tutorial/Examples/Eg020c_HelloWorld
$ make b_compile b_link b_sim
...
Asking the Ultimate Question of Life, The Universe and Everything
DeepThought: ... thinking ... (0 million years)
DeepThought: ... thinking ... (0.5 million years)
DeepThought: ... thinking ... (1 million years)
DeepThought: ... thinking ... (1.5 million years)
DeepThought: ... thinking ... (2 million years)
DeepThought: ... thinking ... (2.5 million years)
DeepThought: ... thinking ... (3 million years)
DeepThought: ... thinking ... (3.5 million years)
DeepThought: ... thinking ... (4 million years)
DeepThought: ... thinking ... (4.5 million years)
DeepThought: ... thinking ... (5 million years)
DeepThought: ... thinking ... (5.5 million years)
DeepThought: ... thinking ... (6 million years)
DeepThought: ... thinking ... (6.5 million years)
DeepThought: ... thinking ... (7 million years)
DeepThought: ... thinking ... (7.5 million years)
Deep Thought says: Hello, World! The answer is 42.
We continue to reinforce, as in Fig. Haskell vs. Bluespec BH/BSV, that we are describing actual hardware, not some representation in memory manipulated by a program on a conventional computer.
In the figure, we show module instances as sharp-cornered rectangles in blue, and rules and methods as rounded-cornered rectangles in yellow.
Important
|
Our choice of yellow round-cornered rectangles for depicting rules and methods is deliberate. Both have Bool conditions and bodies. Conceptually, the overall condition determining whether a rule fires is both its own condition as well as the conditions on the methods it invokes. The overall body of a rule is both its own body as well as the bodies of the methods it invokes. You can think of the source code as a way to compose a logical composite rule from components: a syntactic rule in a module, plus the methods it invokes in other modules, and, transitively, the methods that those methods may invoke, etc. |
-
The RDY signals of the methods are the method conditions.
-
Each rule combines the RDY signal with its own rule condition, resulting finally in a WILL_FIRE signal indicating when the rule will fire.
-
The WILL_FIRE signal of a rule activates all the ENA signals of Action and ActionValue methods that it invokes.
Now let’s run the Bluesim simulation with the -V
flag:
$ ./mkTop_b_sim -V
or the IVerilog simulation using the +bscvcd
flag:
$ ./mkTop_v_sim +bscvcd
In each case, it will produce a file called dump.vcd
. VCD stands
for “Value Change Dump”, a standard feature of hardware simulations;
it is essentially a dump of all 0→1 and 1→0 transitions in the
circuit, and thus represents a detailed “time trace” of hardware
behavior.
If you have installed the GtkWave
application (free, open source,
available standardly with most package installers, such as apt-get
install gtkwave
on Linux), you can view the VCD file:
$ gtkwave dump.vcd
The GUI has various controls to select which waves and what time range you want to see. Here is a snapshot:
The top-left window shows the module hierarchy. The figure shows
deepThought
inside top
(which, in turn, is inside a boilerplate
main
provided by the bsc
libraries). When you click on a module
instance, the window below it shows the signals in that
module. Selecting a signal and clicking Append
brings up the signal
in the main signal window. We have selected a few signals of interest
for this example.
The top waveform is the clock signal CLK
.
The next two waves show the method condition of whatIsTheAnswer
(CAN_FIRE_…
) and overall condition of the rl_ask
rule
(WILL_FIRE_…
). You can see that they are enabled (true) on the
first clock, and false subsequently.
The next two waves show the value of the rg_state_dt and rg_half_millenia registers. The former starts at 0 (encoding of IDLE), moves to 1 (THINKING) and finally to 2 (ANSWER_READY). The latter starts at 0 and increments up to 15 (hex 0xF).
The final two waves show the method condition for getAnswer and the overall rule condition for rl_print_answer. These are false until rg_state_dt becomes ANSWER_READY.
-
You should see exactly the same waveforms whether you use Bluesim, IVerilog simulation, or any other Verilog simulator.
-
In traditional Hardware Design Languages (HDLs) like Verilog, SystemVerilog and VHDL, there can be a difference between what you see in simulation (VCD viewing) and what you see in real hardware (hooking up an oscilloscope to the actual electronic circuit). This is because these languages are defined based on a simulation model which is more expressive than actual hardware (you can specify arbitrary time intervals within clocks, signals can be sampled at arbitrary time points within clocks, etc.).
To eliminate nasty surprises, one is advised to stick to the so-called “synthesizable” subset of Verilog/ SystemVerilog/ VHDL, which avoids those tricky constructs.
In BH/BSV there is no such difference between simulation and hardware semantics.
-
Please observe, in the above diagrams, that the Rule-level view and the Verilog RTL view have the same broad structure, differing only in the level of detail.
-
This is why it is not hard to correlate Verilog simulation and VCD waves produced by Verilog simulation to BH/BSV source code.
-
This is quite different from the situation with so-called HLS (High Level Synthesis). In HLS, the source code is C/C++ (with the execution semantics of C/C++), and the HLS tool produces a microarchitecture for you (modules, logic, connections). The designer can provide directives that influence the microarchitecture, but does not specify the microarchitecture precisely.
-
In HLHDLs like BH/BSV and Chisel, you specify the microarchitecture precisely.
In the previous examples, we saw types such as Int 32
and Bit 4
,
for 32-bit signed integers and 4-bit-wide bit vectors. In BH/BSV,
there are several kinds of types. Most types we use, such as
scalars, vectors, functions, and so on are of the ordinary kind
(notated *
in code).
Another kind is the numeric kind (notated #
in code), as in the
above examples. Though they look like numeric values, they only live
in type expressions, and should not be confused with numeric values.
Only fixed-width types are supported in hardware. BH/BSV has an unbounded Integer type, but it can only be used for static elaboration, i.e., structural description, not for values represented in hardware.
Arithmetic on fixed-width types wraps around silently, as it does in hardware, so you need to be careful about this.
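For example, in a hypothetical Python model of n-bit arithmetic:

```python
def add_wrap(a, b, width=4):
    """n-bit addition wraps modulo 2**width, as the hardware adder does."""
    mask = (1 << width) - 1
    return (a + b) & mask

print(add_wrap(15, 1))           # 4-bit: 15 + 1 wraps around to 0
print(add_wrap(200, 100, width=8))  # 8-bit: 300 mod 256
```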
BH/BSV has strong type-checking on numeric types, and there are no
automatic coercions to silently extend or truncate types to different
widths. Thus, if a context expects a Bit 4
and the expression there
has type Bit 5
, the compiler will report it as a type error. You
have to explicitly use the primitive functions zeroExtend
and
signExtend
and truncate
if you wish to convert the width of a
value.
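The behavior of these three primitives can be sketched in Python (widths are passed explicitly here, since Python integers are unbounded, whereas in BH/BSV the widths come from the types):

```python
def zero_extend(v, from_w, to_w):
    """Pad with zero bits on the left; the unsigned value is unchanged."""
    return v & ((1 << from_w) - 1)

def sign_extend(v, from_w, to_w):
    """Replicate the sign bit (bit from_w-1) into the new upper bits."""
    v &= (1 << from_w) - 1
    if v >> (from_w - 1):                       # sign bit set?
        v |= ((1 << to_w) - 1) ^ ((1 << from_w) - 1)
    return v

def truncate(v, to_w):
    """Drop the upper bits, keeping the low to_w bits."""
    return v & ((1 << to_w) - 1)
```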
Note
|
In software we are used to thinking of a few, fixed, byte-multiple sizes for integers, typically from 1 to 4 bytes, because that’s what our underlying processors and memories use. In hardware design, there is nothing special about byte-multiples, bytes, 64-bit width limits, etc.; one typically picks exactly the bit-width needed for purpose. In such a scenario, we are skating much closer to the edge concerning wraparounds, overflows etc., so one needs to be much more careful about this. BH/BSV’s strong type-checking on numeric types is a great benefit for this. |
Functions and types can be polymorphic on numeric types, and BH/BSV
also has typeclasses on numeric types that encode limited arithmetic
on such types. Thus, for example, we can express (and typecheck) a
function that concatenates values of type Bit m
and Bit n
to
produce a value of type Bit n+m
. Here, n
and m
are type
variables of numeric kind.
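A Python sketch of such width-aware concatenation (the widths appear as explicit values here, since Python has no numeric types; in BH/BSV they live in the types and the width arithmetic is checked by the compiler):

```python
def concat_bits(m_val, m, n_val, n):
    """Concatenate a Bit m value (upper bits) with a Bit n value (lower bits),
    yielding a Bit (m+n) value."""
    m_val &= (1 << m) - 1
    n_val &= (1 << n) - 1
    return (m_val << n) | n_val

print(bin(concat_bits(0b101, 3, 0b11, 2)))  # a 5-bit result
```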
Everyone who’s learned programming is familiar with Bubblesort. It’s a very inefficient sorting algorithm, but is useful as a pedagogical tool to learn some basic programming constructs. Here, we’ll use it as a vehicle to introduce concurrency in BH/BSV.
We’ll start with a traditional sequential version, then add concurrency, and then generalize the code in various ways (polymorphism and type-dependent ordering, size of array, etc.).
Please visit this directory:
$ cd Examples/Eg030a_Bubblesort
and examine the source files in an editor:
src/Top.bs
src/Bubblesort.bs
The overall strategy (very artificial!) is the following:
-
In the top-level module mkTop (in Top.bs), we will generate (sequentially) five random values of type Int 32 and feed them (sequentially) into the mkBubblesort module (in Bubblesort.bs).
-
mkBubblesort will then sort the numbers and indicate completion.
-
mkTop will then drain the values (sequentially) in sorted order, and print the results.
We start by defining its interface:
interface Sort_IFC =
put :: (Int 32) -> Action
get :: ActionValue (Int 32)
-
The put method will be invoked repeatedly by mkTop, supplying an Int 32 argument each time, to load the sorter.
-
Finally, the get method will be invoked repeatedly by mkTop, each time extracting an Int 32 value, in sorted order.
In the mkBubblesort
module, we instantiate a number of registers to
hold the “state” of the module.
This register will be used to count-in and count-out inputs, so we know when we’ve received the correct number of inputs and have delivered the correct number of outputs. The register contents have type UInt 3 (unsigned 3-bit integer), which is adequate for our purpose (or we could have used Bit 3 directly here):
rg_j :: Reg (UInt 3) <- mkReg 0
Note
|
The initial value must have the register-content type UInt 3 ;
so what is the type of 0 ? It’s actually of type Integer
(unbounded integers), but BH/BSV plays the same Haskell trick of
applying fromInteger implicitly to integer literals, and typeclass
machinery will produce the required UInt 3 .
|
Next, because we’re going to imitate the classical sequential program, we’re going to use a “Program Counter” to sequence it through the steps:
rg_pc :: Reg (Bit 3) <- mkReg 0
This register will remember, in each sweep through the array, whether or not any swaps were performed, in order to detect termination (no swaps ⇒ all sorted):
rg_swapped :: Reg Bool <- mkRegU
Finally, the registers holding the values to be sorted.1 We don’t bother initializing them since we’ll be loading them before sorting.
x0 :: Reg (Int 32) <- mkRegU
x1 :: Reg (Int 32) <- mkRegU
x2 :: Reg (Int 32) <- mkRegU
x3 :: Reg (Int 32) <- mkRegU
x4 :: Reg (Int 32) <- mkRegU
1 If you are squirming at this tedious repetition: don’t worry, we’ll use a vector, soon.
As mkTop
repeatedly calls the put
method to deliver input values,
we’ll “shift them in” to the holding registers. This help-function
expresses the idea:
let shift :: Int 32 -> Action
shift y = action { x0 := x1; x1 := x2; x2 := x3; x3 := x4; x4 := y }
Note, these are five simultaneous actions; there is no semantic sequencing between them. Think of them as five parallel assignments.
Here is the put
method that delivers input values:
put x = do
shift x
rg_j := rg_j + 1
if1 (rg_j == 4)
action { rg_pc := 1; rg_swapped := False }
when (rg_pc == 0)
It is only enabled when rg_pc
is 0 (we’ll use non-zero as an
indication of “busy”). It shifts the new value x
in and
increments rg_j
. On the last value (rg_j==4
), we set rg_pc
to 1
and initialize rg_swapped
to False.
The sorting behavior is expressed by a series of 5 rules:
"rl_swap_0_1": when (rg_pc == 1)
==> do
if1 (x0 > x1)
action { x0 := x1; x1 := x0; rg_swapped := True }
rg_pc := 2
"rl_swap_1_2": when (rg_pc == 2)
==> do
if1 (x1 > x2)
action { x1 := x2; x2 := x1; rg_swapped := True }
rg_pc := 3
...
Each rule’s condition only enables it to fire when rg_pc
has a
particular value. It swaps an adjacent pair of registers if they are
in the wrong order, and advances rg_pc
to enable the next rule.
Note: in a traditional imperative language, a swap usually needs an extra temporary variable because assignment statements are sequential:
tmp := x1
x1 := x0
x0 := tmp
In BH/BSV, the actions in an Action block are all simultaneous (parallel assignments), so there is no need for an intermediate temporary variable.
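Python’s tuple assignment has the same read-all-then-write-all flavor, which illustrates why no temporary is needed:

```python
x0, x1 = 7, 3
# All right-hand sides are evaluated with the OLD values before any write happens:
x0, x1 = x1, x0
print(x0, x1)   # the values are exchanged
```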
If any rl_swap rule actually performs a swap, it will have set rg_swapped true. We use this in the final rule to decide whether to loop or signal completion (by setting rg_pc to 6):
"rl_loop_or_exit": when (rg_pc == 5)
==> do
if (rg_swapped) then
action { rg_swapped := False; rg_pc := 1 }
else
rg_pc := 6
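The whole FSM can be modeled behaviorally in Python, with rg_pc selecting which rule body runs each step (rule names follow the source, but this is ordinary software, not a description of the hardware):

```python
def bubblesort_fsm(values):
    xs = list(values)            # models the registers x0..x4
    pc, swapped = 1, False       # rg_pc, rg_swapped just after the last 'put'
    while pc != 6:
        if 1 <= pc <= 4:         # rules rl_swap_0_1 .. rl_swap_3_4
            i = pc - 1
            if xs[i] > xs[i + 1]:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]   # parallel swap, no temp
                swapped = True
            pc += 1
        else:                    # rl_loop_or_exit (pc == 5)
            if swapped:
                swapped, pc = False, 1
            else:
                pc = 6
    return xs                    # pc == 6: sorted, ready for 'get'

print(bubblesort_fsm([1, 142, 71, 173, 216]))
```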
Output is delivered with repeated invocations of the get
method.
This is only enabled when rg_pc is 6 and there are > 0 elements remaining in the register array (rg_j /= 0). It returns x0 (the smallest of the remaining sorted values), but its actions also shift the array, decrement rg_j, and, on the last value, reset rg_pc to 0, making the module ready for its next use.
get = do
shift _
rg_j := rg_j - 1
if1 (rg_j == 1) (rg_pc := 0)
return x0
when ((rg_j /= 0) && (rg_pc == 6))
We import a standard BH/BSV library that implements a module generating pseudo-random numbers (linear feedback shift register):
import LFSR
whose interface looks like this:
Bits t n => interface LFSR t
value :: t
next :: Action
It can be used on any type t that has a Bit n representation. The value method yields the current random value, and the next method advances it to generate the next random value.
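For illustration, here is a generic 8-bit Fibonacci LFSR in Python. The tap positions chosen here are one standard maximal-length choice; the actual polynomial used by the library’s mkLFSR_8 may differ:

```python
class LFSR8:
    """8-bit Fibonacci LFSR; taps at bits 7,5,4,3 (polynomial x^8+x^6+x^5+x^4+1)."""
    def __init__(self, seed=1):
        assert seed != 0, "an all-zero state would never change"
        self.value = seed & 0xFF          # like the 'value' method

    def next(self):                       # like the 'next' Action method
        s = self.value
        fb = ((s >> 7) ^ (s >> 5) ^ (s >> 4) ^ (s >> 3)) & 1
        self.value = ((s << 1) | fb) & 0xFF

lfsr = LFSR8()
samples = []
for _ in range(5):
    samples.append(lfsr.value)
    lfsr.next()
```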
We instantiate two registers to count-off the input and output sequences, instantiate the random-number generator (which, here, generates values of type Bit 8), and instantiate the Bubblesort module:
rg_j1 :: Reg (Int 32) <- mkReg 0
rg_j2 :: Reg (Int 32) <- mkReg 0
lfsr :: LFSR (Bit 8) <- mkLFSR_8
sorter :: Sort_IFC <- mkBubblesort
This rule feeds inputs. It has an explicit condition, but it can also only fire when sorter.put is enabled. The zeroExtend typeclassed function extends Bit 8 to Bit 32. The unpack function (in the Bits typeclass) converts Bit 32 to Int 32.
"rl_feed_inputs": when (rg_j1 < n)
==> do
let v :: Bit 32 = zeroExtend lfsr.value
lfsr.next
let x :: Int 32 = unpack v
sorter.put x
rg_j1 := rg_j1 + 1
$display "%0d: x_%0d = %0d" cur_cycle rg_j1 x
This rule drains and prints outputs. It has an explicit condition, but it can also only fire when sorter.get is enabled.
"rl_drain_outputs": when (rg_j2 < n)
==> do
y <- sorter.get
rg_j2 := rg_j2 + 1
$display " %0d: y_%0d = %0d" cur_cycle rg_j2 y
if1 (rg_j2 == n-1) $finish
Use Bluesim or Verilog sim, and generate a VCD waveform file:
$ pwd
... ICFP2020_Bluespec_Tutorial/Examples/Eg030a_Bubblesort
$ make b_compile b_link b_sim_vcd
...
Bluesim simulation ...
./mkTop_b_sim
1: x_0 = 1
2: x_1 = 142
3: x_2 = 71
4: x_3 = 173
5: x_4 = 216
16: y_0 = 1
17: y_1 = 71
18: y_2 = 142
19: y_3 = 173
20: y_4 = 216
Bluesim simulation and dumping VCD in dump.vcd finished
Please visit this directory:
$ cd Examples/Eg030b_Bubblesort
and examine the source file src/Bubblesort.bs
(the
top-level file src/Top.bs
is identical to the previous version, we’re
only changing the internal implementation of Bubblesort).
New strategy: the module will perform swaps concurrently. In fact, it starts swapping as inputs are streamed in.
The interface to mkBubblesort
remains unchanged. In the module, we
now provide an initial value for our data registers:
x0 :: Reg (Int 32) <- mkReg maxBound
As in Haskell, maxBound
is resolved using typeclasses to be the
maximum ordered value in the Int 32
type. We use maxBound
as a
“sentinel” value, and assume no actual input to be sorted has that
value (we’ll fix this limitation in a later version).
This expression continually tests whether the array is full and is sorted:1
let done :: Bool
done = ((rg_inj == 5) && (x0 <= x1) && (x1 <= x2) && (x2 <= x3) && (x3 <= x4))
What do we mean by “continually”? Remember that any pure value expression in BH/BSV represents a combinational circuit, an acyclic circuit of AND/OR/NOT gates, and is a pure function of its inputs. Here, each <= is a comparator circuit (a combinational circuit defined in terms of AND/OR/NOT) whose inputs are connected to the bits in the various registers or fixed at a constant value, and with a single output wire carrying the comparison result. We AND all these together to drive a single wire called done. This wire continually carries the value of this computation based on the current values in the registers. This picture shows the hardware:
1 If you are squirming at this tedious repetition: don’t worry, we’ll use a fold
soon.
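As a software analogy, done is just a pure function of the current register values; the hardware re-evaluates it continuously from the register outputs, where software would re-evaluate it on each call:

```python
def done(xs, inj):
    """Pure combinational predicate: all 5 values injected and array in order."""
    return inj == 5 and all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))
```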
Our swap rules are no longer sequenced by any program counter. Here are the first two rules:
"rl_swap_0_1": when (x0 > x1) ==> do
x0 := x1
x1 := x0
"rl_swap_1_2": when (x1 > x2) ==> do
x1 := x2
x2 := x1
Each rule swaps its two registers whenever they’re mis-ordered.
This brings up a very important point: what if both rules are enabled?
I.e., what if x0
and x1
are in the wrong order, and x1
and x2
are also in the wrong order? Then, they’ll both try to read and write
the register they share, x1
. Do we have a race condition?
In BH/BSV semantics, rules are atomic by definition, so we don’t
have to worry about race conditions. It’s up to the implementation to
guarantee that they execute in some logical order (logically either
rl_swap_0_1
before rl_swap_1_2
or the other way around). bsc
will produce hardware that guarantees this.
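Atomicity can be illustrated with a Python sketch: whenever several swap rules are enabled, fire them one at a time in an arbitrary logical order; the final result is sorted regardless of the order chosen. (This models the one-rule-at-a-time semantics only; the generated circuit may execute non-conflicting rules in the same clock.)

```python
import random

def concurrent_sort(values, seed=0):
    xs = list(values)
    rng = random.Random(seed)
    while True:
        # swap rule i is enabled exactly when xs[i] > xs[i+1]
        enabled = [i for i in range(len(xs) - 1) if xs[i] > xs[i + 1]]
        if not enabled:
            return xs
        i = rng.choice(enabled)       # any logical order is fine: rules are atomic
        xs[i], xs[i + 1] = xs[i + 1], xs[i]

print(concurrent_sort([216, 1, 173, 71, 142]))
```

Each swap strictly reduces the number of inversions, so the loop always terminates with a sorted array, whatever order the "scheduler" picks.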
The figure shows additional detail for the hardware for a swapping rule.
First, each register input has a multiplexer (which is basically an
if- or case- expression) to choose one of many possible inputs
(remember that register x1
is updated in both rules rl_swap_0_1
and in rule rl_swap_1_2
).
Second, a Control
block (also a combinational circuit) takes the
basic rule condition (CAN_FIRE
) but also combines it with the
CAN_FIRE
conditions of other rules to further ensure that firing
this rule will not violate atomicity (WILL_FIRE
). The WILL_FIRE
signal of a rule or method essentially drives “enable” signals
(EN
, ENA
) of state elements to control whether or not their state
is updated.
Note
|
Though we may leave atomicity of rules to the bsc compiler, we
may still be curious (and in some cases worried, due to considerations
of fairness, starvation, deadlock etc.) about what logical order is
chosen by the compiler. There are ways to control this, illustrated
in the source file later under ifdef OPTION2 , OPTION3 and
OPTION4 . Please feel free to study that code later.
|
The put method for input now just records the new input in register x4, and does not do any shifting. Its method condition checks that x4 is “empty” (contains maxBound); this prevents put from overwriting this value with the next value.
put x = do
x4 := x
rg_inj := rg_inj + 1
when ((rg_inj < 5) && (x4 == maxBound))
As soon as we’ve registered the first input, x3 and x4 will be mis-ordered because x3 contains maxBound, so rule rl_swap_3_4 will be enabled to swap them.
Swapping can begin as soon as the first input arrives, and can continue as more inputs arrive.
The get method for output returns x0 and shifts as before, except here we shift in maxBound to prepare the module for its next use. On the last output (when x1 == maxBound) we restore rg_inj to 0, making the module ready for its next use (enabling put again).
get = do
x0 := x1
x1 := x2
x2 := x3
x3 := x4
x4 := maxBound
if1 (x1 == maxBound) (rg_inj := 0)
return x0
when done
Use Bluesim or Verilog sim, and generate a VCD waveform file.
Observe that the text output on the terminal is exactly the same as Version (a). What’s different will be the internal concurrency and timing, which is visible in the VCD file.
Compare this with the waveform for the sequential version in
Waveforms from sequential Bubblesort. Here, the WILL_FIRE
signals show swap rules
firing concurrently, and firing even before all inputs have arrived.
The whole computation finishes in fewer cycles (see CLK
).
There are three more example directories in the Bubblesort group. These are mostly an exposition of using standard Haskell-ry (types, higher-order functions, lists) for more expressivity in the hardware description. They should be easy to follow by anyone who knows Haskell. We leave that as an exercise for the reader, but in the next section we briefly discuss a few highlights of the final version.
- Version (c)
-
Generalize from sorting 5 values to sorting n values, using a vector of n registers instead of explicitly listing out all the registers.
- Version (d)
-
Generalize from sorting values of
Int 32
to values of any typet
that can be represented in bits and on which we can define an ordering. - Version (e)
-
Use a Maybe type for register contents so that we can properly identify “empty” registers instead of encoding them using maxBound.
.
Important
|
We remind the reader that all the Haskell-ry in BH/BSV is concerned only with STRUCTURE, i.e., describing the module hierarchy, rules, how they connect with methods, etc. In other words, circuit structure description. The BEHAVIOR of the circuit (how its state and signals evolve over time) is given by rule semantics, which has nothing to do with Haskell per se. |
Note
|
These examples use n registers to hold the values being sorted. This hardware is not reasonable when n is large (100s, 1000s, … billions). For that, one would store the data in an SRAM or DRAM memory module (which has dense, optimized circuitry that can store gigabytes, unlike separate registers). That is the subject of the mergesort example, later. |
The interface type generalizes to:
(Bubblesort_IFC :: # -> * -> *) n_t t
i.e., it becomes a polymorphic type constructor where n_t
is a
numeric type representing the size of the vector of values to be
sorted, and t
is the type of the values to be sorted.
The module type declaration now acquires some typeclass constraints:
mkBubblesort :: (Bits t wt, -- ensures 't' has a hardware bit representation
Ord t, -- ensures 't' has the '<=' comparison operator
Eq t) -- ensures 't' has the '==' comparison operator
=>
Module (Bubblesort_IFC n_t t)
Numeric types can be lowered to corresponding numeric values (of type Integer
):
let n :: Integer = valueOf n_t
Instantiating a vector of registers by repeated mkReg
instantiation using monadic replicateM
:
xs :: Vector n_t (Reg (Maybe t)) <- replicateM (mkReg Invalid)
Lambdas, lists and fold-like functions over vectors for our done
predicate:
(List.all (\i -> ((xs !! i)._read <= (xs !! (i+1))._read))
(List.upto 0 (n - 2)))
Rules as first class objects: a function to generate a swap rule:
gen_swap_rule :: Integer -> Rules
gen_swap_rule i = let
xs_i = xs !! i
xs_i_plus_1 = xs !! (i+1)
in
rules
"rl_swap_i": when (xs_i > xs_i_plus_1)
Rules as first class objects: adding a generated list of rules to a module’s collection:
addRules_list_descending_urgency (List.map gen_swap_rule (List.upto 0 (n - 2)))
The descending_urgency attribute directs bsc to give the rules a descending priority order.
Our examples often use certain stylistic conventions:
-
_IFC
at the end of interface names -
mk
at the front of module names -
rg_
at the front of register names -
rl_
at the front of rule names
Later examples will also use:
-
mv_
,ma_
,mav_
in front of names of value, Action and ActionValue methods, respectively -
fv_
,fa_
,fav_
in front of names of value, Action and ActionValue functions, respectively
All these conventions are purely this author’s chosen style to improve readability, and are not required by the syntax.
-
How closely-coupled (“tight”) is the accelerator to the CPU pipeline?
-
How often is the hand-off from the “main” computation (executed by code on the CPU) to the “accelerated” computation and back?
-
How much work is done by the accelerator for each invocation between handoffs?
-
-
How is data shared between the “main” computation to the “accelerated” computation?
-
Does the accelerator need to access CPU registers and/or to memory?
-
Is memory access shared with CPU caches?
-
Is such sharing cache-coherent?
-
Does the accelerator use virtual or physical addresses?
-
How is the accelerator’s access to memory made secure (only permitted accesses)?
-
All transactions are memory reads or writes, through the AXI4 system interconnect. System interconnects are also known as “system buses” even though for speed reasons modern system interconnects are typically no longer “buses” (a shared set of wires), but actually a packet-switched network.
Note
|
AXI4 is an open interconnect standard defined by ARM ([AXISpec]). |
-
Prequel: the CPU prepares a vector in memory to be sorted.
-
The CPU “configures” the accelerator by performing a series of “writes” to the accelerator through the system interconnect. In our example, this includes:
-
The starting address of vector A of 64-bit words (input and output).
-
The starting address of the same-sized vector B (work area).
-
The size of the vectors (number of 64-bit words).
-
A “control” write that tells the accelerator to begin sorting.
-
-
The accelerator performs reads and writes to memory via the system interconnect, and implements the sorting algorithm.
-
While the accelerator works, the CPU continually “polls” the accelerator for an indication of completion, by doing a “status” read.
-
Sometimes instead of polling, the accelerator may “interrupt” the CPU to indicate completion; we will not be showing that here.
-
-
Sequel: the CPU computes with sorted vector.
The BH/BSV source codes for the Flute CPU and the SoC (System-on-a-chip) into which we’ll attach our accelerator should be cloned from GitHub:
git clone https://github.com/bluespec/Flute
The source code for our Mergesort example is in two files of <300 lines each:
ICFP2020_Bluespec_Tutorial/Examples/Eg040_Mergesort/src/AXI4_Accel.bs
ICFP2020_Bluespec_Tutorial/Examples/Eg040_Mergesort/src/Merge_Engine.bs
Please feel free to peruse the Flute and WindSoC code at your leisure. It is written in BSV, not BH, but if you squint appropriately, it has the same underlying syntactic structure as BH. The two major parts of the repo are:
-
$(FLUTE)$/src_Core/
: RISC-V ISA definitions, Flute CPU, caches and MMUs, PLIC (Platform Level Interrupt Controller), Near_Mem_IO (memory-mapped timer, timer-interrupt and software-interrupt generator), Debug Module (for remote GDB control over a RISC-V executable), and Tandem-Verification trace encoder. -
$(FLUTE)$/src_Testbench
: The SoC (including Boot ROM, memory controller, UART and AXI4 interconnection fabrics) and a top-level wrapper for simulation.
The repo also has build directories for several variants (32-bit and 64-bit RISC-V, with and without Virtual Memory, etc.), precompiled ISA tests, scripts to run everything, etc.
-
Note that each “merge” 's input segment is already sorted, and its output segment is also sorted.
-
Each pass reads from an input array and writes to an output array.
-
Opportunities for concurrency:
-
Each “merge” 's memory requests can be pipelined, i.e., the path from the merge hardware to memory (requests) and back (responses). We can launch memory read-requests as fast as we can generate them, consume read-responses as they arrive to perform the merge, and stream the final write-requests out from the merge.
-
Multiple instances of “merge” can be done concurrently.
-
They are completely independent of each other.
-
-
If we use multiple arrays, multiple “passes” can be done concurrently, provided that a pass can “wait” for its input data that is being produced by an earlier pass. We could use an “IVar” (I-Structure Variable) for this.
-
-
In our example, we’ll only use opportunity (1).
-
Opportunity (2) is usable if memory has sufficient bandwidth to handle so many concurrent reads and writes.
-
Opportunity (3) will take a lot of resources (memory) for the data. It also has tricky synchronization, which “IVars” (I-Structure variables) would neatly solve, but this adds further hardware cost and complexity.
-
-
If we complete one pass before starting the next, we avoid the IVar synchronization issue. Further, we can alternate between two arrays A and B for input and output. (We may need a final copy if we don’t end with the data in the desired output array).
-
Note that each “merge” 's input segment is already sorted, and its output segment is also sorted.
-
In the code, we will see registers and wires with the same names as in the figure.
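The pass structure described above (alternating between arrays A and B, doubling the sorted-segment width on each pass) can be sketched in Python; the accelerator performs the same algorithm with AXI4 memory reads and writes in place of list indexing:

```python
def merge(src, lo, mid, hi, dst):
    """Merge the sorted segments src[lo:mid] and src[mid:hi] into dst[lo:hi]."""
    i, j = lo, mid
    for k in range(lo, hi):
        if i < mid and (j >= hi or src[i] <= src[j]):
            dst[k] = src[i]; i += 1
        else:
            dst[k] = src[j]; j += 1

def mergesort_passes(values):
    a, b = list(values), [0] * len(values)   # arrays A (input/output) and B (work area)
    width = 1                                # sorted-segment size doubles each pass
    while width < len(a):
        for lo in range(0, len(a), 2 * width):        # one "merge" per segment pair
            mid = min(lo + width, len(a))
            hi = min(lo + 2 * width, len(a))
            merge(a, lo, mid, hi, b)
        a, b = b, a                                   # alternate arrays between passes
        width *= 2
    return a
```

Note that a final copy may be needed if the data does not end up in the desired array; this sketch simply returns whichever array holds the result.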
Please examine the file: Examples/Eg040_Mergesort/src/Merge_Engine.bs
.
-
The functions fv_mkMemReadReq and fv_mkMemWriteReq create AXI4 memory-request structs for the given address and data. The details can be ignored here.-
The id argument is a tag (0 or 1) that we attach on concurrent requests for data from the left and right segments of merge_engine inputs, so that we can disambiguate the responses (since responses arrive on the same AXI4 interface).
-
Other than an init
interface to initialize the module, the “work” methods are:
interface Merge_Engine_IFC =
init :: Action
start :: ... <args> ... -> Action
done :: Bool
initiator_ifc :: AXI4_Master_IFC Wd_Id Wd_Addr Wd_Data Wd_User
-
Each time we want to do a “merge” step, we invoke the
start
method giving it all the information it needs about location and size of its two input and one output segments. Later,done
will become true when it has finished merging the segments. -
While it works, it issues memory reads and writes via initiator_ifc. The numeric type parameters of AXI4_Master_IFC specify standard AXI4 parameters: bit widths of address, data, id and user fields carried in AXI4 requests and responses.
The [AXISpec] from ARM that defines the AXI4 protocol has much
detail about wires, signalling etc. To simplify our lives, we define
some library components called “transactors” (a.k.a. adapter, or
shim) that take care of all the signalling details and present us,
instead, with simple FIFO enqueue/dequeue interfaces for the different
AXI4 channels. We instantiate such a component in mkMerge_Engine:
initiator_xactor :: AXI4_Master_Xactor_IFC Wd_Id Wd_Addr Wd_Data Wd_User
<- mkAXI4_Master_Xactor;
The source for this transactor, and other related Bluespec BH/BSV
definitions of AXI4 types, interfaces, interconnects, etc. can be
found in the Flute repository at: src_Testbench/Fabrics/AXI4/
- Rules rl_req0 and rl_req1 each traverse one input segment (the left and right inputs of “merge”), generating and launching a read request to memory for each word. They tag their requests with a tid (transaction id) of 0 and 1, respectively, so that we can disambiguate responses.
- Rules rl_rsp0 and rl_rsp1 collect responses and place the data into FIFOs f_data0 and f_data1, respectively. They use the tid to select the target FIFO.
- Note: we assume that responses arrive in the order they were requested, so each input segment’s data will arrive in its FIFO in sorted order.
- Rule rl_merge performs the merge, always looking at the heads of FIFOs f_data0 and f_data1, passing through the smaller value and sending it in a write-request to memory.
- Finally, rule rl_drain_write_rsps collects write-responses and discards them.
- Read- and write-requests and responses all flow on different “channels” in the AXI4 interconnect, so reads and writes do not need disambiguation.
- By waiting for write responses, we can ensure that we don’t start the first “merge” of the next pass (which reads memory) before completion of the last “merge” of the current pass (which writes memory).
Deadlock scenario: suppose f_data0 has become empty and f_data1 has become full, and suppose the next response from memory (FIFO “Memory responses” in the diagram) is for f_data1; let’s call it X.

- Then, any data meant for f_data0 that is “behind” X is stuck there, since we cannot move X into the full f_data1.
Solution: This is a classic “head-of-line blocking” scenario, and the solution is standard:

- Fix a number max_n_reqs_in_flight that limits the number of read requests that can be outstanding for either segment.
- Ensure that rules rl_req0 and rl_req1 cannot exceed this number of requests:
    - Give them that many “credits” at the beginning.
    - Decrement the credit on each request, and don’t issue any request when the credit is zero.
- As items are consumed from f_data0 and f_data1, restore those credits.
- Ensure that FIFOs f_data0 and f_data1 have enough capacity to hold that many responses.
Conceptually these credit counters are just registers that are decremented and incremented to consume and replenish credits, respectively, and can be instantiated as usual using mkRegU. In the code (file Merge_Engine.bs), however, we use “Concurrent Registers” for this, instantiated with mkCRegU. This is a performance optimization, about which we say more in Section [ConcurrentRegs].
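As a sanity check on the flow-control argument, the credit scheme can be modeled in ordinary software. The following Python sketch is purely illustrative (the function name, MAX_IN_FLIGHT constant, and data structures are ours, not from Merge_Engine.bs): each input stream may have at most MAX_IN_FLIGHT outstanding read requests, which guarantees its response FIFO can always accept an arriving response, so no response ever blocks the shared response channel.

```python
from collections import deque

MAX_IN_FLIGHT = 4   # models max_n_reqs_in_flight (value chosen arbitrarily)

def merge_with_credits(seg0, seg1):
    credits = [MAX_IN_FLIGHT, MAX_IN_FLIGHT]
    pending = deque()               # models the shared memory-response channel
    f_data = [deque(), deque()]     # models f_data0 / f_data1
    idx = [0, 0]
    out = []
    while len(out) < len(seg0) + len(seg1):
        # rl_req0 / rl_req1: issue a read request only if a credit is available
        for tid, seg in ((0, seg0), (1, seg1)):
            if idx[tid] < len(seg) and credits[tid] > 0:
                pending.append((tid, seg[idx[tid]]))   # launch read request
                idx[tid] += 1
                credits[tid] -= 1                      # consume a credit
        # rl_rsp0 / rl_rsp1: responses always fit, since credits bound them
        while pending:
            tid, val = pending.popleft()
            f_data[tid].append(val)
        # rl_merge: pass through the smaller head; restore a credit on consume
        def take(tid):
            credits[tid] += 1                          # replenish a credit
            return f_data[tid].popleft()
        if f_data[0] and f_data[1]:
            out.append(take(0) if f_data[0][0] <= f_data[1][0] else take(1))
        elif f_data[0] and idx[1] == len(seg1):        # right segment drained
            out.append(take(0))
        elif f_data[1] and idx[0] == len(seg0):        # left segment drained
            out.append(take(1))
    return out

print(merge_with_credits([1, 3, 5], [2, 4, 6]))   # [1, 2, 3, 4, 5, 6]
```

Because a credit is only restored when rl_merge consumes an item, the occupancy of each FIFO (plus its in-flight responses) can never exceed MAX_IN_FLIGHT, which is exactly the invariant that breaks the head-of-line-blocking deadlock.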
On the system interconnect, our accelerator is both:

- a target/server (for configuration and status: it receives read/write requests from the CPU), and
- an initiator/client (for work: it sends read/write requests to memory),

and this is visualized in this diagram:
The interface code is specified in the Flute repository at src_Testbench/SoC/AXI4_Accel_IFC. It’s in BSV (not BH) syntax:1
interface AXI4_Accel_IFC;
method Action init (Bit #(Wd_Id) axi4_id, Bit #(Wd_Addr) addr_base, Bit #(Wd_Addr) addr_lim);
interface AXI4_Slave_IFC #(Wd_Id, Wd_Addr, Wd_Data, Wd_User) slave;
method Bool interrupt_req;
interface AXI4_Master_IFC #(Wd_Id, Wd_Addr, Wd_Data, Wd_User) master;
endinterface
but we can squint at it through BH-colored glasses:
interface AXI4_Accel_IFC =
init :: (Bit Wd_Id) (Bit Wd_Addr) (Bit Wd_Addr) -> Action
slave :: AXI4_Slave_IFC Wd_Id Wd_Addr Wd_Data Wd_User
interrupt_req :: Bool
master :: AXI4_Master_IFC Wd_Id Wd_Addr Wd_Data Wd_User
1 Apologies for the unpleasant “master/slave” terminology; it was used widely in hardware design and ARM used it in their AXI4 spec ([AXISpec], 2013). For new designs we suggest using “client” or “initiator” instead of “master” and “server” or “target” instead of “slave”.
As in Merge_Engine
, we instantiate an AXI4 slave transactor through
which we receive AXI4 requests and send responses, for configuration.
The configuration itself is stored in a vector of 4 registers:
target_xactor :: AXI4_Slave_Xactor_IFC Wd_Id Wd_Addr Wd_Data Wd_User <- mkAXI4_Slave_Xactor
v_csr :: Vector N_CSRs (Reg Fabric_Addr) <- replicateM (mkReg 0)
Rules rl_handle_config_read_req
and rl_handle_config_write_req
handle AXI4 requests from the CPU for initializing the configuration
registers, receiving the “Go!” command from the CPU, and returning
completion status to the CPU.
We instantiate a merge-engine, and a register to hold the current “span” under consideration (which starts at 1 and doubles on each pass until it encompasses the whole sort array):
merge_engine :: Merge_Engine_IFC <- mkMerge_Engine
rg_span :: Reg Fabric_Addr <- mkRegU
The next set of rules encodes a state machine described by the following pseudo-code:

L0: start when we’ve received “Go!” from the CPU
L1: for span = 1, 2, 4, … until span >= n:
L2:     initialize the pass with start-index i = 0
L3:     while i < size of array:
L4:         do a merge (merge_engine.start) and increment i to the next span
L5:     double the span and exchange the array roles
L6: if the final data is not in the original array, copy it
L7: wait for all merges to complete, and signal completion to the CPU
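The control flow above can be sketched in software. The following Python model is ours, for illustration only: it mirrors the pseudo-code (doubling spans, alternating array roles), not the BSV implementation, and collapses each merge-engine invocation into a plain function call.

```python
def merge(src, dst, lo, mid, hi):
    # Merge the sorted segments src[lo:mid] and src[mid:hi] into dst[lo:hi]
    i, j = lo, mid
    for k in range(lo, hi):
        if i < mid and (j >= hi or src[i] <= src[j]):
            dst[k] = src[i]; i += 1
        else:
            dst[k] = src[j]; j += 1

def accel_mergesort(A):
    n = len(A)
    B = [0] * n                       # the second array (roles alternate)
    span = 1                          # L1: span = 1, 2, 4, ...
    while span < n:
        i = 0                         # L2: start-index i = 0
        while i < n:                  # L3: while i < size of array
            mid = min(i + span, n)
            hi  = min(i + 2 * span, n)
            merge(A, B, i, mid, hi)   # L4: one merge-engine invocation
            i += 2 * span
        A, B = B, A                   # L5: exchange the array roles ...
        span *= 2                     #     ... and double the span
    return A                          # L6: (copy back if roles ended swapped)

print(accel_mergesort([5, 3, 8, 1, 9, 2, 7, 4, 6]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each iteration of the outer loop is one “pass”; the hardware’s L7 step (waiting for outstanding merges and write responses) has no analogue here because the function calls are synchronous.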
First, in Eg040_Mergesort/Makefile
, there is a line:
REPO ?= $(FLUTE)
Please define an environment variable FLUTE
to point at your clone
of the Flute repo, or edit the Makefile
line to point at it.
Then, compile and link it (but do not yet simulate it):
$ pwd
... ICFP2020_Bluespec_Tutorial/Examples/Eg040_Mergesort
$ make all
This will use bsc
to compile all the code (our two local source
files, and a very large number of source files from the Flute/WindSoc
repository) and link it into a Bluesim executable.
Before we run the simulation, we need some RISC-V binaries for Flute to execute.
We are going to execute (pre-compiled versions of) two C programs:
$ pwd
... ICFP2020_Bluespec_Tutorial/Examples/Resources/C_programs
$ ls hello/
hello* hello.c hello.map hello_Mem.hex hello.text Makefile symbol_table.txt
$ ls mergesort/
mergesort* mergesort.map mergesort.text
mergesort.c mergesort_Mem.hex symbol_table.txt
We have provided pre-compiled RISC-V binaries; this diagram illustrates how they were created, in case you wish to recreate them yourself (later), or repeat the flow with new or modified C programs.
- Source files are compiled using gcc targeting RISC-V machine code (instead of x86). “RISC-V International” is the global non-profit that steers RISC-V; you can find links to RISC-V tools (including gcc) on their web site (https://riscv.org/).
- When running gcc, it uses some “bare metal” libraries in C_programs/lib.
- From hello.c, gcc produces hello (an ELF executable), hello.text (a RISC-V assembly-language listing), and hello.map (a linker map).
- ELF files are usually unpacked and loaded by an operating system (but we don’t have one, since we’re running “bare metal”).
- In hardware engineering it is traditional to initialize memory contents with a “memory hex dump”: a listing of memory contents in ASCII hexadecimal text.
- In the tutorial repo, we provide an elf_to_hex program in Resources/elf_to_hex.
- This produces, for example, hello_Mem.hex and symbol_table.txt.

(This process is also described in more detail in the README in the C_programs/ directory.)
First, please examine: Resources/C_programs/hello/hello.c; it’s just the standard, iconic “Hello World!” program.
Then, run it on the Bluesim simulation we’ve just created:
$ pwd
... ICFP2020_Bluespec_Tutorial/Examples/Eg040_Mergesort
$ make run_hello
cp ../Resources/C_programs/hello/hello_Mem.hex ./Mem.hex
cp ../Resources/C_programs/hello/symbol_table.txt ./symbol_table.txt
./exe_HW_sim +tohost
...
================================================================
Bluespec RISC-V WindSoC simulation v1.2
Copyright (c) 2017-2020 Bluespec, Inc. All Rights Reserved.
================================================================
...
================================================================
CPU: Bluespec RISC-V Flute v3.0 (RV64)
Copyright (c) 2016-2020 Bluespec, Inc. All Rights Reserved.
================================================================
...
Hello World!
...
The printf
in the program is linked by gcc to a library that writes
out the characters to the UART (Universal Asynchronous Receiver
Transmitter) in our SoC, and our simulation model of the UART merely
prints each character to the screen.
Please examine: Resources/C_programs/mergesort/mergesort.c

In main() we have:

run (! accelerated, A, B, n);
...
run (accelerated, A, B, n);

i.e., we do two runs, each sorting an array of words in memory.

- In the first run, we use a C function mergesort () that has been compiled to RISC-V machine code and is executed on the RISC-V CPU.
- In the second run, we use our hardware accelerator. The C code just writes the addresses and size of the arrays to the accelerator, writes the “Go!” command, and then waits, polling, for completion.
- For our test, the input array is in inverse-sorted order; after running a sort, some C code verifies that the output is in sorted order.
accel_0_addr_base [1] = (uint64_t) pA;
accel_0_addr_base [2] = (uint64_t) pB;
accel_0_addr_base [3] = (uint64_t) n;

// "Go!"
accel_0_addr_base [0] = (uint64_t) 1;

// Wait for completion
while (true) {
    uint64_t status = accel_0_addr_base [0];
    if (status == 0) break;
}
You will also see some calls to fence()
. This is to ensure that
data values prepared by the CPU have been flushed out of caches into
memory (so the accelerator can see them) and, after the acceleration,
to ensure that data values updated by the accelerator are reloaded by
the CPU into its caches.1
1 This is a complex subject (“coherence” between memory seen by the CPU and by the accelerator), for which there are many approaches, involving hardware support and software conventions.
$ pwd
... ICFP2020_Bluespec_Tutorial/Examples/Eg040_Mergesort
$ make run_mergesort
cp ../Resources/C_programs/mergesort/mergesort_Mem.hex ./Mem.hex
cp ../Resources/C_programs/mergesort/symbol_table.txt ./symbol_table.txt
./exe_HW_sim +tohost
...
================================================================
Bluespec RISC-V WindSoC simulation v1.2
Copyright (c) 2017-2020 Bluespec, Inc. All Rights Reserved.
================================================================
...
================================================================
CPU: Bluespec RISC-V Flute v3.0 (RV64)
Copyright (c) 2016-2020 Bluespec, Inc. All Rights Reserved.
================================================================
...
Running C function for mergesort
Verified 3000 words sorted
Sorting took 1932331 cycles
Done
Running hardware-accelerated mergesort
...
Verified 3000 words sorted
Sorting took 321869 cycles
Done
We can see that the accelerated sort has about a 6X speedup over the software sort. This is respectable, but not huge, in part because mergesort, even when run in software, is mostly limited by memory-system performance (it does not do much computation on the data read from memory). Flute does not have a sophisticated memory system, and the hardware accelerator has lots of room for improvement (see next section).
Robustness:

- Proper use of fence instructions in the C code
- Support for misaligned data

Functionality and generality:

- Allow “records” (structs) for data, of various sizes, sorting on a key field of any type with Ord.
- Allow scatter/gather arrays (indirection from arrays to records).
- Allow out-of-order responses from memory.

Performance:

- Use AXI4’s “burst” reads and writes to read/write segments, instead of individual-word reads/writes.
- Replicate mkMerge_Engine to do multiple merges in each step concurrently (if memory has the bandwidth to support it).
Note: This section is a bit heavy, and can safely be skipped on a first reading. But it is an important topic, describing aspects of BH/BSV that are essential for performance tuning.
- Although we have been generating clocked hardware in all our examples, we have hardly mentioned clocks in our discussion, and in fact our source code makes no mention of clocks (bsc manages all the details for us). This is the norm when working with BH/BSV: we mostly think in terms of rule steps (all actions of a rule, performed atomically).
- The bsc compiler translates BH/BSV into clocked digital hardware (expressed in Verilog). Every rule is mapped into hardware that executes in a single clock1, and bsc tries to have as many rules as possible firing in each clock, for maximum performance.

In this section:

- We provide some insight into how bsc allows multiple rules to fire in a clock while preserving atomicity.
- We describe another primitive, Concurrent Registers (CRegs), that permits higher degrees of concurrency (more rules per clock) than ordinary registers. (CRegs were used for the “credit counters” in the Merge Engine of our Mergesort example.)
1 There is nothing in BH/BSV rule semantics that demands that a rule must fire within one clock; it’s a choice by the compiler. Alternative compilation strategies can change that; the only sacrosanct property is that a rule’s actions must appear to be atomic with respect to all other rules.
Rule semantics are explained in two parts:

- Individual rule: logically, a rule fires in a single instant. All the actions in the rule (and in any methods it invokes) happen at the same instant. We call this the parallelism of actions.
- Multiple rules within a clock: logically, they execute in sequence, so the overall state change in a clock can be understood as the simple sequential composition of the state changes of the individual rules. We call this the concurrency of rules.

This is visualized in the following picture:
Suppose we had a rule like this:

when ((y /= 0) && got_x && got_y) ==> do
    if1 (lsb(y) == 1) w := w + x
    x := x << 1
    y := y >> 1
There are three actions in the rule. All are governed by the rule
condition, but one of them (assignment to w
) is further governed by
the predicate of the if
. The hardware corresponding to this can be
visualized as in this picture:
It consists of combinational circuits (pure functions: acyclic graphs built from primitive gates) originating at value methods of certain primitives (here, register reads), and feeding the data inputs and “enable” inputs (EN/ENA/ENABLE) of state elements (here, registers). These EN and data inputs of primitives correspond to their Action methods.
When a state element’s EN is asserted at a clock edge, it captures (or, consumes) the value on its data input, at the instant of the clock edge. Since all the state elements are driven by the same clock, these actions all happen at the same instant.
In the picture, if the rule condition (CAN_FIRE
) is false, none of
the actions will be performed. If true, the x
and y
writes will
occur. If the if
condition is also true, the w
write will occur.
Execution of a rule is: (simultaneously) for each of the rule’s actions whose enabling condition is true, perform the action.
Summary:

- There is no ordering among the actions of a rule; they all happen at the same instant.
- Not all actions may be performed (due to internal conditions in the rule).
- Any values “read” by the rule are from “before” the instant of execution.
- Any values affected by the rule are visible only “after” the instant.
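To make these read-before-write semantics concrete, here is a small Python model, for illustration only, of repeatedly firing the shift-and-add rule shown above (we omit the got_x/got_y flags for brevity). All reads see the state from “before” the instant, and all writes become visible together “after” it.

```python
def rule_step(state):
    # Read old values first: values "read" are from before the instant
    x, y, w = state["x"], state["y"], state["w"]
    if y == 0:
        return state, False          # rule condition false: no actions fire
    new = dict(state)
    if y & 1:                        # the guard on the w-assignment
        new["w"] = w + x
    new["x"] = x << 1                # these actions use the OLD x and y,
    new["y"] = y >> 1                # never values written in this same step
    return new, True                 # all writes become visible together

# Repeatedly firing the rule computes w = x * y (binary multiplication):
state = {"x": 6, "y": 5, "w": 0}
fired = True
while fired:
    state, fired = rule_step(state)
print(state["w"])    # 6 * 5 = 30
```

Note that the update of w reads the old x even though x is also written in the same step; that is exactly the “no ordering among a rule’s actions” property in the summary above.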
Define a schedule as some linear ordering of all the rules in a program: r1, r2, …, rN. (For now, consider any order; we defer the question of how to pick a good order.)

On each clock, consider each rule in schedule order:
    if its rule-condition is true,
    and it does not conflict with rules already executed in this clock,
    then execute this rule.

We will explain “conflict” in a moment.
Consider two rules in schedule order that may invoke methods mA and mB of a common module x. These method invocations may be in a rule’s condition or in its body.
For every module x, each pair of its methods has one of the following ordering constraints:
Constraint              Meaning
mA conflict-free mB     Rules invoking mA and mB can fire in the same clock, in either order
mA < mB                 A rule invoking mA can fire before a rule invoking mB
mA > mB                 A rule invoking mB can fire before a rule invoking mA
mA conflict mB          Rules invoking mA and mB cannot fire in the same clock (in either order)
- For primitive modules, these constraints are built-in (“given”, “axiomatic”).
- For user-defined modules, they are inferred by the compiler from the module implementation.

For standard primitive registers (which we get using mkReg and mkRegU) the given method-ordering constraint is: _read < _write

- One can see how this models a real hardware register: during a clock, we can only read the old value (from the last clock edge), and when we write, the new value is visible only after the next clock edge.
These constraints may result in a conflict for some rule schedules: we say that there is a conflict between an ordered pair of rules r1 and r2 if there is an ordering conflict between any pair of methods x.m1 invoked in r1 and x.m2 invoked in r2 (on the same module instance x).
Suppose the compiler has chosen the following schedule
and suppose there is a conflict between r3 and r5, i.e., they each invoke methods on some common module instance x where the method-ordering constraints are not consistent with the schedule order.
The compiler produces hardware that disables rule r5 whenever r3 fires, i.e., logic that looks like this:
WILL_FIRE_r5 = (! WILL_FIRE_r3) && CAN_FIRE_r5
CAN_FIRE_r5 is r5’s condition seen in isolation: its rule-condition and its method-conditions. WILL_FIRE_r5 indicates whether the rule will actually fire on this clock.
- When compiled, a BH/BSV program has a statically determined schedule of rules (a linear ordering).
- On every clock, conceptually, rules are considered in schedule order.
- A rule WILL_FIRE if its own condition (CAN_FIRE) is true (rule-condition plus method conditions) and if firing the rule would not violate a method-ordering constraint with any rule already executed in this clock.
- When a rule fires, all its actions are performed at the same instant.
- The overall state change in a clock is the sequential composition of the state changes due to the individual rules fired in that clock, in schedule order.
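The per-clock discipline summarized above can be modeled in a few lines of Python. This is an illustrative sketch of the scheduling strategy as described, not bsc’s actual algorithm: rules are considered in a fixed schedule order, and a rule fires only if its CAN_FIRE is true and it does not conflict with a rule that already fired in this clock.

```python
def will_fire(schedule, can_fire, conflicts):
    """schedule:  list of rule names, in schedule order
       can_fire:  dict rule -> bool (rule-condition + method conditions)
       conflicts: set of frozensets {rA, rB} of rules that cannot share a clock
       Returns the rules that actually fire on this clock."""
    fired = []
    for r in schedule:
        ok = can_fire[r] and all(
            frozenset((r, f)) not in conflicts for f in fired)
        if ok:
            fired.append(r)
    return fired

# Hypothetical example: r3 and r5 conflict, as in the WILL_FIRE_r5
# equation above, so r5 is suppressed whenever r3 fires.
schedule  = ["r1", "r2", "r3", "r4", "r5"]
can_fire  = {"r1": True, "r2": False, "r3": True, "r4": True, "r5": True}
conflicts = {frozenset(("r3", "r5"))}
print(will_fire(schedule, can_fire, conflicts))   # ['r1', 'r3', 'r4']
```

Note that if r5 preceded r3 in the schedule, r5 would fire and r3 would be suppressed instead: which rule loses is determined by the chosen schedule order.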
Note: This only explains the strategy used by our particular compiler (bsc) to achieve the high-level goal that rules must fire atomically. Other strategies for executing BH/BSV are possible, including strategies that don’t involve clocks (asynchronous logic), etc.

Note: The semantics works for any schedule (linear ordering). The bsc compiler implements some sophisticated analyses to produce a “good” order, with “goodness” measured by more concurrency. Other objectives, such as energy efficiency, may result in different choices.
From the previous discussion, it should be clear that if two rules rA
and rB invoke the _read
and _write
methods of a standard register,
respectively, then they can fire concurrently only if rA is earlier in
the schedule than rB.
But consider the “credit” counter we saw in the Merge Engine of our Mergesort example:

- When a memory request is launched, we decrement the credit counter (use a credit).
- When a memory response is consumed, we increment the credit counter (replenish a credit).

Both rules read and write the register, and so these two rules can never fire concurrently (either ordering would have a _write before a _read). The rules can only fire on different clocks.
A CReg is another register-like primitive. It provides, as its interface, an array of Reg interfaces (let us call each a “port”), with the following ordering constraints:
ports[0]._read ⇐ ports[0]._write
    < ports[1]._read ⇐ ports[1]._write
    < ports[2]._read ⇐ ports[2]._write
    < …
    < ports[n-1]._read ⇐ ports[n-1]._write
At each index j, read and write are ordered like an ordinary register. What is interesting, and different from a standard register, is that a read on a later port can be scheduled after a write on an earlier port: a value written on one port is available and communicated to reads on higher-numbered ports within the same clock.
Returning to our credit counter problem, we can schedule the
incrementing and decrementing rules concurrently by using different
ports of a CReg (the choice of ports will determine which rule goes
first). This is indeed the case in the example code in module
mkMerge_Engine
in file Merge_Engine.bs
.
Here is a hardware implementation of the CReg primitive.
It contains a standard register with a series of muxes (multiplexers). At index 0, the read-value is the standard register value. At index 1, the read-value is muxed from the index-0 write (if it occurs) and the previous read-value, and so on.
The neat property of a CReg is that, when thinking at the abstract level of rules (no clocks, just rule steps), it is equivalent to a standard register (just ignore the port indexes). Thus, CRegs are an excellent formal bridge between the two levels of abstraction: rules by themselves, and rules-mapped-to-clocks.
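The mux chain just described can be modeled in software. The following Python class is an illustrative sketch of a CReg’s within-clock behavior (our own model, not the bsc primitive): a read on port j sees any write made on a lower-numbered port in the same clock, while the underlying register updates only at the clock edge.

```python
class CReg:
    def __init__(self, n_ports, init=0):
        self.reg = init                  # the one underlying register
        self.writes = [None] * n_ports   # per-port writes in this clock

    def read(self, port):
        # Muxed value: latest write on a port strictly below 'port',
        # else the register value from the last clock edge
        val = self.reg
        for w in self.writes[:port]:
            if w is not None:
                val = w
        return val

    def write(self, port, val):
        self.writes[port] = val

    def clock_edge(self):
        # At the edge, the write on the highest-numbered port wins
        for w in self.writes:
            if w is not None:
                self.reg = w
        self.writes = [None] * len(self.writes)

# Credit counter: one rule decrements on port 0, another increments on
# port 1, both in the same clock -- impossible with a standard register.
credits = CReg(2, init=4)
credits.write(0, credits.read(0) - 1)    # consume a credit   (port 0)
credits.write(1, credits.read(1) + 1)    # replenish a credit (port 1)
credits.clock_edge()
print(credits.reg)    # 4 - 1 + 1 = 4
```

Ignoring the port indexes, each read/write pair behaves exactly like a register accessed in two consecutive rule steps, which is the “formal bridge” property mentioned above.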
In general, the process of designing a system can be cleanly separated into two phases:

- Do the design thinking purely in terms of rules, using standard registers, and establish the correctness of the design.
- Selectively and judiciously replace standard registers with CRegs to tune performance up to the required goals.
Many non-trivial hardware systems involve multiple clocks: certain parts of the circuit may run at higher clock speeds and other parts at slower clock speeds. The motivations may include balancing energy consumption (higher clock speeds burn more energy), adhering to externally mandated clocking standards (e.g., for standard interfaces like USB or PCIe), etc.
Usually a system has a clean “partitioning” into discrete “clock domains” each clocked by a single clock. This partitioning is orthogonal to the module hierarchy, i.e., clock domains may cut across a module hierarchy; a module may contain sections on different clocks.
So called “clock-domain discipline” ensures that communication from one clock-domain to another is only done through certain carefully engineered domain-crossing primitives called “synchronizers”.
In BH/BSV, Clocks and Resets are first-class types, and can be made explicitly visible as interfaces and values, and “plumbed” through the module hierarchy using the usual function and module composition mechanisms. Further, clock-domain discipline is enforced by strong static checking, guaranteeing that domain crossings always go through synchronizers.
The quality of a hardware system is measured in several ways:

- Silicon level:
    - Area (gates, mm², LUTs, RAMs, DSPs, …)
    - Clock speed (MHz)
    - Energy and energy efficiency (Joules, Watts)
- Micro-architecture level:
    - Latency: how long does it take to process a datum?
    - Throughput: how many data per unit time (due to concurrency: pipelining, replication, …)?
- Application level:
    - Wall-clock time
- Accelerators:
    - Speedup
    - Overhead of hand-offs from the main CPU to the accelerator and back (cf. Amdahl’s Law)
Historically, in the HW design community, “Verification” has meant “Testing”
-
Although, in the ASIC world, testing has historically been much more rigorous than in the SW world, given that the cost of an error can be in the millions of dollars.
-
-
Formal verification tools for SystemVerilog have recently become available, and now being used for small-subystem verification.
-
BH/BSV’s strong (Haskell-ish) type-checking itself provides a certain degree of formal verification not previously available in HDLs.
-
BH/BSV’s composable Guarded Atomic Actions opens the door for an even stronger role for formal verication.
-
Atomic transactions are the most powerful construct available for reasoning about concurrent behaviors
-
In Section “HLHDLs vs. HLS: precise specification of micro-architecture” we have already compared HDLs against HLS (see [GajskiHLS]): synthesis of hardware from an algorithm specified in C/C++.
Both Chisel (see [Chisel]) and BSV significantly raise the level of abstraction and expressive power of HDLs compared to Verilog (see [IEEEVerilog2005a]), SystemVerilog (see [IEEESystemVerilog2012a]) and VHDL (see [IEEEVHDL2002]). Chisel achieves this by being an embedded DSL inside Scala, and is thus able to use Scala’s expressive power to express abstractions and composition.
There have been a few attempts at using Haskell directly for hardware design:

- “Lava” (see [Lava]) is an embedded DSL in Haskell, using Haskell to express powerful, high-level combinators for structural composition of clocked digital circuits.
- Clash (see https://clash-lang.org/) borrows notation and types from Haskell, but is not an embedded DSL; it has its own compiler. The Haskell-ish notation is used directly to express combinational circuits, with Signal datatype arguments/results indicating sequential (stateful) circuits.
In all the above approaches, clocked digital logic is front and center; it is the central model of behavior, and they are all directly describing clocked digital circuits. To our knowledge, BH/BSV is unique in having atomic rules as its central behavioral model, adjoined with a methodology of how these are mapped to clocked digital circuits.
[AXISpec] AMBA AXI and ACE Protocol Specification: AXI3, AXI4, and AXI4-Lite; ACE and ACE-Lite, Document IHI 0022E (ID022613), ARM, 2013.
[Chisel] Chisel: constructing hardware in a Scala embedded language, Jonathan Bachrach et al., in Proceedings of the 49th Annual Design Automation Conference (DAC 2012), San Francisco, CA, USA, June 2012, pp. 1216-1225.
[] Bluespec SystemVerilog and Bluespec Development Workstations User Guide, Bluespec, Inc., July 21, 2017, 118pp.
[GajskiHLS] High-level synthesis: introduction to chip and system design, D.D. Gajski, Kluwer Academic, Boston, 1992.
[IEEESystemVerilog2012a] IEEE Standard for SystemVerilog---Unified Hardware Design, Specification and Verification Language, IEEE, IEEE Std 1800-2012, February 21, 2013.
[] IEEE Standard for Standard SystemC Language Reference Manual, IEEE, IEEE Std 1666-2011, January 9, 2012.
[] Abstraction in Hardware System Design, Rishiyur S. Nikhil, in Communications of the ACM, 54:10, October 2011, pp. 36-44.
[] BSV by Example, Rishiyur S. Nikhil and Kathy R. Czeck, December 2010, 302pp., ISBN-10: 1456418467; ISBN-13: 978-1456418465; available on Amazon.
[] Haskell 98 Language and Libraries: The Revised Report, Simon L. Peyton Jones (Editor), Cambridge University Press, May 5, 2003, 272pp., haskell.org.
[Lava] Lava: Hardware Design in Haskell, Per Bjesse, Koen Claessen, Mary Sheeran and Satnam Singh, in Proc. ACM Intl. Conf. on Functional Programming (ICFP), 1998.