Skip to content

Latest commit

 

History

History
1997 lines (1655 loc) · 50.7 KB

README.md

File metadata and controls

1997 lines (1655 loc) · 50.7 KB

infix

This sample was designed as a tutorial in how to use the basics of Trieste. As a multi-pass term-rewriting system, Trieste is ultimately about performing transforms on trees. Let's start by defining some terms:

Name Description
Term A subtree
Node A node in the tree. Can have 0 to N children. Has one parent.
Match Object used for grabbing nodes from a tree pattern match
Location A string. This can either be a standalone string, or a view into a long string (i.e. an input file)
Token Something in the program. This can map to something in an input file (e.g. Int, Float, String), a logical grouping term (Operator, Group), or any other tree node which is of use to the system.
Well Formed Trieste provides a way to describe, using tokens, what the structure of a tree should be. A tree which adheres to this structure is considered "Well Formed"
Pass A Pass is a stage of the system. It is bounded by two Well Formed definitions, one indicating what trees should look like on input to the pass, and another that describes how the pass has changed the tree. It is important to mention that the tree will change multiple times during Pass execution.

Before we dive into how Trieste works, it may be worth describing what it is for. Trieste is a preparation tool for a compiler. By performing multiple passes over the input, Trieste is able to iteratively refine the AST until it takes the form that is best suited to compilation. What starts as a formless set of tokens provided by the parser eventually turns into a clean and easy to process tree. It allows us to go from this:

x = 5 + 10;
y = 1 - 9 + x;
print "1" x + y;
z = (5 * x) / y;
print "2" z;

to this:

flowchart TD
  A[Top]-->B[Calculation]
  B-->C[Output]
  C-->D["string #quot;1#quot;"]
  C-->E[Int 22]
  B-->F[Output]
  F-->G["string #quot;2#quot;"]
  F-->H[Int 10]
Loading

Over the course of this tutorial, we will show how we can use Trieste to transform the input program iteratively to the final AST shown.

The infix Language

In order to keep the focus on Trieste as a system we will work with a very simple language for performing and outputting mathematics using infix operators. All variables are constants (i.e. their values do not change) and the only action the program can take is to print a result to the console.

The language is defined by the following grammar (we assume standard syntax for $string$, $number$ and $variable$):

$$ \begin{align} Calc &\rightarrow Assign \ |\ Output\ |\ Assign\ Calc\ |\ Output\ Calc\\ Assign &\rightarrow variable\ \texttt{=}\ Expr \ ";" \\ Output &\rightarrow \texttt{print}\ string \ Expr \ ";"\\ Expr &\rightarrow Expr\ Op\ Expr\ |\ Operand\ |\ Open\ Expr\ Close\\ Operand &\rightarrow number\ |\ variable\\ Op &\rightarrow \texttt{+}\ |\ \texttt{-}\ |\ \texttt{*}\ |\ \texttt{/}\\ Open &\rightarrow \texttt{(}\\ Close &\rightarrow \texttt{)}\\ \end{align} $$

which results in programs that look like this:

x = 5;
print "x" x;
y = 2 - 1;
print "1 + 10" 1 + 10;
x = 1 + 2 * 3 + 5.3 - 4 - 2 / 0.1;
y = 3.2 * x + 5;
print "x" x;
print "y" y;
x = 5 + 10;
y = 1 - 9 + x;
print "1" x + y;
z = (5 * x) / y;
print "2" z;

Parser

In order to implement the infix language in Trieste, we need to begin by implementing a Parse object. The AST produced by this parser will then be the input to a series of rewrite passes, which will culminate in an AST for the program. We want our parser to convert the raw byte stream to semantically meaningful tokens with a minimum of syntax. We will rely on later passes to interpret those tokens robustly. You can see an (advanced) example of this principle at work in the YAML Parser, which encodes whitespace as whitespace tokens but does not attempt to interpret its meaning.

We begin by constructing the Parse object like so:

Parse p(depth::file, wf_parser);

With the first argument we specify the level at which the parser will run:

enum class depth
{
  file,
  directory,
  subdirectories
};

Either at the level of a single file, a directory of files, or a whole directory structure with subdirectories. An important point is that whatever level the parser runs at, it will return a single tree containing the tokens from all files. For infix we'll be operating at the level of a single file, for simplicity. The second argument, wf_parser, is the well-formed definition for the output of the parsing step. We will return to this definition after seeing some example parse trees for the infix language.

We then need to set up the Trieste Rule objects which turn text into tokens. The constructor for a Rule is:

Rule(const std::string& r, std::function<void(Make&)> effect)

r is a regular expression, and effect will be called when a token matches the RegEx. While we can use this constructor, some helpful operator overloading allows for a more compact way of specifying these rules:

p("start", // this indicates the 'mode' these rules are associated with
  {
    // Whitespace between tokens.
    "[[:blank:]]+" >> [](auto&) {}, // no-op

    // Line comment.
    "//[^\n]*" >> [](auto&) {}, // another no-op

    // Int.
    "[[:digit:]]+\\b" >> [](auto& m) { m.add(Int); },
    
    // String. s
    R"("[^"]*")" >> [](auto& m) { m.add(String); },

    // Print.
    "print\\b" >> [](auto& m) { m.add(Print); },

    // Add ('+' is a reserved RegEx character)
    "\\+" >> [](auto& m) { m.add(Add); },

    // Equals.
    "=" >> [](auto& m) { m.seq(Equals); },

    // Terminator.
    ";[\n]*" >> [](auto& m) { m.term({Equals}); },

    // ...
  });

Here, the overloaded >> operator takes a RegEx and an effect as arguments and returns a Rule object. The () operator (called on the Parse object p in the code example above) is overloaded in the Parse class and is called with a string (representing a 'mode') and the list of Rules as arguments which are added to the Parse object. A mode is simply a string, used for controlling the application of the rules. Currently, only the 'start' mode is used.

Tokens

The parser rules match strings from the input program to tokens, which must be defined for the language we are parsing. For our running example, the tokens (e.g. Int, Print) are defined as TokenDef objects:

inline const auto Int = TokenDef("int", flag::print);
inline const auto Print = TokenDef("print");
inline const auto Add = TokenDef("+");
inline const auto Equals = TokenDef("equals");

A TokenDef can be given a flag as a second argument. For example, the flag::print indicates that, when outputting the tree, this token should also output the raw string from the input file.

There are also generic tokens built into Trieste. For example, the generic token Group is used by the parser to group parsed tokens and Top to indicate the root of the parse tree. Some are given below as examples:

  inline const auto Invalid = TokenDef("invalid");
  inline const auto Top = TokenDef("top", flag::symtab);
  inline const auto Group = TokenDef("group");
  inline const auto File = TokenDef("file");
  inline const auto ErrorMsg = TokenDef("errormsg", flag::print);

There are several other flags:

Flag Description
print Print the location when printing an AST node of this type.
symtab Include a symbol table in an AST node of this type.
defbeforeuse If an AST node of this type has a symbol table, definitions can only be found from later in the same source file.
shadowing If a definition of this type is in a symbol table, it doesn't recurse into parent symbol tables.
lookup If a definition of this type is in a symbol table, it can be found when looking up.
lookdown If a definition of this type in a symbol table, it can be found when looking down.

We will touch on some of these later as we encounter the nodes that use them.

Parsing sample

Above we see a few examples of the kinds of parsing rules we need for our language. The simplest ones add a Token via m.add(). This will add the token to the current Group. Let's look at the following line:

print "5 + 10" 5 + 10;

Let's look at this step by step (we'll show Whitespace once but will skip it afterwards for space). The current position of the cursor in the tree is indicated by the ^ symbol:

StepLocationTree
Init
print "5 + 10" 5 + 10;
^
graph TD;
  A[Top]-->B[File]
  B-->C(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (add)
print "5 + 10" 5 + 10;
     ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Print]
  C-->E(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Whitespace (no-op)
print "5 + 10" 5 + 10;
      ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Print]
  C-->E(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (add)
print "5 + 10" 5 + 10;
              ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Print]
  C-->E["String #quot;5 + 10#quot;"]
  C-->F(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (add)
print "5 + 10" 5 + 10;
                ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Print]
  C-->E["String #quot;5 + 10#quot;"]
  C-->F[Int 5]
  C-->G(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (add)
print "5 + 10" 5 + 10;
                  ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Print]
  C-->E["String #quot;5 + 10#quot;"]
  C-->F[Int 5]
  C-->G[Add]
  C-->H(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (add)
print "5 + 10" 5 + 10;
                     ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Print]
  C-->E["String #quot;5 + 10#quot;"]
  C-->F[Int 5]
  C-->G[Add]
  C-->H[Int 10]
  C-->I(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (term)
print "5 + 10" 5 + 10;
                      ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Print]
  C-->E["String #quot;5 + 10#quot;"]
  C-->F[Int 5]
  C-->G[Add]
  C-->H[Int 10]
  B-->I(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading

Note that the term command, triggered by the semi-colon, ends the Group node. In addition to the add and term functions operating on tokens we have seen this far, we will look at the effects of the functions push, pop and seq while parsing.

Parsing example with push and pop

infix includes the ability to set expression precedence using parenthesis. How do we handle parentheses when we are parsing? Let's look at the rules:

auto indent = std::make_shared<std::vector<size_t>>();

/* ... */

// Parens.
"(\\()[[:blank:]]*" >> [indent](auto& m) {
  // we push a Paren node. Subsequent nodes will be added
  // as its children.
  m.push(Paren, 1);
},

"\\)" >>
  [indent](auto& m) {
    // terminate the current group
    m.term();
    // pop back up out of the Paren
    m.pop(Paren);
  },

We'll explore this behavior with an example:

1 + (2 * 3) + 4;

Let's see how this develops:

ActionLocationTree
Init
1 + (2 * 3) + 4;
^
graph TD;
  A[Top]-->B[File]
  B-->C(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (add x 2)
1 + (2 * 3) + 4;
    ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Int 1]
  C-->E[+]
  C-->F(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (push)
1 + (2 * 3) + 4;
     ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Int 1]
  C-->E[+]
  C-->F[Paren]
  F-->G[cursor]:::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (add x 3)
1 + (2 * 3) + 4;
          ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Int 1]
  C-->E[+]
  C-->F[Paren]
  F-->G[Group]
  G-->H[Int 2]
  G-->I[*]
  G-->J[Int 3]
  G-->K(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (pop)
1 + (2 * 3) + 4;
           ^
graph TD;
  A[tTp]-->B[File]
  B-->C[Group]
  C-->D[Int 1]
  C-->E[+]
  C-->F[Paren]
  F-->G[Int 2]
  F-->H[*]
  F-->I[Int 3]
  C-->J(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (add x 2)
1 + (2 * 3) + 4;
               ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Int 1]
  C-->E[+]
  C-->F[Paren]
  F-->G[Int 2]
  F-->H[*]
  F-->I[Int 3]
  C-->J[+]
  C-->K[Int 4]
  C-->L(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (term)
1 + (2 * 3) + 4;
                ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[Int 1]
  C-->E[+]
  C-->F[Paren]
  F-->G[Int 2]
  F-->H[*]
  F-->I[Int 3]
  C-->J[+]
  C-->K[Int 4]
  B-->L(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading

As you can see, push and pop allow us to move up and down the parse tree, creating structure even as we parse the tokens themselves that will help us later in the process.

Parsing example with seq

Let's look at the rule definitions again in detail:

// Equals.
"=" >> [](auto& m) { m.seq(Equals); },

// Terminator.
";[\n]*" >> [](auto& m) { m.term({Equals}); },

These rules, for parsing assignments, are different from the other parsing rules for the infix language. The first rule calls seq. This command indicates that we are entering into a sequence of groups. Let's look at how we parse this line:

x = 5 + 3;
ActionLocationTree
Init
x = 5 + 3;
^
graph TD;
  A[Top]-->B[File]
  B-->C(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (add)
x = 5 + 3;
  ^
graph TD;
  A[Top]-->B[File]
  B-->C[Group]
  C-->D[x]
  C-->E(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (seq)
x = 5 + 3;
   ^
graph TD;
  A[Top]-->B[File]
  B-->C[=]
  C-->D[Group]
  D-->E[x]
  C-->F(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;

Loading
Token (add x 3)
x = 5 + 3;
         ^
graph TD;
  A[Top]-->B[File]
  B-->C[=]
  C-->D[Group]
  D-->E[x]
  C-->F[Group]
  F-->G[Int 5]
  F-->H[Add]
  F-->I[Int 3]
  F-->J(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading
Token (term)
x = 5 + 3;
          ^
graph TD;
  A[Top]-->B[File]
  B-->C[=]
  C-->D[Group]
  D-->E[x]
  C-->F[Group]
  F-->G[Int 5]
  F-->H[Add]
  F-->I[Int 3]
  B-->J(cursor):::cursor
  classDef cursor stroke-width:4px,stroke-dasharray:5 5;
Loading

Because we call term with Equals, it will not only terminate the group but also terminate the sequence of values in the Equals. The next call to add will add a new group at the same level as the Equals.

Passes

We now have a parse tree for our program. It has turned this program:

x = 5 + 10;
y = 1 - 9 + x;
print "1" x + y;
z = (5 * x) / y;
print "2" z;

into this tree:

top
└── file
    ├── equals
    │   ├── group
    │   │   └── ident x
    │   └── group
    │       ├── int 5
    │       ├── +
    │       └── int 10
    ├── equals
    │   ├── group
    │   │   └── ident y
    │   └── group
    │       ├── int 1
    │       ├── -
    │       ├── int 9
    │       ├── +
    │       └── ident x
    ├── group
    │   ├── print
    │   ├── string "1"
    │   ├── ident x
    │   ├── +
    │   └── ident y
    ├── equals
    │   ├── group
    │   │   └── ident z
    │   └── group
    │       ├── paren
    │       │   └── group
    │       │       ├── int 5
    │       │       ├── *
    │       │       └── ident x
    │       ├── /
    │       └── ident y
    └── group
        ├── print
        ├── string "2"
        └── ident z

We can specify what a well formed parse tree looks like in the following way:

// | is used to create a Choice between all the elements
// this indicates that literals can be an Int or a Float
inline const auto wf_literal = Int | Float;

inline const auto wf_parse_tokens = wf_literal | String | Paren | Print | Ident | Add | Subtract | Divide | Multiply;

// A <<= B indicates that B is a child of A
// ++ indicates that there are zero or more instances of the token
inline const auto wf_parser =
    (Top <<= File)
  | (File <<= (Group | Equals)++)
  | (Paren <<= Group++)
  | (Equals <<= Group++)
  | (Group <<= wf_parse_tokens++)
  ;

Each rewriting pass has a corresponding well-formedness check which is used to ensure that the input to the next pass is as expected. If the parse tree fails this check, Trieste will stop rewriting at that pass and output an error indicating what is wrong with the tree.

Pass 1: Expressions

Our first pass will take the raw tokens from the parser and organize them into mathematical expressions. We will do this by specifying a PassDef object, which has the following constructor:

PassDef(
      const std::string& name,
      const wf::Wellformed& wf,
      dir::flag direction,
      const std::initializer_list<detail::PatternEffect<Node>>& r)

The first argument is a name for the pass, wf is the well-formedness rules for the pass and the direction flag specifies the direction of the tree traversal (top-down, bottom-up or once). r in this case are rules which change the current tree using a Pattern, which matches to a subtree, and Effect:

using Effect = std::function<T(Match&)>;

which is a function that takes a Match object (i.e. a term matching the pattern) and produces an updated subtree. Naturally, we have some nice semantic sugar to keep us from having to call these constructors manually. Let's look at some example rewrite rules for this pass. The first rule can be read as "If we find a Top node with a File node as its child, replace the File node with a Calculation node with the File content":

// Here we're saying that we want to create a
// Calculation node and make all of the values in File (*_[File]) its
// children. The children from the matched term (_) are accessed by
// first retrieving the File node from the match object (_[File])
// and then its children (*_[File]).
In(Top) * T(File)[File] >>
    [](Match& _) { return Calculation << *_[File]; },

// This rule selects an Equals node with the right structure,
// i.e. a single ident being assigned. We replace it with
// an Assign node that has two children: the Ident and the
// an Expression, which will take the children of the Group.
In(Calculation) *
    (T(Equals) << ((T(Group) << T(Ident)[Id]) * T(Group)[Rhs])) >>
  [](Match& _) { return Assign << _(Id) << (Expression << *_[Rhs]); },

// This rule selects a Group that matches the Output pattern
// of `print <string> <expression>`. In this case, Any++ indicates that
// Rhs should contain all the remaining tokens in the group.
// When used here, * means nodes that are children of the In()
// node in the specified order. They can be anywhere inside
// the In() child sequence.
In(Calculation) *
    (T(Group) << (T(Print) * T(String)[Lhs] * Any++[Rhs])) >>
  [](Match& _) { return Output << _(Lhs) << (Expression << _[Rhs]); },

// This node unwraps Groups that are inside Parens, making them
// Expression nodes.
In(Expression) * (T(Paren) << T(Group)[Group]) >>
  [](Match& _) { return Expression << *_[Group]; },

The general shape of a rule is Pattern >> Effect, which produces a PatternEffect. In a pattern, T(Foo) matches a single node Foo, and C * D matches pattern C followed by pattern D. P << C means that the pattern C matches the children of whatever node is matched by pattern P. In(Foo) indicates that the parent node where the current pattern matches is a Foo node. This parent node will not be part of the Match object: only subsequent terms in the pattern are part of the Match and can be replaced by the effect. We can also bind patterns to variables to conveniently refer to constituent parts of a Pattern. For instance, T(Foo)[V] creates a pattern match of type Foo bound to the name V. Note that the name V also is a Token (for performance reasons) but does not have to be the same token as it is binding.

The result of running this pass (on the parse tree) is the following tree:

top
└── calculation
    ├── symbols: {x y z}
    ├── assign
    │   ├── ident x
    │   └── expression
    │       ├── int 5
    │       ├── +
    │       └── int 10
    ├── assign
    │   ├── ident y
    │   └── expression
    │       ├── int 1
    │       ├── -
    │       ├── int 9
    │       ├── +
    │       └── ident x
    ├── output
    │   ├── string "1"
    │   └── expression
    │       ├── ident x
    │       ├── +
    │       └── ident y
    ├── assign
    │   ├── ident z
    │   └── expression
    │       ├── expression
    │       │   ├── int 5
    │       │   ├── *
    │       │   └── ident x
    │       ├── /
    │       └── ident y
    └── output
        ├── string "2"
        └── ident z

Note how there is now a symbol table for the Calculation node. This is because it, and Assign, are defined like this:

  inline const auto Calculation = TokenDef("calculation", flag::symtab | flag::defbeforeuse);
  inline const auto Assign = TokenDef("assign", flag::lookup | flag::shadowing);

Note the use of additional flags from the table above. symtab tells Trieste that Calculation should have its own symbol table. defbeforeuse tells Trieste that every symbol in a Calculation has to be defined before it can be used. lookup indicates that children of assign nodes can be the targets of reference lookups, and shadowing indicates that the symbol defined by an Assign node will take precedence over any other which exists.

We must also specify what it means for a tree to be well-formed after the expressions pass:

inline const auto wf_expressions_tokens =
    wf_literal | Ident | Add | Subtract | Divide | Multiply | Expression;

inline const auto wf_pass_expressions =
    (Top <<= Calculation)
  | (Calculation <<= (Assign | Output)++)
  // [Ident] here indicates that the Ident node is a symbol that should
  // be stored in the symbol table
  | (Assign <<= Ident * Expression)[Ident]
  | (Output <<= String * Expression)
  // [1] here indicates that there should be at least one token
  | (Expression <<= wf_expressions_tokens++[1])
  ;

This will ensure that a tree which progresses to the next step takes this form.

Now that we have a pass, let us look at how we construct a Driver object:

static Driver d(
      "infix",
      nullptr,
      parser(),
      {
        expressions() //our only rewrite pass so far
      }
  );

We first provide a name for our language, and then a reference to some command line options (none in this case). The parser() function returns the Parse object that we specified before. We then include our rewrite passes, where the PassDef returned by the function expressions() is the only one defined thus far.

Passes 2 and 3: Operator precedence

Now that we have our tokens roughly organized around mathematical expressions, we need to solve the problem of operator precedence. We want the multiply and divide operations to have higher precedence than add and subtract. It so happens that with a multi-pass term rewriting system like Trieste, this is very straightforward. We will have one pass devoted to grouping all multiply and divide expressions together (as if they were surrounded in parentheses) and then in the next pass we will do the same for all add and subtract expressions.

Here is the first rule:

// Group multiply and divide operations together. This rule will
// select any triplet of <arg> *|/ <arg> in an expression list and
// replace it with a single <expr> node that has the triplet as
// its children.
In(Expression) *
    (ExpressionArg[Lhs] * (T(Multiply) / T(Divide))[Op] *
      ExpressionArg[Rhs]) >>
  [](Match& _) {
    return Expression
      << (_(Op) << (Expression << _(Lhs)) << (Expression << _[Rhs]));
  },

with the updated well-formedness check:

inline const auto wf_pass_multiply_divide =
  wf_pass_expressions
  | (Multiply <<= Expression * Expression)
  | (Divide <<= Expression * Expression)
  ;

Note that we do not have to repeat what came before, simply update and rewrite how certain nodes should behave.

Now we want to do the same for add and subtract:

In(Expression) *
    (ExpressionArg[Lhs] * (T(Add) / T(Subtract))[Op] *
      ExpressionArg[Rhs]) >>
  [](Match& _) {
    return Expression
      << (_(Op) << (Expression << _(Lhs)) << (Expression << _[Rhs]));
  },

And an updated well-formedness check:

inline const auto wf_pass_add_subtract =
  wf_pass_expressions
  | (Add <<= Expression * Expression)
  | (Subtract <<= Expression * Expression)
  ;

A side effect of these operations is that our flat expression structure where every token was a child of an expression is now a binary tree. Let's see this develop by looking at the following expression:

1 + 2 * 3 - 4 + 5 / (6 - 7)

and how its tree evolves over time:

IterationTree
expressions[end]
flowchart TD
  A[Expr]-->B[Int 1]
  A-->C[+]
  A-->D[Int 2]
  A-->E[*]
  A-->F[Int 3]
  A-->G[-]
  A-->H[Int 4]
  A-->I["#47;"]
  A-->J[Expr]
  J-->K[Int 6]
  J-->L[-]
  J-->M[Int 7]
Loading
multiply_divide[0]
flowchart TD
  A[Expr]-->B[Int 1]
  A-->C[+]
  A-.->N[Expr]:::current
  N-.->E[*]
  E-.->DD[Expr]:::current
  DD-.->D[Int 2]
  E-.->FF[Expr]:::current
  FF-.->F[Int 3]
  A-->G[-]
  A-->H[Int 4]
  A-->I["#47;"]
  A-->J[Expr]
  J-->K[Int 6]
  J-->L[-]
  J-->M[Int 7]
  classDef current stroke-width:4px,stroke-dasharray:5 5;
Loading
multiply_divide[end]
flowchart TD
  A[Expr]-->B[Int 1]
  A-->C[+]
  A-->N[Expr]
  N-->E[*]
  E-->DD[Expr]
  DD-->D[Int 2]
  E-->FF[Expr]
  FF-->F[Int 3]
  A-->G[-]
  A-.->O[Expr]:::current
  O-.->I["#47;"]
  I-.->HH[Expr]:::current
  HH-.->H[Int 4]
  I-.->J[Expr]
  J-->K[Int 6]
  J-->L[-]
  J-->M[Int 7]
  classDef current stroke-width:4px,stroke-dasharray:5 5;
Loading
add_subtract[0]
flowchart TD
  A[Expr]-.->CC[Expr]:::current
  CC-.->C[+]
  C-.->BB[Expr]:::current
  BB-.->B[Int 1]
  C-.->N[Expr]
  N-->E[*]
  E-->DD[Expr]
  DD-->D[Int 2]
  E-->FF[Expr]
  FF-->F[Int 3]
  A-->G[-]
  A-->O[Expr]
  O-->I["#47;"]
  I-->HH[Expr]
  HH-->H[Int 4]
  I-->J[Expr]
  J-->K[Int 6]
  J-->L[-]
  J-->M[Int 7]
  classDef current stroke-width:4px,stroke-dasharray:5 5;
Loading
add_subtract[1]
flowchart TD
  A[Expr]-->GG[-]
  GG-.->CC[Expr]
  CC-->C[+]
  C-->BB[Expr]
  BB-->B[Int 1]
  C-->N[Expr]
  N-->E[*]
  E-->DD[Expr]
  DD-->D[Int 2]
  E-->FF[Expr]
  FF-->F[Int 3]
  GG-.->O[Expr]
  O-->I["#47;"]
  I-->HH[Expr]
  HH-->H[Int 4]
  I-->J[Expr]
  J-->K[Int 6]
  J-->L[-]
  J-->M[Int 7]
Loading
add_subtract[end]
flowchart TD
  A[Expr]-->GG[-]
  GG-->CC[Expr]
  CC-->C[+]
  C-->BB[Expr]
  BB-->B[Int 1]
  C-->N[Expr]
  N-->E[*]
  E-->DD[Expr]
  DD-->D[Int 2]
  E-->FF[Expr]
  FF-->F[Int 3]
  GG-->O[Expr]
  O-->I["#47;"]
  I-->HH[Expr]
  HH-->H[Int 4]
  I-->J[Expr]
  J-->L[-]
  L-.->KK[Expr]:::current
  KK-.->K[Int 6]
  L-.->MM[Expr]:::current
  MM-.->M[Int 7]
  classDef current stroke-width:4px,stroke-dasharray:5 5;
Loading

The resulting tree can be evaluated recursively to obtain a result which respects the operator precedence rules.

Pass 4: Trim

At this point there may be a few nodes where you have an expression as a child of an expression. Before we proceed we want to clean up the tree so that all expressions really are binary calculation trees. Rule:

// End is a special pattern which indicates that there
// are no further nodes. So in this case we are matching
// an Expression which has a single Expression as a
// child.
T(Expression) << (T(Expression)[Expression] * End) >>
    [](Match& _) { return _(Expression); },

Well-formed check:

inline const auto wf_pass_trim =
  wf_pass_add_subtract
  | (Expression <<= wf_operands_tokens)
  ;

At first this may seem like a trivial pass. Why not include it in another pass? However, the "smaller" a pass, the less it does, the easier it is to reason about and support. There is no downside to having many passes, so when using Trieste err on the side of more, smaller passes like this one.

At the end of these passes our tree looks like this:

top
└── calculation
    ├── symbols: {x y z}
    ├── assign
    │   ├── ident x
    │   └── expression
    │       └── +
    │           ├── expression
    │           │   └── int 5
    │           └── expression
    │               └── int 10
    ├── assign
    │   ├── ident y
    │   └── expression
    │       └── +
    │           ├── expression
    │           │   └── -
    │           │       ├── expression
    │           │       │   └── int 1
    │           │       └── expression
    │           │           └── int 9
    │           └── expression
    │               └── ident x
    ├── output
    │   ├── string "1"
    │   └── expression
    │       └── +
    │           ├── expression
    │           │   └── ident x
    │           └── expression
    │               └── ident y
    ├── assign
    │   ├── ident z
    │   └── expression
    │       └── /
    │           ├── expression
    │           │   └── *
    │           │       ├── expression
    │           │       │   └── int 5
    │           │       └── expression
    │           │           └── ident x
    │           └── expression
    │               └── ident y
    └── output
        ├── string "2"
        └── expression
            └── ident z

Pass 5: Checking References

At this stage, we will check that all of our identifiers are defined. Recall that we are building a symbol table for the calculation. Trieste allows us to lookup identifiers that are used in an expression in the symbol table. This means that we can verify that all identifiers being used are properly defined before use. Also, as a bonus, Trieste will automatically detect multiple declaration errors as part of building the symbol table. Here is an example of a rule that checks the identifiers:

In(Expression) * T(Ident)[Id] >>
  [](Match& _) {
    auto id = _(Id);            // the Node object for the identifier
    auto defs = id->lookup();   // a list of matching symbols
    if (defs.size() == 0)
    {
      // there are no symbols with this identifier
      return err(id, "undefined");
    }

    return Ref << id;
  }

This means that from this point onwards we know that every identifier in the tree is valid.

Well-formedness Definition

At this stage we have successfully constructed an AST for an infix program. Users of our code need to know what this final form is, so we should export it as part of the infix namespace:

  const auto wf =
    (Top <<= Calculation)
    | (Calculation <<= (Assign | Output)++)
    | (Assign <<= Ident * Expression)[Ident]
    | (Output <<= String * Expression)
    | (Expression <<= (Add | Subtract | Multiply | Divide | Ref | Float | Int))
    | (Ref <<= Ident)
    | (Add <<= Expression * Expression)
    | (Subtract <<= Expression * Expression)
    | (Multiply <<= Expression * Expression)
    | (Divide <<= Expression * Expression)
    ;

Every infix AST which the Parse object and the above passes produce will conform to this. Now we need to expose a Reader to our users which provides an easy way for them to produce an AST.

Reader

The Reader helper is a class which Trieste provides to make it easy to provide parsing functionality to users. It has the following constructor:

Reader(const std::string& language_name,
       const std::vector<Pass>& passes,
       const Parse& parser)

So, in the case of infix we create one like this:

  Reader reader()
  {
    return {
      "infix",
      {expressions(), multiply_divide(), add_subtract(), trim(), check_refs()},
      parser(),
    };
  }

The user can then use the reader object to read a source file as easily as:

ProcessResult result = infix::reader().file("simple.infix").read();
if (!result.ok)
{
  logging::Error err;
  result.print_errors(err);
  return 1;
}

Node ast = result.ast;

However, the reader object has many other useful fields which can aid in debugging, for example:

reader.file(path)
  .debug_enabled(true)
  .debug_path("debug")
  .wf_check_enabled(true)

Which will output intermediary ASTs to a debug folder for each pass and perform a strict well-formedness check after parse and each pass. You can even restart parsing from a particular pass using start_pass or stop prematurely using end_pass.

Writer

The Writer helper class provides an easy way as a language implementer for you to expose the ability to take an AST and write it out again. This is useful for things such as auto-formatters or converters. In the case of infix we will use it to implement two writers: one which outputs a fully-parenthesized version of the input, and another which outputs a postfix version of the input.

The Writer class has the following constructor:

using WriteFile = std::function<bool(std::ostream&, Node)>;
Writer(const std::string& language_name,
       const std::vector<Pass>& passes,
       const wf::Wellformed& input_wf,
       WriteFile write_file)

The passes here are different than they usually would be. The job of these passes is to prepare the AST so that it has the following structure:

  // clang-format off
  inline const auto wf_writer =
    (Top <<= Directory | File)
    | (Directory <<= Path * FileSeq)
    | (FileSeq <<= (Directory | File)++)
    | (File <<= Path * Contents)
    ;
  // clang-format on

Note that Contents here is a placeholder. The passes you provide will determine how to take an AST that adheres to input_wf and break it out into one or more files in a directory structure, as shown above. In the case of infix this will always be a single file, and Contents will be a Calculation node, but the system provides full flexibility for more complex outputs.

Once the passes have completed successfully, the Writer calls write_file with an output stream and a node, and as language implementer you then have fully freedom to turn the node it provides into a file. Here is our first example, in which we create a fully-parenthesized version of the input:

  Writer writer(const std::filesystem::path& path)
  {
    return {
      "infix", {to_file(path)}, infix::wf, [](std::ostream& os, Node contents) {
        return write_infix(os, contents);
      }};
  }

  PassDef to_file(const std::filesystem::path& path)
  {
    return {
      "to_file",
      wf_to_file,
      dir::bottomup | dir::once,
      {
        In(Top) * T(Calculation)[Calculation] >>
          [path](Match& _) {
            return File << (Path ^ path.string()) << _(Calculation);
          },
      }};
  }

The write_infix function takes nodes and writes them out. For example, take a look at this snippet:

    if (node->in({Add, Subtract, Multiply, Divide}))
    {
      os << "(";
      if (write_infix(os, node->front()))
      {
        return true;
      }
      os << " " << node->location().view() << " ";
      if (write_infix(os, node->back()))
      {
        return true;
      }
      os << ")";

      return false;
    }

We can create the postfix writer in a very similar way:

  Writer postfix_writer(const std::filesystem::path& path)
  {
    return {
      "postfix", {to_file(path)}, infix::wf, [](std::ostream& os, Node contents) {
        return write_postfix(os, contents);
      }};
  }

And a snippet from write_postfix:

    if (node->in({Add, Subtract, Multiply, Divide}))
    {
      if (write_postfix(os, node->front()))
      {
        return true;
      }

      os << " ";

      if (write_postfix(os, node->back()))
      {
        return true;
      }
      
      os << " " << node->location().view();

      return false;
    }

One created, these Writer objects can be used in much the same way as Reader objects:

ProcessResult result = infix::writer().dir(".").write(ast);

There are also some operator overloads, so you can write an auto-formatter (for example) as:

ProcessResult result = infix::reader().file("simple.infix") >>
                       infix::writer("clean.infix").dir(".");

Rewriter

In addition to the ability to export Reader and Writer helpers, Trieste also provides the Rewriter helper class as a way of exporting a sequence of paths that transform your AST into other forms. In the YAML language implementation this functionality is used to convert YAML to JSON (see to_json.cc). For infix we will use this to expose a series of passes which execute the program, computing all the expressions until we end up with a series of print statements stated in terms of literals. This will give us a chance to look at some more useful functionality which comes into play when performing these kinds of transformations.

Pass 6: Maths

First, we lookup all identifiers and replace them with their values:

T(Expression) << (T(Ref) << T(Ident)[Id])(
  [](auto& n) { return can_replace(n); }) >>
  [](Match& _) {
    auto defs = _(Id)->lookup();
    auto assign = defs.front();
    // the assign node has two children: the ident, and its value
    // this returns the second
    return assign->back();
  },

Note that this pass will run many times (for example, with the sample program we've been looking at it runs 8 times) and so while this may not be true at first, it will be eventually.

Next, we replace all expressions which have a single literal value with a Literal node:

T(Expression) << (T(Int) / T(Float))[Rhs] >>
  [](Match& _) { return Literal << _(Rhs); },

Now we are ready to start collapsing the maths nodes. Here is an example of the logic for Add:

int get_int(const Node& node)
{
  std::string text(node->location().view());
  return std::stoi(text);
}

double get_double(const Node& node)
{
  std::string text(node->location().view());
  return std::stod(text);
}

/* ... */

T(Add) << ((T(Literal) << T(Int)[Lhs]) * (T(Literal) << T(Int)[Rhs])) >>
  [](Match& _) {
    int lhs = get_int(_(Lhs));
    int rhs = get_int(_(Rhs));
    // ^ here means to create a new node of Token type Int with the
    // provided string as its location.
    return Int ^ std::to_string(lhs + rhs);
  },

T(Add) << ((T(Literal) << Number[Lhs]) * (T(Literal) << Number[Rhs])) >>
  [](Match& _) {
    double lhs = get_double(_(Lhs));
    double rhs = get_double(_(Rhs));
    return Float ^ std::to_string(lhs + rhs);
  },

These rules (and similar ones for Subtract, Multiply, and Divide) will run again and again until every expression has been collapsed to a single Literal node as we see below:

top
└── calculation
    ├── assign
    │   ├── ident x
    │   └── literal
    │       └── int 15
    ├── assign
    │   ├── ident y
    │   └── literal
    │       └── int 7
    ├── output
    │   ├── string "1"
    │   └── literal
    │       └── int 22
    ├── assign
    │   ├── ident z
    │   └── literal
    │       └── int 10
    └── output
        ├── string "2"
        └── literal
            └── int 10

Pass 7: Cleanup

Those Assign nodes do nothing at this point, and so we can remove them. We can also elide the Literal node for similar reasons. Rules:

In(Calculation) * T(Assign) >> [](Match&) -> Node { return {}; },

T(Literal) << Any[Rhs] >> [](Match& _) { return _(Rhs); },

The final well-formedness check:

inline const auto wf_pass_cleanup =
  wf_pass_maths
  | (Calculation <<= Output++)
  // note the use of >>= here. This allows us to have a choice
  // as a field by giving it a temporary name.
  | (Output <<= String * (Expression >>= wf_literal))
  ;

and our final tree:

top
└── calculation
    ├── output
    │   ├── string "1"
    │   └── int 22
    └── output
        ├── string "2"
        └── int 10

The Rewriter class has the following constructor:

Rewriter(
      const std::string& name,
      const std::vector<Pass>& passes,
      const wf::Wellformed& input_wf);

For our rewriter we can create this in the following way:

Rewriter calculate()
{
  return Rewriter("calculate", {maths(), cleanup()}, infix::wf);
}

We can then write the following code:

ProcessResult result = infix::reader().file("simple.infix") >> infix::calculate();

And receive an AST which contains only output nodes:

Node calc = result.ast->front();
for(auto& output : *calc){
  auto str = output->front()->location().view();
  auto val = output->back()->location().view();
  std::cout << str << " " << val << std::endl;
}

You can see a full working example that uses all the helpers we have discussed in infix.cc.

Running the infix_trieste Executable

Trieste also provides a default way of creating an executable in the Driver class. The resulting program will do some incredibly useful things for us:

int main(int argc, char** argv)
{
  return trieste::Driver(infix::reader()).run(argc, argv);
}

Usage:

infix_trieste
Usage: ./dist/infix/infix_trieste [OPTIONS] SUBCOMMAND

Options:
  -h,--help                   Print this help message and exit
  --help-all                  Expand all help

Subcommands:
  build                       Build a path
  test                        Run automated tests

build

The build command takes an infix source file and will produce an AST file. Its usage is:

Build a path
Usage: ./dist/infix/infix_trieste build [OPTIONS] path

Positionals:
  path TEXT REQUIRED          Path to compile.

Options:
  -h,--help                   Print this help message and exit
  --help-all                  Expand all help
  -l,--log_level TEXT         Set Log Level to one of Trace, Debug, Info, Warning, Output, Error, None
  -w,                         Check well-formedness.
  -p,--pass TEXT:{parse,expressions,multiply_divide,add_subtract,trim,check_refs,maths,cleanup}
                              Run up to this pass.
  -o,--output TEXT            Output path.
  --dump_passes TEXT          Dump passes to the supplied directory.

for example:

./infix_trieste build simple.infix -l Info

outputs:

---------
Pass	Iterations	Changes	Time (us)
expressions     2       5       110
multiply_divide 1       0       50
add_subtract    2       2       95
trim    2       2       76
check_refs      2       1       68
---------

and creates the file simple.trieste containing:

infix
check_refs
(top
  {}
  (infix-calculation
    {
      x = infix-assign
      y = infix-assign}
    (infix-assign
      (infix-ident 1:x)
      (infix-expression
        (infix-int 1:5)))
    (infix-output
      (infix-string 3:"x")
      (infix-expression
        (infix-ref
          (infix-ident 1:x))))
    (infix-assign
      (infix-ident 1:y)
      (infix-expression
        (infix-subtract
          (infix-expression
            (infix-int 1:2))
          (infix-expression
            (infix-int 1:1)))))
    (infix-output
      (infix-string 8:"1 + 10")
      (infix-expression
        (infix-add
          (infix-expression
            (infix-int 1:1))
          (infix-expression
            (infix-int 2:10)))))))

test

The test command will perform generative testing of each pass using the well-formedness definitions. Usage:

Run automated tests
Usage: ./dist/infix/infix_trieste test [OPTIONS] [start] [end]

Positionals:
  start TEXT:{parse,expressions,multiply_divide,add_subtract,trim,check_refs}
                              Start at this pass.
  end TEXT:{parse,expressions,multiply_divide,add_subtract,trim,check_refs}
                              End at this pass.

Options:
  -h,--help                   Print this help message and exit
  --help-all                  Expand all help
  -c,--seed_count UINT        Number of iterations per pass
  -s,--seed UINT              Random seed for testing
  -l,--log_level TEXT         Set Log Level to one of Trace, Debug, Info, Warning, Output, Error, None
  -d,--max_depth UINT         Maximum depth of AST to test
  -f,--failfast               Stop on first failure

For each pass, it will use its input WF definition to produce seed_count random inputs that adhere to that definition, process them with the pass, and check them against the output WF definition. For example

./infix_trieste test -f -c 1000 -l Info

outputs:

Testing x1000, seed: 3316469074
Testing pass: expressions
Testing pass: multiply_divide
Testing pass: add_subtract
Testing pass: trim
Testing pass: check_refs
Testing pass: maths
Testing pass: cleanup

This testing can be incredibly helpful in finding errors in the WF definitions and the rewrite rules, but also requires that you explicitly produce error messages for all possible syntax problems in each pass.

Errors

Yet another advantage of a multi-pass rewrite system like Trieste is the ability to produce fine-grained errors. Indeed, as we just mentioned, the testing system of the driver will require these messages and help you find the various edge and corner cases in your rules. Most of these messages will likely be in the first pass (i.e. post-parse).

The Error token allows the creation of a special node which we can use to replace the erroneous node. This will then exempt that subtree from the well-formedness check. This is the mechanism by which we can use the testing system to discover edge cases, i.e. the testing will not proceed to the next pass until all of the invalid subtrees have been marked as Error. Error nodes can conveniently be created with the err function:

auto err(NodeRange& r, const std::string& msg)
{
  return Error << (ErrorMsg ^ msg) << (ErrorAst << r);
}

auto err(Node node, const std::string& msg)
{
  return Error << (ErrorMsg ^ msg) << (ErrorAst << node);
}

In addition to the rewrite rules for valid expressions (pass 1) we saw before, we should define Error rules. Since the rules are matched in order, the first rule below matches Paren nodes with no children (a previous rule will have handled those with children). A few examples of Error rules for the expression pass:

// Empty parenthesis
T(Paren)[Paren] >> [](Match& _) { return err(_(Paren), "Empty paren"); },

// Ditto for malformed equals nodes
T(Equals)[Equals] >>
  [](Match& _) { return err(_(Equals), "Invalid assign"); },

// Orphaned print node will catch bad output statements
T(Print)[Print] >>
  [](Match& _) { return err(_(Print), "Invalid output"); },

// Our WF definition allows this, so we need to handle it.
T(Expression)[Rhs] << End >>
  [](Match& _) { return err(_(Rhs), "Empty expression"); },

// Same with this.
In(Expression) * T(String)[String] >>
  [](Match& _) {
    return err(_(String), "Expressions cannot contain strings");
  },

Standard practice is to write the "positive" rules for your language, add the errors you can think of, and then discover edge and corner cases via the testing. This does not just happen in the early stages of the process, however. Take, for example, the maths pass. Let's look at the logic for Divide:

T(Divide)
    << ((T(Literal) << T(Int)[Lhs]) * (T(Literal) << T(Int)[Rhs])) >>
  [](Match& _) {
    int lhs = get_int(_(Lhs));
    int rhs = get_int(_(Rhs));
    if (rhs == 0)
    {
      return err(_(Rhs), "Divide by zero");
    }

    return Int ^ std::to_string(lhs / rhs);
  },

In a very cool way, Trieste allows us to detect divide by zero errors. However, this now means that, potentially, some of the nodes in our AST will be of type error, resulting in empty subexpressions. This means we need to add some error handling, for example:

// Note how we pattern match explicitly for the Error node
In(Expression) *
    (MathsOp << ((T(Expression)[Expression] << T(Error)) * T(Literal))) >>
  [](Match& _) {
    return err(_(Expression), "Invalid left hand argument");
  },

By finding these errors explicitly we can propagate the error up the tree, thus eventually allowing the bad subtree to be exempted from the WF check and allowing the testing to proceed.