seancribbs edited this page Sep 13, 2010 · 11 revisions

About

Neotoma is a packrat parser generator for Erlang that works from Parsing Expression Grammars (PEGs). It consists of a parsing-combinator library with memoization routines, a parser for PEGs, and a utility to generate parsers from PEGs. It is inspired by Treetop, a Ruby library with similar aims, and Parsec, the parser-combinator library for Haskell.

Getting started

  1. Clone the repository:
    $ git clone git://github.com/seancribbs/neotoma.git
  2. Build the library:
    $ cd neotoma
    $ make
  3. Symlink or copy the neotoma application into your lib path (if you configured Erlang with --prefix=/usr/local, for example):
    $ ln -s "$(pwd)" /usr/local/lib/erlang/lib/neotoma
  4. Start the Erlang shell and generate your parser:
    $ erl
    1> neotoma:file("mygrammar.peg").
    ok

Writing a Grammar

Neotoma’s PEG grammars are based on the grammars from Bryan Ford’s thesis with some influences from Treetop. The basic format is thus:

 nonterminal <- parsing_expression;

Where parsing_expression is any combination of nonterminals, terminals and sub-expressions (e, e1, e2 are parsing expressions) as described below:

Non-terminal symbol                 some_nonterminal  All nonterminals on the RHS must have a corresponding rule/reduction.
String                              "Hello, world"    Single- or double-quoted, quotes escaped with \\
Character class                     [a-zA-Z0-9]       Just as in the re module
Any single character                .
Sequence                            e1 e2
Ordered choice                      e1 / e2
Grouping                            (e)
Zero-width positive lookahead       &e
Zero-width negative lookahead       !e
Optional (zero-or-more) repetition  e*
Mandatory (one-or-more) repetition  e+
Optional expression                 e?
Label                               name:e            Helps in extracting sub-expressions; creates {name, SubTree} tuples in the AST.

Currently all reductions must end with a semi-colon ;. The first rule/reduction in your grammar will be considered the root of the parse-tree.

Working with the AST

Without specifying any transformations, Neotoma will return a nested list of the results of its parse — essentially an S-expression. In this form, the AST is not very useful; one needs to transform and annotate the tree into a useful data structure. Neotoma provides hooks into the parsing process in the form of the transform/3 function (or the inline code blocks). Once you have generated your parser, you can edit this function in the generated file. The prototype is thus:

transform('nonterminal', Node, Index)
  • nonterminal is the nonterminal that was successfully parsed.
  • Node is a list of the results from sub-expressions, which may be raw terminals or the transformations of other nonterminals.
  • Index is a tuple representing the position of the parser at the start of this expression, in the form {{line, L},{column,C}} where L and C are both integers.
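As a rough illustration of the transform/3 prototype, the sketch below handles two nonterminals from a hypothetical grammar (shown in the comments; it is not part of Neotoma). The module name transform_sketch is an assumption made so the example compiles on its own. The exact shape of Node for a character-class repetition may differ between Neotoma versions, so the code normalises it with iolist_to_binary/1 rather than relying on one representation.

```erlang
-module(transform_sketch).
-export([transform/3]).

%% Hypothetical grammar assumed by this sketch:
%%   sum    <- left:digits "+" right:digits;
%%   digits <- [0-9]+;

%% digits: Node holds the matched characters (assumed to be an iolist);
%% flatten them into one string and convert to an integer.
transform(digits, Node, _Index) ->
    list_to_integer(binary_to_list(iolist_to_binary(Node)));
%% sum: labelled sub-expressions arrive as {Label, SubTree} tuples,
%% so they can be looked up by label. The unlabelled "+" is ignored.
transform(sum, Node, {{line, Line}, {column, Col}}) ->
    Left = proplists:get_value(left, Node),
    Right = proplists:get_value(right, Node),
    {sum, Left + Right, {Line, Col}};
%% Any other nonterminal: return the node unchanged.
transform(_Nonterminal, Node, _Index) ->
    Node.
```

Calling the function by hand shows the idea: transform_sketch:transform(digits, "42", {{line,1},{column,1}}) returns 42, and a sum node of [{left, 1}, "+", {right, 2}] at the same index becomes {sum, 3, {1, 1}}.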

Using transform modules

While editing transform/3 within the generated parser is easy, Neotoma will overwrite your changes if you regenerate the parser. Therefore, I recommend that you specify an external module in which to do your transformations (or use inline blocks, as described below). Doing so lets you develop your grammar and transformations independently. You can do this by passing the transform_module option to neotoma:file/2. The module will be generated for you if it does not already exist. An example:

1> neotoma:file("mygrammar.peg", [{transform_module, myast}]).

Inline AST transformations

As of version 1.3, Neotoma allows code inline with your grammar for AST transformation and additional support functions. A reduction may optionally be followed by a code block enclosed in backticks (`), or by a single tilde (~). The code block becomes the body of the transformation function. The ~ creates an identity transformation, equivalent to `Node`. Example from the JSON parser:

number <- int frac? exp? 
`
case Node of
  [Int, [], []] -> list_to_integer(lists:flatten([Int]));
  [Int, Frac, []] -> list_to_float(lists:flatten([Int, Frac]));
  [Int, [], Exp] -> list_to_float(lists:flatten([Int, ".0", Exp]));
  _ -> list_to_float(lists:flatten(Node))
end
`;

The Node and Idx variables are available to your code block.
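To see what the inline block for number does with concrete input, the same case expression can be wrapped in an ordinary function and called directly. The module and function names here (number_sketch, number/2) are hypothetical, and the wrapper only approximates whatever function Neotoma actually generates. The sample Node values also assume that int, frac and exp each yield a string, or [] when an optional part is absent.

```erlang
-module(number_sketch).
-export([number/2]).

%% Hypothetical stand-in for the transformation generated from the
%% `number` rule. Node and Idx are the variables the block can see.
number(Node, _Idx) ->
    case Node of
        %% Integer part only, e.g. ["12", [], []]
        [Int, [], []]   -> list_to_integer(lists:flatten([Int]));
        %% Integer and fraction, e.g. ["12", ".5", []]
        [Int, Frac, []] -> list_to_float(lists:flatten([Int, Frac]));
        %% Integer and exponent: insert ".0" so list_to_float/1 accepts it
        [Int, [], Exp]  -> list_to_float(lists:flatten([Int, ".0", Exp]));
        %% All three parts present
        _               -> list_to_float(lists:flatten(Node))
    end.
```

Under those assumptions, number_sketch:number(["12", [], []], Idx) yields the integer 12, while ["12", ".5", []] yields the float 12.5. The third clause exists because list_to_float/1 rejects a string such as "1e3" that has no fractional part.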

To add additional support functions, just put another backtick-delimited block at the bottom of the grammar. All code will be added verbatim to the generated parser.

Future features

  • Support for parsing in binary form/UTF.
  • Support for LFE and Reia.