Skip to content

Parsing is Difficult

andychu edited this page Oct 14, 2017 · 31 revisions

Blog Theme

https://news.ycombinator.com/item?id=13821829 -- Real parsers are hand-written; they use code generators or meta-languages.

https://news.ycombinator.com/item?id=13041646 -- metalanguage feedback in response to old META II paper. Ruby parser is complex. JRuby parser is an example of writing two parsers.

https://news.ycombinator.com/item?id=13630686 -- my own metalanguage

Language Design and Theory of Computation -- CFGs aren't powerful enough. Java has two grammars.

Note about error messages: are parse errors really hard? They all seem similar, as long as you have location info? Python doesn't seem to have any special handling of parse errors. It uses a grammmar.

I think type-checking error messages are hard, like Clang diagnostics. And runtime errors to an extent. But parse errors may be easy if the language is straightforward enough.

The one place I might want an advanced error message is for $((. You want to know where it started thinking of things as arithmetic. The place you fix is different than the place the error occurred.

Real Languages have at least Two Parsers

  • Python: CPython parser, lib2to3, redbaron

  • Ruby: MRI parser, JRuby parser

  • Go: C parser, Go parser

  • Scala: seems like scala-meta might have another parser

  • JavaScript: many different parsers

  • Unified parsers:

    • Clang
    • Roslyn

From coverage.py. The Python stdlib tokenizer isn't good enough, because it's not lossless! It's also a second tokenizer, since the C one is not wrapped.

def phys_tokens(toks):
    """Return all physical tokens, even line continuations.

    tokenize.generate_tokens() doesn't return a token for the backslash that
    continues lines.  This wrapper provides those tokens so that we can
    re-create a faithful representation of the original source.

    Returns the same values as generate_tokens()

Blog Outline

  • The state of the art is hand-written parsers

  • The state of the art is two hand-written parsers (possibly written to the same spec, or not)

  • The state of the art is two grammars? (TODO: Look at Java). Grammars are no longer declarative.

  • State of the art:

    • v8 and Clang for performance
    • TypeScript / C# for generality
    • v8 for safety? any others?

Gaps Between Theory and Practice

Parsing Requires Little Tricks

  • Operator Precedence Parsing, even though C grammars can manually encode it (https://news.ycombinator.com/item?id=15470988)
  • Handling left-recursive rules in a recursive descent parser: https://tavianator.com/bfs-from-the-ground-up-2/
    • Matters for shell's equal-precedence || and &&
    • this is the difference between treating a parsing as set membership vs. creating a useful tree structure
  • Parse Tree vs. AST vs. LST (Lossless Syntax Tree Pattern)
    • these are punted onto semantic actions
  • backtracking or predictive recursive descent?
    • lookahead?
  • C's context sensitivity (types vs. identifiers)
  • dangling else problem (the canonical example of a shift-reduce conflict)

Contrast with paper: Pure Declarative Syntax Lost and Regained.

Idea for Parsing Tool

  • Stateful lexer -- generalization of regular lexer.
  • LL(k) for efficiency
  • pratt parsing (or shunting yard) for efficiency and compact description of a grammar
  • LR(k) for C, Awk, maybe Ruby, R, etc.
    • TODO: research the ways in which LL(k) is not sufficient for those languages?
    • we know about C prototypes declarations vs. definitions. This is LR(k), but requires infinite k for LL(k).
    • JavaScript => function.
  • Zephyr ASDL for abstract syntax and Lossless Syntax Tree Pattern
  • what language for semantic actions? Maybe reuse the OVM library?
Clone this wiki locally