-
-
Notifications
You must be signed in to change notification settings - Fork 162
Parsing is Difficult
Blog Theme
https://news.ycombinator.com/item?id=13821829 -- Real parsers are hand-written; they use code generators or meta-languages.
https://news.ycombinator.com/item?id=13041646 -- metalanguage feedback in response to old META II paper. Ruby parser is complex. JRuby parser is an example of writing two parsers.
https://news.ycombinator.com/item?id=13630686 -- my own metalanguage
Language Design and Theory of Computation -- CFGs aren't powerful enough. Java has two grammars.
Note about error messages: are parse errors really hard? They all seem similar, as long as you have location info? Python doesn't seem to have any special handling of parse errors. It uses a grammmar.
I think type-checking error messages are hard, like Clang diagnostics. And runtime errors to an extent. But parse errors may be easy if the language is straightforward enough.
The one place I might want an advanced error message is for $((
. You want to know where it started thinking of things as arithmetic. The place you fix is different than the place the error occurred.
-
Python: CPython parser, lib2to3, redbaron
-
Ruby: MRI parser, JRuby parser
-
Go: C parser, Go parser
-
Scala: seems like scala-meta might have another parser
-
JavaScript: many different parsers
-
Unified parsers:
- Clang
- Roslyn
From coverage.py. The Python stdlib tokenizer isn't good enough, because it's not lossless! It's also a second tokenizer, since the C one is not wrapped.
def phys_tokens(toks):
"""Return all physical tokens, even line continuations.
tokenize.generate_tokens() doesn't return a token for the backslash that
continues lines. This wrapper provides those tokens so that we can
re-create a faithful representation of the original source.
Returns the same values as generate_tokens()
- TODO: Look at the list of linters here: https://github.com/mcandre/linters
- what parsers do they use?
-
The state of the art is hand-written parsers
-
The state of the art is two hand-written parsers (possibly written to the same spec, or not)
-
The state of the art is two grammars? (TODO: Look at Java). Grammars are no longer declarative.
-
State of the art:
- v8 and Clang for performance
- TypeScript / C# for generality
- v8 for safety? any others?
-
https://www.reddit.com/r/ProgrammingLanguages/comments/74ktjg/what_compromises_have_you_made/do0rsku/ -- In theory, parsing is about testing whether a language is in a set. In practice, it's also about creating an AST/LST structure. This is where the concept of "ambiguity" arises -- more than one possible sequence of productions can lead to the same string.
- It's also analogous to the "group capture" problem in regular expressions. We want to retain linear time performance and specify greedy/nongreedy behavior when extracting substrings: https://swtch.com/~rsc/regexp/
- Operator Precedence Parsing, even though C grammars can manually encode it (https://news.ycombinator.com/item?id=15470988)
- Handling left-recursive rules in a recursive descent parser: https://tavianator.com/bfs-from-the-ground-up-2/
- Matters for shell's equal-precedence
||
and&&
- this is the difference between treating a parsing as set membership vs. creating a useful tree structure
- Matters for shell's equal-precedence
- Parse Tree vs. AST vs. LST (Lossless Syntax Tree Pattern)
- these are punted onto semantic actions
- backtracking or predictive recursive descent?
- lookahead?
- C's context sensitivity (types vs. identifiers)
- dangling else problem (the canonical example of a shift-reduce conflict)
Contrast with paper: Pure Declarative Syntax Lost and Regained.
- Stateful lexer -- generalization of regular lexer.
- LL(k) for efficiency
- pratt parsing (or shunting yard) for efficiency and compact description of a grammar
- LR(k) for C, Awk, maybe Ruby, R, etc.
- TODO: research the ways in which LL(k) is not sufficient for those languages?
- we know about C prototypes declarations vs. definitions. This is LR(k), but requires infinite k for LL(k).
- JavaScript
=>
function.
- Zephyr ASDL for abstract syntax and Lossless Syntax Tree Pattern
- what language for semantic actions? Maybe reuse the OVM library?