Why Lexing and Parsing Should Be Separate
*Summary: Do the easy thing with the fast algorithm, and the hard thing with one of several slow algorithms. They are different things.*
Note: the arguments here are more important for big languages/parsers. For small languages, these issues don't matter as much.
- Code Clarity: Lexing and parsing are fundamentally different. While in practice there isn't a precise line between lexing and parsing (for a given language, or across languages), here's a clear way to organize it:
  - A lexer recognizes the non-recursive structure of a language. Its output is a token stream.
  - A parser recognizes the recursive structure of a language. Its output is a tree.
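To make this division concrete, here's a minimal sketch in Python, with a toy grammar of `+`-separated numbers (all names are illustrative, not Oil's actual code). The lexer makes one flat pass and emits a token stream; the parser consumes tokens and returns a tree:

```python
import re

# Non-recursive structure: one linear pass over characters, yielding tokens.
TOKEN_RE = re.compile(r'\s*(?:(\d+)|(.))')  # a number, or any other character

def lex(s):
    tokens = []
    for number, op in TOKEN_RE.findall(s):
        tokens.append(('NUM', int(number)) if number else ('OP', op))
    tokens.append(('EOF', None))
    return tokens

# Recursive structure: expr -> NUM ('+' NUM)*, producing a tree of tuples.
def parse(tokens):
    pos = 0
    def num():
        nonlocal pos
        kind, value = tokens[pos]
        assert kind == 'NUM', 'expected a number'
        pos += 1
        return value
    node = num()
    while tokens[pos] == ('OP', '+'):
        pos += 1
        node = ('+', node, num())
    return node

print(parse(lex('1 + 2 + 3')))  # ('+', ('+', 1, 2), 3)
```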
- Code Stability: Lexing is straightforward and "solved", but parsing isn't. The best parsing model isn't clear up front, since it depends on the details of the language. At some point, you may want to change the parsing model, while keeping the lexer the same. (CPython is doing this in Python 3.9! From pgen LL(1) to PEG.)
  - Note: If you're parsing an existing language, the safest option is to use the same parsing model that the original parser used, and the same token definitions to divide the lexer from the parser.
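As a hedged illustration of that boundary: both toy parsers below consume the same token list, so either can replace the other without touching the lexer. The tuple-based tokens and function names are invented for this sketch:

```python
# Tokens from any lexer; the parser's only contract is this flat list.
TOKENS = [('NUM', 1), ('OP', '+'), ('NUM', 2), ('OP', '+'), ('NUM', 3), ('EOF', None)]

def parse_recursive(tokens, pos=0):
    """Model A: recursive descent, building a right-leaning tree."""
    kind, value = tokens[pos]
    assert kind == 'NUM'
    if tokens[pos + 1] == ('OP', '+'):
        return ('+', value, parse_recursive(tokens, pos + 2))
    return value

def parse_iterative(tokens):
    """Model B: a single left-to-right reduction loop, building a left-leaning tree."""
    node = tokens[0][1]
    pos = 1
    while tokens[pos] == ('OP', '+'):
        node = ('+', node, tokens[pos + 1][1])
        pos += 2
    return node

print(parse_recursive(TOKENS))  # ('+', 1, ('+', 2, 3))
print(parse_iterative(TOKENS))  # ('+', ('+', 1, 2), 3)
```

Swapping models is a local change to the parser, precisely because the token definitions stay fixed.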
- Performance: do the easy thing (lexing) with the fast algorithm, and the hard thing (parsing) with the slow algorithm.
  - Lexing with regexes is nice because it can be done in `O(n)` time and `O(1)` space (using automata-based techniques). Many lexers are also straightforward to write by hand. (A one-pass sketch follows this list.)
  - There is essentially one algorithm for lexing -- march forward through the input exactly once.
  - Parsing CFGs is `O(n^3)` in general, and parsing PEGs either takes `O(n)` space (packrat parsing) or exponential time (backtracking).
  - There are many algorithms for parsing, each with a complex set of tradeoffs. Also see Parsing Models Cheatsheet.
  - Important point: separating lexing and parsing reduces the n in your `O(n^3)`. In any real program, there are many more characters than tokens.
  - Case study: Speeding up the Sixty compiler. This is a compiler written in Haskell with a front end that used parser combinators. Parsing took 45% of the time at first; after other optimizations, it took 30% of the time. From Optimization 7: Separate Lexer: "So the change I made here is to write a faster lexer that's separate from the parser, and then make the parser work on the list of tokens that the lexer spits out. This sped up the program by 37%, which was one of the bigger wins for the whole compiler."
  - Case study: Lessons learned from writing ShellCheck, GitHub’s now most starred Haskell project. This part is an argument for a separate lexer, which Oil has: "Be careful if your parser framework makes it too easy to backtrack. Look up good parser design. I naively wrote a character based parser function for each construct like ${..}, $(..), $'..', etc, and now the parser has to backtrack a dozen times to try every possibility when it hits a $. With a tokenizer or a parser that read $ followed by {..}, (..) etc, it would have been much faster."
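Here's the one-pass lexer sketch promised above. One hedge: Python's `re` module is a backtracking engine, so the strict `O(n)` time / `O(1)` space guarantee really belongs to automata-based tools like re2c or RE2; the single-forward-pass structure is the same either way. The token names are illustrative:

```python
import re

# One master regex: each alternative is a named token type.
LEXER_RE = re.compile(r'''
    (?P<NUM>   \d+)
  | (?P<NAME>  [A-Za-z_]\w*)
  | (?P<OP>    [+\-*/()=])
  | (?P<SPACE> \s+)
''', re.VERBOSE)

def lex(s):
    pos = 0
    while pos < len(s):
        m = LEXER_RE.match(s, pos)
        if not m:
            raise SyntaxError('bad character at position %d' % pos)
        pos = m.end()               # always advances: the single forward march
        if m.lastgroup != 'SPACE':  # skip whitespace tokens
            yield m.lastgroup, m.group()
    yield 'EOF', ''

print(list(lex('x = 10 + y2')))
# [('NAME', 'x'), ('OP', '='), ('NUM', '10'), ('OP', '+'), ('NAME', 'y2'), ('EOF', '')]
```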
- Easier composition of sublanguages with modal lexers. On this thread, there seemed to be the misconception that lexers inhibit language composition. In fact it's just the opposite -- the lexer can take a mode parameter, and each mode corresponds to the lexical structure of a different sublanguage (templated strings, etc.). A sketch follows this list.
  - How OSH Uses Lexer Modes
  - When Are Lexer Modes Useful?
  - List of Sublanguages -- there are 4 main composed sublanguages, as well as 5 minor ones, all of which have recursive structure.
  - More posts tagged #lexing
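A hedged sketch of the mode parameter (the modes and rules below are invented for illustration, loosely inspired by shell's unquoted vs. double-quoted contexts; they are not OSH's actual lexer modes):

```python
import re

# Each mode is a separate rule set; the parser decides which mode to lex in.
RULES = {
    'OUTER': [  # normal command context
        (re.compile(r'\$\{'),      'VAR_BEGIN'),
        (re.compile(r'"'),         'DQ_BEGIN'),   # parser switches to 'DQ'
        (re.compile(r'[^${"\s]+'), 'WORD'),
        (re.compile(r'\s+'),       'SPACE'),
    ],
    'DQ': [     # inside double quotes: fewer metacharacters are special
        (re.compile(r'\$\{'),      'VAR_BEGIN'),
        (re.compile(r'"'),         'DQ_END'),     # parser switches back
        (re.compile(r'[^${"]+'),   'LITERAL'),
    ],
}

def next_token(s, pos, mode):
    for pattern, kind in RULES[mode]:
        m = pattern.match(s, pos)
        if m:
            return kind, m.group(), m.end()
    raise SyntaxError('no rule matched in mode %r at position %d' % (mode, pos))

# The parser drives the mode: after DQ_BEGIN it lexes in 'DQ' until DQ_END.
print(next_token('echo "hi ${x}"', 5, 'OUTER'))  # ('DQ_BEGIN', '"', 6)
print(next_token('echo "hi ${x}"', 6, 'DQ'))     # ('LITERAL', 'hi ', 9)
```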
- Ease of Development / Debugging -- In my experience, it is hard to debug PEGs for large grammars, because lexing and parsing are intermingled. This jibes with the comments here and below:
> I used to use a PEG for my language, but I had to move away because the error messages were so bad I couldn't iterate on the parser because I couldn't understand what was going wrong.
Threads/comments:
- https://www.reddit.com/r/ProgrammingLanguages/comments/a80stl/are_there_any_languages_that_use_a_peg_in_their/ec9lvko/
- https://www.reddit.com/r/ProgrammingLanguages/comments/9plvqa/question_about_language_creation_tools/e82yvvy/?context=8&depth=9
Not a strawman: https://fosdem.org/2019/schedule/event/perl6newtool/ -- "The great part is that you no longer need to split your language implementation in traditional phases: lexer, parser, etc." IMO this is a mistake, at least for medium/large languages.
- Consistent handling of Location Info -- You will likely want to attach filename/line/column information to tokens in the lexer. Then the parser can be ignorant of this concern.
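For example, the lexer can stamp each token as it scans (a minimal sketch; the `Token` fields and the whitespace-splitting rule are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str
    value: str
    line: int  # 1-based line number
    col: int   # 1-based column number

def lex(src):
    """Yield whitespace-separated words, each stamped with its source location."""
    line, col = 1, 1
    word, word_line, word_col = '', 1, 1
    for ch in src + '\n':  # sentinel newline flushes the final word
        if ch.isspace():
            if word:
                yield Token('WORD', word, word_line, word_col)
                word = ''
            if ch == '\n':
                line, col = line + 1, 1
            else:
                col += 1
        else:
            if not word:
                word_line, word_col = line, col
            word += ch
            col += 1

for tok in lex('echo hi\n  ls'):
    print(tok)
# Token(kind='WORD', value='echo', line=1, col=1)
# Token(kind='WORD', value='hi', line=1, col=6)
# Token(kind='WORD', value='ls', line=2, col=3)
```

The parser then copies locations from tokens into tree nodes, so error messages can point at precise source positions without the parser ever counting lines or columns itself.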