andychu edited this page Nov 16, 2017 · 20 revisions

These facts are useful for the parsing contest.

Shell WTFs

Facts

  • 15 lexer modes (lexical states)
  • 233 IDs (token types / node types) in 23 kinds (core/id_kind_test.py shows this)
  • 3 recursive descent parsers (command, word, [[)
  • 1 Pratt parser (arithmetic)
  • [ fallback reuses osh/bool_parser
  • TODO: modify asdl.py to show these stats?
    • X product types
    • X sum types with X alternatives

Facts Requiring Dynamic Instrumentation

  • What CPython opcodes does it use?
  • How many lines of code does it use in CPython? (Compare with execution.)
  • What is the distribution of ASDL string and array lengths per node type?
    • note: there are several uses of string, not just token. Is this a good or bad optimization?

Other Parser Components That Do String Manipulation

  • Brace detection -- a separate metaprogramming pass (it doesn't depend on input). It is a recursive parser, although it operates entirely on token types and not chars/strings?
  • Per-Word Algorithms
    • core/glob_.py
      • LooksLikeGlob
      • GlobEscape
      • GlobUnescape (in case of no matches, may not be necessary)
    • regex escape, for passing to regcomp() (not done yet)
  • checking validity of names:
    • for invalid-var in a b; do ...
    • readonly invalid-var

Runtime String Manipulation

  • core/word_eval.py -- after evaluating VarOp arguments, we compile globs to Python regexes, e.g. for ${x%foo*}
  • IFS splitting (this is quite slow and needs to be sped up!)
  • core/args.py -- this is not a recursive parser
  • echo -e -- backslash escapes (and printf if it turns out we need it as a builtin)
  • read without -r -- backslash escapes are parsed
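
As an illustration of the glob-to-regex step mentioned for `${x%foo*}`, here is a minimal sketch. It is not the code in core/word_eval.py: `glob_to_regex` and `strip_shortest_suffix` are hypothetical names, and only `*`, `?`, and backslash escapes are handled (no character classes).

```python
import re


def glob_to_regex(pat):
    """Translate a small glob subset to a Python regex string."""
    out = []
    i = 0
    n = len(pat)
    while i < n:
        c = pat[i]
        if c == '\\' and i + 1 < n:
            i += 1
            out.append(re.escape(pat[i]))  # escaped char is literal
        elif c == '*':
            out.append('.*')
        elif c == '?':
            out.append('.')
        else:
            out.append(re.escape(c))
        i += 1
    return ''.join(out)


def strip_shortest_suffix(s, pat):
    """Emulate ${x%pat}: remove the shortest suffix of s matching pat."""
    regex = re.compile(glob_to_regex(pat) + r'\Z', re.DOTALL)
    # Try candidate suffixes from shortest (empty) to longest (all of s).
    for start in range(len(s), -1, -1):
        if regex.match(s, start):
            return s[:start]
    return s  # no suffix matched; value is unchanged
```

Usage, under the assumptions above:

```python
strip_shortest_suffix('barfoobazfoo', 'foo*')  # 'barfoobaz'
```

Trying every start position is quadratic in the worst case, so a real implementation would likely want something faster.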

Other Notes on Porting to C++

  • Polymorphism:
    • FileLineReader, StringLineReader, VirtualLineReader for here docs.
    • BoolParser can take test_builtin._StringWordEmitter or WordParser
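
The line reader polymorphism above could be sketched roughly as follows. The class names come from the list, but the `get_line` method, the constructors, and the `read_all` helper are assumptions for illustration, not the actual interface.

```python
import io


class LineReader(object):
    """Assumed common interface: get_line() returns a line, or None at EOF."""

    def get_line(self):
        raise NotImplementedError


class FileLineReader(LineReader):
    def __init__(self, f):
        self.f = f

    def get_line(self):
        line = self.f.readline()
        return line if line else None


class StringLineReader(FileLineReader):
    """Reads lines from an in-memory string."""

    def __init__(self, s):
        FileLineReader.__init__(self, io.StringIO(s))


class VirtualLineReader(LineReader):
    """Replays lines that were already read, e.g. a here doc body."""

    def __init__(self, lines):
        self.lines = list(lines)
        self.pos = 0

    def get_line(self):
        if self.pos >= len(self.lines):
            return None
        line = self.lines[self.pos]
        self.pos += 1
        return line


def read_all(reader):
    """Works with any reader; this is the call site a C++ port must type."""
    lines = []
    while True:
        line = reader.get_line()
        if line is None:
            return lines
        lines.append(line)
```

In Python the shared interface is implicit (duck typing). A C++ port would presumably need an abstract base class with a virtual method, or a template parameter, at each such call site.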