Skip to content
Adrian Istrate edited this page Nov 24, 2013 · 3 revisions

Implemented Regex Features

  • Character escapes:
    • any character except for one of .$^{[(|)*+?\ matches itself
    • \n: new line
    • \r: carriage return
    • \t: tab
    • \b: backspace (only inside a character class)
    • \nnn: ASCII character, where nnn is a two- or three-digit octal character code
    • \xnn: ASCII character, where nn is a two-digit hexadecimal character code
    • \unnnn: UTF-16 code unit whose value is nnnn hexadecimal
    • \ followed by a character not recognized as escaped matches that character
  • Character classes:
    • . matches any character except \n
    • positive character groups (e.g., [aeiou], [a-zA-Z], [abcA-H\d\n])
    • negative character groups (e.g., [^a-zA-Z])
    • named character classes:
      • \w: a word character; same as [0-9A-Z_a-z]
      • \W: a non-word character; same as [^0-9A-Z_a-z]
      • \s: a whitespace character; same as [ \n\r\t]
      • \S: a non-whitespace character; same as [^ \n\r\t]
      • \d: a digit character; same as [0-9]
      • \D: a non-digit character; same as [^0-9]
    • character class subtraction (e.g., [0-9-[246]] matches any digit except for 2, 4 and 6)
  • Grouping (without capturing): (subexpr)
  • Quantifiers:
    • Greedy: *, +, ?, {n}, {n,}, {n,m}
    • Lazy: *?, +?, ??, {n}?, {n,}?, {n,m}?
    The difference between greedy and lazy quantifiers is in how they control backtracking. _Greedy quantifiers_ will first try to match as _many_ characters as possible. Then, if the rest of the regex does not match, they will backtrack to matching one character _less_, then try again on the rest of the regex--and so on, one character _less_ every time. _Lazy quantifiers_, on the other hand, will first try to match as _few_ characters as possible, then backtrack to matching one character _more_ every time.
  • Alternation: |
  • Anchors:
    • ^: start of string or line (depending on the Multiline option)
    • $: end of string or line (depending on the Multiline option)
    • \A: start of string only
    • \Z: end of string or before ending newline
    • \z: end of string only
    • \G: contiguous match (must start where previous match ended)
    • \b: word boundary
    • \B: non-word boundary
  • Regex options (global):
    • IgnoreCase
    • Multiline: ^ and $ will match the beginning and end of each line (instead of the beginning and end of the input string)
    • Singleline: the period (.) will match every character (instead of every character except \n)

See also: Missing Regex Features.

Architecture

The Regex Parser has three layers, corresponding to three parsing phases:

  1. Parsing the regex pattern, resulting in an Abstract Syntax Tree (AST)
  2. Transforming the AST
  3. Parsing the target string using the AST

Phases 1 and 2 happen only once for a given regex. Phase 3 may happen multiple times, for different target strings.

The Parser Type

The Parser type is defined as:

public delegate Result<TToken, TTree> Parser<TToken, TTree>(IConsList<TToken> consList);

public class Result<TToken, TTree>
{
    public Result(TTree tree, IConsList<TToken> rest)
    {
        Tree = tree;
        Rest = rest;
    }
    public TTree Tree { get; private set; }
    public IConsList<TToken> Rest { get; private set; }
}

public interface IConsList<T>
{
    T Head { get; }
    IConsList<T> Tail { get; }

    bool IsEmpty { get; }
    int Length { get; }
}

A parser is simply a function that takes a list of tokens (e.g., characters), and returns a syntax tree and the list of unconsumed tokens. To indicate failure to match, it will return null.

The Parser type is equivalent to the following Haskell type:

newtype Parser token tree = Parser ([token] -> Maybe (tree, [token]))

NOTE: The idea of parser combinators came from these articles:

In the articles, the type is defined similarly to:

newtype Parser token tree = Parser ([token] -> [(tree, [token])])

This allows the parser to be ambiguous (to be able to parse a string in multiple ways). The parser will return either a list of one or more "success" alternatives, or an empty list to indicate failure.

As the regex syntax is non-ambigious, the Maybe definition was the one preferred.

Parser Combinators in C#

The following parser combinators have been defined (see definitions):

  • Choice
  • Option
  • Many
  • Many1
  • Count
  • Sequence
  • PrefixedBy
  • Between
  • SepBy
  • SepBy1
  • NotFollowedBy
  • Eof
  • AnyToken
  • Succeed
  • Fail

For example, the Choice combinator is defined as:

// Try to apply the parsers in the 'choices' list in order, until one succeeds.
// Return the tree returned by the succeeding parser.
public static Parser<TToken, TTree> Choice<TTree>(params Parser<TToken, TTree>[] choices)
{
    return consList =>
    {
        foreach (var parser in choices)
        {
            var result = parser(consList);
            if (result != null)
                return result;
        }
        return null;
    };
}

Besides combinators, there are also a number of "primitive" character parsers (see definitions):

  • AnyChar
  • Satisfy
  • Char
  • Digit
  • OctDigit
  • HexDigit
  • OneOf
  • NoneOf

Each of these will match exactly one character.

Missing Regex Features

Still on the TODO list:

  • Groups:
    • capturing groups
    • backreferences
    • named groups
    • atomic groups (non-backtracking)
  • Substitution:
    • Regex.Replace() method
    • substitution patterns: $$, $1, etc.
  • Look-ahead
  • Look-behind
  • Regex options:
    • IgnorePatternWhitespace
    • ExplicitCapture
    • inline options (as opposed to global)
  • Comments in patterns: (?#comment)