Regex Parser

Implemented Regex Features

Character escapes:
- any character except for one of .$^{[(|)*+?\ matches itself
- \n: new line
- \r: carriage return
- \t: tab
- \b: backspace (only inside a character class)
- \nnn: ASCII character, where nnn is a two- or three-digit octal character code
- \xnn: ASCII character, where nn is a two-digit hexadecimal character code
- \unnnn: UTF-16 code unit whose value is nnnn hexadecimal
- \ followed by a character not recognized as escaped matches that character
Character classes:
- . matches any character except \n
- positive character groups (e.g., [aeiou], [a-zA-Z], [abcA-H\d\n])
- negative character groups (e.g., [^a-zA-Z])
- named character classes:
  - \w: a word character; same as [0-9A-Z_a-z]
  - \W: a non-word character; same as [^0-9A-Z_a-z]
  - \s: a whitespace character; same as [ \n\r\t]
  - \S: a non-whitespace character; same as [^ \n\r\t]
  - \d: a digit character; same as [0-9]
  - \D: a non-digit character; same as [^0-9]
- character class subtraction (e.g., [0-9-[246]] matches any digit except for 2, 4 and 6)
Grouping (without capturing): (subexpr)
Quantifiers:
- Greedy: *, +, ?, {n}, {n,}, {n,m}
- Lazy: *?, +?, ??, {n}?, {n,}?, {n,m}?
The difference between greedy and lazy quantifiers is in how they control backtracking. _Greedy quantifiers_ will first try to match as _many_ characters as possible. Then, if the rest of the regex does not match, they will backtrack to matching one character _less_, then try again on the rest of the regex--and so on, one character _less_ every time. _Lazy quantifiers_, on the other hand, will first try to match as _few_ characters as possible, then backtrack to matching one character _more_ every time.
Alternation: |
Anchors:
- ^: start of string or line (depending on the Multiline option)
- $: end of string or line (depending on the Multiline option)
- \A: start of string only
- \Z: end of string or before ending newline
- \z: end of string only
- \G: contiguous match (must start where previous match ended)
- \b: word boundary
- \B: non-word boundary
Regex options (global):
- IgnoreCase
- Multiline: ^ and $ will match the beginning and end of each line (instead of the beginning and end of the input string)
- Singleline: the period (.) will match every character (instead of every character except \n)

Architecture

The Regex Parser has three layers, corresponding to three parsing phases:

Parsing the regex pattern, resulting in an Abstract Syntax Tree (AST)
Transforming the AST
Parsing the target string using the AST

Phases 1 and 2 happen only once for a given regex. Phase 3 may happen multiple times, for different target strings.

The Parser Type

The Parser type is defined as:

public delegate Result<TToken, TTree> Parser<TToken, TTree>(IConsList<TToken> consList);

public class Result<TToken, TTree>
{
    public Result(TTree tree, IConsList<TToken> rest)
    {
        Tree = tree;
        Rest = rest;
    }
    public TTree Tree { get; private set; }
    public IConsList<TToken> Rest { get; private set; }
}

public interface IConsList<T>
{
    T Head { get; }
    IConsList<T> Tail { get; }

    bool IsEmpty { get; }
    int Length { get; }
}

A parser is simply a function that takes a list of tokens (e.g., characters), and returns a syntax tree and the list of unconsumed tokens. To indicate failure to match, it will return null.

The Parser type is equivalent to the following Haskell type:

newtype Parser token tree = Parser ([token] -> Maybe (tree, [token]))

NOTE: The idea of parser combinators came from these articles:

Monadic Parsing in Haskell (Hutton, Meijer) (1998)

Parsec, a fast combinator parser (Leijen) (2001)

In the articles, the type is defined similarly to:

newtype Parser token tree = Parser ([token] -> [(tree, [token])])

This allows the parser to be ambiguous (to be able to parse a string in multiple ways). The parser will return either a list of one or more "success" alternatives, or an empty list to indicate failure.

As the regex syntax is non-ambigious, the Maybe definition was the one preferred.

Parser Combinators in C#

The following parser combinators have been defined (see definitions):

Choice
Option
Many
Many1
Count
Sequence
PrefixedBy
Between
SepBy
SepBy1
NotFollowedBy
Eof
AnyToken
Succeed
Fail

For example, the Choice combinator is defined as:

// Try to apply the parsers in the 'choices' list in order, until one succeeds.
// Return the tree returned by the succeeding parser.
public static Parser<TToken, TTree> Choice<TTree>(params Parser<TToken, TTree>[] choices)
{
    return consList =>
    {
        foreach (var parser in choices)
        {
            var result = parser(consList);
            if (result != null)
                return result;
        }
        return null;
    };
}

Besides combinators, there are also a number of "primitive" character parsers (see definitions):

AnyChar
Satisfy
Char
Digit
OctDigit
HexDigit
OneOf
NoneOf

Each of these will match exactly one character.

Missing Regex Features

Still on the TODO list:

Groups:
- capturing groups
- backreferences
- named groups
- atomic groups (non-backtracking)
Substitution:
- Regex.Replace() method
- substitution patterns: $$, $1, etc.
Look-ahead
Look-behind
Regex options:
- IgnorePatternWhitespace
- ExplicitCapture
- inline options (as opposed to global)
Comments in patterns: (?#comment)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly