-
Notifications
You must be signed in to change notification settings - Fork 3
Regex Parser
- Character escapes:
- any character except for one of
.$^{[(|)*+?\
matches itself -
\n
: new line -
\r
: carriage return -
\t
: tab -
\b
: backspace (only inside a character class) -
\nnn
: ASCII character, wherennn
is a two- or three-digit octal character code -
\xnn
: ASCII character, wherenn
is a two-digit hexadecimal character code -
\unnnn
: UTF-16 code unit whose value isnnnn
hexadecimal -
\
followed by a character not recognized as escaped matches that character
- any character except for one of
- Character classes:
-
.
matches any character except\n
- positive character groups (e.g.,
[aeiou]
,[a-zA-Z]
,[abcA-H\d\n]
) - negative character groups (e.g.,
[^a-zA-Z]
) - named character classes:
-
\w
: a word character; same as[0-9A-Z_a-z]
-
\W
: a non-word character; same as[^0-9A-Z_a-z]
-
\s
: a whitespace character; same as[ \n\r\t]
-
\S
: a non-whitespace character; same as[^ \n\r\t]
-
\d
: a digit character; same as[0-9]
-
\D
: a non-digit character; same as[^0-9]
-
- character class subtraction (e.g.,
[0-9-[246]]
matches any digit except for 2, 4 and 6)
-
- Grouping (without capturing):
(subexpr)
- Quantifiers:
- Greedy:
*
,+
,?
,{n}
,{n,}
,{n,m}
- Lazy:
*?
,+?
,??
,{n}?
,{n,}?
,{n,m}?
The difference between greedy and lazy quantifiers is in how they control backtracking. _Greedy quantifiers_ will first try to match as _many_ characters as possible. Then, if the rest of the regex does not match, they will backtrack to matching one character _less_, then try again on the rest of the regex--and so on, one character _less_ every time. _Lazy quantifiers_, on the other hand, will first try to match as _few_ characters as possible, then backtrack to matching one character _more_ every time.
- Greedy:
- Alternation:
|
- Anchors:
-
^
: start of string or line (depending on theMultiline
option) -
$
: end of string or line (depending on theMultiline
option) -
\A
: start of string only -
\Z
: end of string or before ending newline -
\z
: end of string only -
\G
: contiguous match (must start where previous match ended) -
\b
: word boundary -
\B
: non-word boundary
-
- Regex options (global):
IgnoreCase
-
Multiline
:^
and$
will match the beginning and end of each line (instead of the beginning and end of the input string) -
Singleline
: the period (.
) will match every character (instead of every character except\n
)
See also: Missing Regex Features.
The Regex Parser has three layers, corresponding to three parsing phases:
- Parsing the regex pattern, resulting in an Abstract Syntax Tree (AST)
- Transforming the AST
- Parsing the target string using the AST
Phases 1 and 2 happen only once for a given regex. Phase 3 may happen multiple times, for different target strings.
The Parser
type is defined as:
public delegate Result<TToken, TTree> Parser<TToken, TTree>(IConsList<TToken> consList);
public class Result<TToken, TTree>
{
public Result(TTree tree, IConsList<TToken> rest)
{
Tree = tree;
Rest = rest;
}
public TTree Tree { get; private set; }
public IConsList<TToken> Rest { get; private set; }
}
public interface IConsList<T>
{
T Head { get; }
IConsList<T> Tail { get; }
bool IsEmpty { get; }
int Length { get; }
}
A parser is simply a function that takes a list of tokens (e.g., characters), and returns a syntax tree and the list of unconsumed tokens. To indicate failure to match, it will return null
.
The Parser
type is equivalent to the following Haskell type:
newtype Parser token tree = Parser ([token] -> Maybe (tree, [token]))
NOTE: The idea of parser combinators came from these articles:
- Monadic Parsing in Haskell (Hutton, Meijer) (1998)
- Parsec, a fast combinator parser (Leijen) (2001)
In the articles, the type is defined similarly to:
newtype Parser token tree = Parser ([token] -> [(tree, [token])])
This allows the parser to be ambiguous (to be able to parse a string in multiple ways). The parser will return either a list of one or more "success" alternatives, or an empty list to indicate failure.
As the regex syntax is non-ambigious, the
Maybe
definition was the one preferred.
The following parser combinators have been defined (see definitions):
Choice
Option
Many
Many1
Count
Sequence
PrefixedBy
Between
SepBy
SepBy1
NotFollowedBy
Eof
AnyToken
Succeed
Fail
For example, the Choice
combinator is defined as:
// Try to apply the parsers in the 'choices' list in order, until one succeeds.
// Return the tree returned by the succeeding parser.
public static Parser<TToken, TTree> Choice<TTree>(params Parser<TToken, TTree>[] choices)
{
return consList =>
{
foreach (var parser in choices)
{
var result = parser(consList);
if (result != null)
return result;
}
return null;
};
}
Besides combinators, there are also a number of "primitive" character parsers (see definitions):
AnyChar
Satisfy
Char
Digit
OctDigit
HexDigit
OneOf
NoneOf
Each of these will match exactly one character.
Still on the TODO list:
- Groups:
- capturing groups
- backreferences
- named groups
- atomic groups (non-backtracking)
- Substitution:
-
Regex.Replace()
method - substitution patterns:
$$
,$1
, etc.
-
- Look-ahead
- Look-behind
- Regex options:
IgnorePatternWhitespace
ExplicitCapture
- inline options (as opposed to global)
- Comments in patterns:
(?#comment)