Can you write a lexer in a single expression? Now you can.
- Performance improved by a factor of 500.
- Removed ILinePositionCalculator interface and everything related. (The cost of the abstraction was too high.)
Here is simple working example how you would write a simple lexer for an arithmetic parser.
Define your tokens by marking them with the IToken
interface.
public record PlusToken : IToken;
public record MinusToken : IToken;
public record MultiplicationToken : IToken;
public record DivisionToken : IToken;
You have to define an EpsilonToken marked with the IEpsilonToken
interface. This will be returned by the ILexemeReader
when you read past the end.
This also means you should not construct this Token on your own.
public record EpsilonToken : IEpsilonToken;
Create all the rules you need to create the lexer tokens.
LexerResult result = LexerRuleBook.Builder
.AddRule(char.IsDigit, ScanNumber)
.AddSimpleRule<PlusToken>("+")
.AddSimpleRule<MinusToken>("-")
.AddSimpleRule<MultiplicationToken>("*")
.AddSimpleRule<DivisionToken>("/")
.WithEpsilonToken<EpsilonToken>()
.Build()
.Scan("40+20*6");
Use the ILexemeBuilder
to build any token you want, and the builder is taking care of the position and length of your token.
private static Lexeme ScanNumber(ILexemeBuilder builder)
=> builder.Peek().Match(none: false, some: char.IsDigit)
? ScanNumber(builder.Retain())
: builder.Build(new NumberToken(int.Parse(builder.CurrentToken)));
The LexerResult
will have a sequence of lexems which would look like this:
Lexeme(NumberToken(40), Position(0, 1, 1, 2))
Lexeme(PlusToken(), Position(2, 1, 3, 1))
Lexeme(NumberToken(20), Position(3, 1, 4, 2))
Lexeme(MultiplicationToken(), Position(5, 1, 6, 1))
Lexeme(NumberToken(6), Position(6, 1, 7, 1))
You always start with an empty LexerRuleBook.Builder
which is already in its default configuration, you just have to define your rules and an epsilon token to start.
You can add rules with the following methods:
AddRule(Predicate<char> symbolPredicate, Lexeme.Factory createLexeme, int weight = DefaultWeight)
AddSimpleRule<TToken>(string textSymbol)
AddRuleWithContext(Predicate<char> symbolPredicate, Predicate<IReadOnlyList<Lexeme>> contextPredicate, Lexeme.Factory createLexeme, int weight)
AddSimpleRuleWithContext<TToken>(string textSymbol, Predicate<IReadOnlyList<Lexeme>> contextPredicate, int weight)
The Simple rules create a token of type TToken
when they see the specified textSymbol
the others only look at the first character to decide which rules to invoke.
AddSimpleRule
has no weight
parameter, that is because the weight
always equals the length of the string to avoid overlapping conflicts.
A simple rule defined like this has therefore the weight
of 5.
AddSimpleRule<StringToken>("string")
This means you can define a concurrent simple rules for =
and ==
and it will lex them as expected.
The WithContext rules have an additional contextPredicate which can decide if the rule should be activated or not. That way you can produce different symbols with the same input, for details look at Lexemes dependent on context
You can implement your own rules also by implementing the ILexerRule interface yourself. This will be necessary if you need to look ahead more than one character and the simple rule is not enough for your use case.
AddRule(ILexerRule rule)
You have to define an epsilon token to mark the end of your token-stream, it should only be built by the ILexemeWalker
and not by hand.
LexerRuleBook.Builder
.WithEpsilonToken<EofToken>()
Sometimes you don't want to deal with every Token in your parser. A simple way to remove certain nodes, or rewrite all your lexemes is to define a post-processing function which takes the list of lexemes and transform it to another list of lexemes.
Here is a simple example for a parser who is not interested in the the whitespace after lexing.
LexerRuleBook.Builder
.WithPostProcess(lexemes => lexemes.Where(t => t.Token.GetType() != typeof(WhiteSpaceToken)))
But you can also inject your own ILexemeWalker to do such tasks.
In simple lexer scenerios you should be fine with the default implemenations but if you reach the limit of the lexer, there is a simple way to add new unlimited functionality:
You can define your own implementations for ILexerReader
, ILexemeWalker
, ILexemeBuilder
in complex scenarios where you need more control over the lexing process.
WithLexerReader(ILexerReader.Factory newLexerReader)
WithLexemeWalker(ILexemeWalker.Factory newLexemeWalker)
WithLexemeBuilder(ILexemeBuilder.Factory newLexemeBuilder)
This way you can change every implementation detail. If you want to know more you should go to the Change the default implementations chapter.
You can only call the Build()
method if you already defined an epsilon token.
The Build()
method will return a LexerRuleBook
which has all the information needed to transform your expressions to lexemes.
// This is valid and defines an empty lexer which will throw on any input.
LexerRuleBook.Builder
.WithEpsilonToken<EofToken>()
.Build();
The ILexemeBuilder
helps you to create the lexeme, it keeps the state and automates the position calculations.
The most important function to investigate is the Peek
function. It gives you the current (or any later) character or it will return Option<char>.None
if you are at the end of the expression.
Option<char> Peek(int lookAhead = NoLookAhead);
The Peek
function does not advance the position, to advance the Position
there are the Retain()
and Discard()
methods.
Retain()
will advance thePosition
and add the current character at the end ofCurrentToken
.Discard()
will only advance thePosition
and discard the current character.
Sometimes you want to start over with the CurrentToken
, in that case use Clear()
.
You can look at the underlying absolute Position
if you want, but it is usually not necessary.
At the end you have to build a lexeme, a lexeme is the token you are constructing and the position information (Start, End, Line etc.).
The Build(IToken token)
method on the ILexemeBuilder
will construct everything for you, you just have to create the Token and return the Lexeme.
Here we have a realistic more complex example where we want to create a NumberToken
which represent a floating point number.
private static Lexeme ScanNumber(ILexemeBuilder builder)
{
var decimalExists = false;
// we iterate as long we see a dot or a digit.
while (builder.Peek().Match(none: false, some: c => c is '.' || char.IsDigit(c)))
{
var digit = builder.Peek();
var isDot = from d in digit select d is '.';
// only one dot is allowed so we throw an exception if we find a second one.
if (isDot.Match(none: false, some: Identity))
{
if (decimalExists)
{
throw new LexerException("Multiple dots in decimal number");
}
decimalExists = true;
}
builder.Retain();
}
// here we build the lexeme based on our double we constructed and put in a NumberToken.
return double.TryParse(builder.CurrentToken, out var number)
? builder.Build(new NumberToken(number))
: throw new Exception("Could not parse number: " + builder.CurrentToken);
}
}
The default line position calculation is very simple,
you designate one ore more of your tokens as ILineBreakToken
and the lexer is
automatically calculating the position of each lexeme based on that in the result.
public record NewLineToken : ILineBreakToken;
The absolute positions (StartPosition
, EndPosition
) are zero-based indexed because they usually are used in a programming context where this is more natural.
The line and column positions (Line
,StartColumn
,EndColumn
) are one-based indexed because these positions usually are shown in the output for end users.
If you need anything more complicated you should read: Change the default implementations.
Sometimes you want to create different tokens for the same input.
In such cases you want AddRuleWithContext
or AddSimpleRuleWithContext
on the LexerRuleBookBuilder
, these rules can be activated arbitrarly by the lexemes which came before.
These rules have an additional parameter, the contextPredicate
. These rules are only activated when this predicate returns true
. The predicate works on the list of the lexemes already produced.
Lets say we want to have a token class
, but we want to allow to use it as an identifier if the ClassToken
already appeared previously.
LexerRuleBook.Builder
.AddRule(char.IsLetter, ScanIdentifier)
.AddSimpleRuleWithContext<ClassToken>("class", context => context.All(lexeme => lexeme is not { Token: ClassToken }), 1)
We first define a rule which creates the identifier token, this works as usual and has the default weight of 0. Then we add a rule with context and restrict the usage of it to contexts which have no ClassToken so far. We give this a weight of 1 so this matches first if it is activated.
For the example input below this means the first class will be created as ClassToken()
and the second one will be created as IdentifierToken("class")
class MySuperClass
{
function DoSomething
{
let class = anything
}
}
You have even more freedoms if you implement the ILexerRule yourself. The interface is very simple and for simple cases you usually only have to implement the Match function.
public interface ILexerRule
{
int Weight { get; }
Option<Lexeme> Match(ILexemeBuilder builder);
bool IsActive(IReadOnlyList<Lexeme> context);
}
The Weight
just defines the precedence of the rules if more than one rule matches.
IsActive
is usually true except when your rule depends on the context, then you have to write a predicate on the context.
The Match function works very similar to the lexeme factory, when the rule matches it should return a lexeme, but if it does not it needs to return Option<Lexeme>.None
.
Attention: If you return Option<Lexeme>.None
you should only have used Peek to inspect the underyling stream. You should not change the underlying stream by advancing it with Retain()
or Discard()
.
This is illustrated in this simple example, where we call Discard()
to advance the stream only when we create a Lexeme
.
public Option<Lexeme> Match(ILexemeBuilder builder)
{
if (builder.Peek() is '=')
{
builder.Discard()
return builder.Build(new EqualToken())
} else {
return Option<Lexeme>.None;
}
The LexerRuleBook sets up the default implementations for:
ILexerReader
ILexemeWalker
ILexemeBuilder
You change them while setting up your LexerRuleBook
the methods start with With
and are described in Configure the complex szenarios.
Scenario: you want to manipulate the input before it is used by the logic.
ILexerReader
has two methods and a property you need to implement.
int Position { get; }
Option<char> Peek(int lookAhead = NoLookAhead);
Option<char> Read();
When registering the reader you have access to the expression
as given to the Scan
method. The expression
is therefore a string
.
.WithLexerReader(expression => new TrimLexerReader(expression))
Scenario: you want to manipulate the result after the lexing.
ILexemeWalker
is usually stateful (track the position and change it on Pop
) and has two methods to implemlent.
Lexeme Pop();
Lexeme Peek(int lookAhead = NoLookAhead);
When registering the reader you have access to the lexemes
and the newEpsilonToken
factory delegate.
The lexemes
is a IReadOnlyList<Lexeme>
which is the list of all lexemes produced by the expression.
The newEpsilonToken
is a simple factory which returns the configured epsilon token.
The ILexemeWalker
returns a lexeme on Pop()
and Peek()
, even if you read past the end. In those cases the ILexemeWalker
returns the epsilon token.
.WithLexemeWalker((lexemes, newEpsilonToken) => new SkipLexemeWalker(lexemes, newEpsilonToken, skippables))
Scenario: If you want to do anything else not described in this document yet.
By replacing the ILexemeBuilder
you are basically taking control over everything, you can ignore most of the default implementation and do whatever you want.
At this point you have to ask yourself, is this the right library for you?
The ILexemeBuilder
is a bit more involved, if you really want to change this implementation refer to the source and the examples in the test.
When registering the ILexemeBuilder
you have access to all the internas of the implementation.
This means you have full control to change basically everything.
.WithLexemeBuilder((reader, linePositionCalculator) => new FakeDigitLexemeBuilder(reader, linePositionCalculator))