Allow controlling lexer state in use sites? #70

Open
osa1 opened this issue Jan 22, 2025 · 0 comments

Consider parsing interpolated expressions in string literals: https://langdev.stackexchange.com/questions/243.

Ideally, an interpolated expression should itself be allowed to contain strings with interpolations. For example, this works in Dart:

void main() {
  print("--- ${f("as${1.toString()}df")} ---");
}

f(String s) => s;

One way to parse this would be to generate a "start interpolation" event when lexing string literals. For the string above, this would generate:

  • StringStart
  • InterpolationStart

After the second event, we want the lexer to stop lexing string contents and lex arbitrary tokens instead (i.e. go back to the initial state), which is easy to do.

However, after generating the } that terminates the interpolation, the lexer doesn't know that it was tokenizing an interpolation, so it can't revert to the "tokenize string" state.

The parser knows that the } terminates the interpolation, but currently there is no way to update the lexer state outside of a lexer semantic action function, so the parser cannot tell the lexer to switch to the top state or the string state.

We should add a public method to the generated lexers that sets the lexer state from the call site, to allow this kind of thing.

In an LR(1)/LALR(1) parser, this method would be called in the semantic action that produces an interpolated expression, as lexer.switch_(LexerState::String).
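
To make the proposal concrete, here is a minimal hand-written sketch in Rust. It is not lexgen-generated code: `Lexer`, `Token`, `LexerState`, `switch_state` and the two driver functions are all hypothetical names made up for illustration, and a small recursive-descent driver stands in for the LR(1)/LALR(1) parser. The point is only that the lexer never counts delimiters; the driver calls `switch_state` once it knows a } closed an interpolation.

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, Clone, Copy, PartialEq)]
enum LexerState { Init, String }

#[derive(Debug, Clone, PartialEq)]
enum Token {
    StringStart, StringChunk(String), InterpolationStart, StringEnd,
    Ident(String), LParen, RParen, LBrace, RBrace, Other(char),
}

struct Lexer<'a> { chars: Peekable<Chars<'a>>, state: LexerState }

impl<'a> Lexer<'a> {
    fn new(input: &'a str) -> Self {
        Lexer { chars: input.chars().peekable(), state: LexerState::Init }
    }

    // Hypothetical public hook: lets the use site (the parser) set the state.
    pub fn switch_state(&mut self, state: LexerState) { self.state = state; }

    fn next_token(&mut self) -> Option<Token> {
        match self.state {
            LexerState::Init => self.next_init(),
            LexerState::String => self.next_string(),
        }
    }

    // Initial state: ordinary tokens. Note that `}` is always a plain RBrace.
    fn next_init(&mut self) -> Option<Token> {
        while self.chars.peek().map_or(false, |c| c.is_whitespace()) {
            self.chars.next();
        }
        let c = self.chars.next()?;
        Some(match c {
            '"' => { self.state = LexerState::String; Token::StringStart }
            '(' => Token::LParen,
            ')' => Token::RParen,
            '{' => Token::LBrace,
            '}' => Token::RBrace,
            c if c.is_alphanumeric() || c == '_' => {
                let mut s = String::from(c);
                while let Some(&n) = self.chars.peek() {
                    if !(n.is_alphanumeric() || n == '_') { break; }
                    s.push(n);
                    self.chars.next();
                }
                Token::Ident(s)
            }
            c => Token::Other(c),
        })
    }

    // String state: literal chunks, `${`, or the closing quote.
    fn next_string(&mut self) -> Option<Token> {
        let c = self.chars.next()?;
        let starts_interpolation = c == '$' && self.chars.peek() == Some(&'{');
        Some(if c == '"' {
            self.state = LexerState::Init;
            Token::StringEnd
        } else if starts_interpolation {
            self.chars.next(); // consume `{`
            self.state = LexerState::Init;
            Token::InterpolationStart
        } else {
            let mut s = String::from(c);
            while let Some(&n) = self.chars.peek() {
                if n == '"' || n == '$' { break; }
                s.push(n);
                self.chars.next();
            }
            Token::StringChunk(s)
        })
    }
}

// Stand-in for the parser: it tracks nesting, so it knows which `}` matches.
fn parse_tokens(lexer: &mut Lexer, out: &mut Vec<Token>, close: Option<Token>) {
    while let Some(tok) = lexer.next_token() {
        out.push(tok.clone());
        if Some(&tok) == close.as_ref() { return; }
        match tok {
            Token::StringStart => parse_string(lexer, out),
            Token::LParen => parse_tokens(lexer, out, Some(Token::RParen)),
            Token::LBrace => parse_tokens(lexer, out, Some(Token::RBrace)),
            _ => {}
        }
    }
}

fn parse_string(lexer: &mut Lexer, out: &mut Vec<Token>) {
    while let Some(tok) = lexer.next_token() {
        out.push(tok.clone());
        match tok {
            Token::StringEnd => return,
            Token::InterpolationStart => {
                parse_tokens(lexer, out, Some(Token::RBrace));
                // The interpolation is finished: tell the lexer to resume the string.
                lexer.switch_state(LexerState::String);
            }
            _ => {}
        }
    }
}

fn tokenize(input: &str) -> Vec<Token> {
    let mut lexer = Lexer::new(input);
    let mut out = Vec::new();
    parse_tokens(&mut lexer, &mut out, None);
    out
}
```

With this sketch, `tokenize` on the input "asdf ${ f( {  } ) } asdf" (including the quotes) yields StringStart, StringChunk("asdf "), InterpolationStart, Ident("f"), LParen, LBrace, RBrace, RParen, RBrace, StringChunk(" asdf"), StringEnd. The driver pulls tokens one at a time; a real LR parser with lookahead would need to make the switch before the lexer is asked for the token after the closing }.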

Why not keep track of the nesting level in a lexer state?

This requires the lexer to know too much about the structure of the parsed format, as it would need to keep track of all nesting of parens, brackets, etc.

For example:

"asdf ${ f( {  } ) } asdf"

Here the first } does not terminate the interpolation. The lexer needs to know about parens (and other delimiters) so that it won't go back to string lexing after the }.

The parser already maintains the full structure, so updating the lexer state when an interpolation is finished is no extra work for the parser.

A problem with this approach to tokenizing interpolations is that the lexer can no longer tokenize full files by itself. One may accept this as the price of this syntax, or make the lexer keep track of the delimiters and nesting.
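
For comparison, the second option (a lexer that tracks nesting itself) could look roughly like the sketch below. Everything here is hypothetical: `Tok`, `Mode` and `tokenize_standalone` are illustrative names, not lexgen API. This version only counts braces, which is enough to find the closing } of an interpolation when the input is well balanced; a lexer for a real language may need to track more than that, which is the coupling described above.

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    StringStart, StringEnd, Chunk(String),
    InterpolationStart, InterpolationEnd,
    LBrace, RBrace, Other(char),
}

#[derive(Clone, Copy)]
enum Mode {
    Code,                        // top-level code
    Str,                         // inside a string literal
    Interp { open_braces: u32 }, // inside `${ ... }`, counting unmatched `{`
}

fn tokenize_standalone(input: &str) -> Vec<Tok> {
    let mut chars = input.chars().peekable();
    // Stack of modes: strings and interpolations can nest arbitrarily.
    let mut modes = vec![Mode::Code];
    let mut out = Vec::new();
    while let Some(c) = chars.next() {
        let next_is_brace = chars.peek() == Some(&'{');
        match (modes.last().copied().unwrap_or(Mode::Code), c) {
            (Mode::Str, '"') => { modes.pop(); out.push(Tok::StringEnd); }
            (Mode::Str, '$') if next_is_brace => {
                chars.next(); // consume `{`
                modes.push(Mode::Interp { open_braces: 0 });
                out.push(Tok::InterpolationStart);
            }
            (Mode::Str, _) => {
                let mut s = String::from(c);
                while let Some(&n) = chars.peek() {
                    if n == '"' || n == '$' { break; }
                    s.push(n);
                    chars.next();
                }
                out.push(Tok::Chunk(s));
            }
            // Everything below runs in Code or Interp mode.
            (_, c) if c.is_whitespace() => {}
            (_, '"') => { modes.push(Mode::Str); out.push(Tok::StringStart); }
            (_, '{') => {
                if let Some(Mode::Interp { open_braces }) = modes.last_mut() {
                    *open_braces += 1;
                }
                out.push(Tok::LBrace);
            }
            // No unmatched `{` left: this `}` closes the interpolation.
            (Mode::Interp { open_braces: 0 }, '}') => {
                modes.pop();
                out.push(Tok::InterpolationEnd);
            }
            (_, '}') => {
                if let Some(Mode::Interp { open_braces }) = modes.last_mut() {
                    *open_braces -= 1;
                }
                out.push(Tok::RBrace);
            }
            (_, c) => out.push(Tok::Other(c)),
        }
    }
    out
}
```

This lexer can tokenize a whole file on its own and can even emit a dedicated InterpolationEnd token, at the cost of duplicating part of the parser's knowledge about nesting.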
