Copyright (C) 2022 Bryan A. Jones.

This file is part of the CodeChat Editor.

The CodeChat Editor is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

The CodeChat Editor is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with the CodeChat Editor. If not, see https://www.gnu.org/licenses/.

Lexer walkthrough

This walkthrough shows how the lexer parses the following Python code fragment:

print("""¶
# This is not a comment! It's a multi-line string.¶
"""
# This is a comment.

Paragraph marks (the ¶ character) are included to show how the lexer handles newlines. To explain the operation of the lexer, each step below shows its state in two pieces: the current code block, defined by source_code[current_code_block_index..source_code_unlexed_index], and the unlexed source code, defined by source_code[source_code_unlexed_index..]. In the state listings below, the current code block is enclosed in «double angle quotation marks»; everything after the closing » is the unlexed source code. Code that is classified by the lexer is placed in the classified_code array.
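
For concreteness, here is a minimal Rust sketch of this state. The CodeDocBlock fields follow the listing at the end of this walkthrough; the actual types in the CodeChat Editor source may be organized differently.

```rust
// A simplified sketch of the lexer state described above; names follow the
// walkthrough, not necessarily the actual CodeChat Editor source.
#[derive(Debug)]
#[allow(dead_code)]
struct CodeDocBlock {
    indent: String,    // whitespace before the comment delimiter (doc blocks only)
    delimiter: String, // the comment delimiter, e.g. "#"; empty for code blocks
    contents: String,  // the code, or the text of the comment
}

fn main() {
    let source_code = "print(\"\"\"\n# This is not a comment! It's a multi-line string.\n\"\"\")\n  # This is a comment.";
    // At the start of the parse, both indices are 0 and nothing is classified.
    let current_code_block_index = 0;
    let source_code_unlexed_index = 0;
    let classified_code: Vec<CodeDocBlock> = Vec::new();

    // The two regions the walkthrough tracks.
    let current_code_block = &source_code[current_code_block_index..source_code_unlexed_index];
    let unlexed = &source_code[source_code_unlexed_index..];
    println!("current code block: {current_code_block:?}");
    println!("unlexed source code: {unlexed:?}");
    println!("classified blocks: {}", classified_code.len());
}
```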

Start of parse

The unlexed source code holds all of the code; the current code block is empty, so no « » span appears yet.

print("""¶
# This is not a comment! It's a multi-line string.¶
"""
  # This is a comment.

classified_code = [
]

Search for a token

The lexer begins by searching the unlexed source code using the regex stored in language_lexer_compiled.next_token, which is (\#)|(""")|(''')|(")|('). The first token found is """. Everything up to the match is moved from the unlexed source code to the current code block, giving:

print("""¶
# This is not a comment! It's a multi-line string.¶
"""
  # This is a comment.

classified_code = [
]
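
As a rough illustration of this search, here is a minimal, self-contained sketch using the regex crate. The pattern is the next_token regex quoted above; the variable names and surrounding code are illustrative only, not the CodeChat Editor's actual implementation.

```rust
use regex::Regex;

fn main() {
    // The unlexed source code at the start of the parse.
    let unlexed = "print(\"\"\"\n# This is not a comment! It's a multi-line string.\n\"\"\")\n  # This is a comment.";
    // The next_token regex from the walkthrough.
    let next_token = Regex::new(r#"(\#)|(""")|(''')|(")|(')"#).unwrap();

    let caps = next_token.captures(unlexed).expect("a token should be found");
    let token = caps.get(0).unwrap();
    // The text before the match joins the current code block; the matched
    // delimiter itself is handled by the next step, based on which group matched.
    let moved_to_code_block = &unlexed[..token.start()];
    let matched_group = (1..=5).find(|&g| caps.get(g).is_some()).unwrap();
    println!("token {:?} matched group {matched_group}", token.as_str());
    println!("moved to the current code block: {moved_to_code_block:?}");
}
```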

String processing

The regex is accompanied by a map named language_lexer_compiled.map, which connects each capture group to the kind of token it matches (see struct RegexDelimType):

Regex:      (\#)           |  (""")  |  (''')  |  (")    |  (')
Mapping:    Inline comment    String    String    String    String
Group:      1                 2         3         4         5

Since group 2 matched, looking up this group in the map tells the lexer it's a string and also provides a regex which locates the end of the string. Applying this regex moves the string from the (unclassified) source code to the (classified) current code block, correctly skipping what looks like a comment but is not one. After this step, the lexer's state is:

print("""¶
# This is not a comment! It's a multi-line string.¶
"""
  # This is a comment.

classified_code = [
]
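
A sketch of how such a group-to-delimiter map and a string-end regex might fit together is shown below. The RegexDelimType variants and the closing-delimiter patterns here are simplified guesses (they ignore escape sequences, for example), not the definitions in the CodeChat Editor source.

```rust
use regex::Regex;

// Illustrative only: a map with one entry per capture group of next_token.
enum RegexDelimType {
    InlineComment,
    // For strings, the regex locates the closing delimiter.
    String(Regex),
}

fn main() {
    let map: Vec<RegexDelimType> = vec![
        RegexDelimType::InlineComment,                                // group 1: #
        RegexDelimType::String(Regex::new(r#"(?s).*?""""#).unwrap()), // group 2: """
        RegexDelimType::String(Regex::new(r"(?s).*?'''").unwrap()),   // group 3: '''
        RegexDelimType::String(Regex::new(r#"[^"\n]*""#).unwrap()),   // group 4: "
        RegexDelimType::String(Regex::new(r"[^'\n]*'").unwrap()),     // group 5: '
    ];

    // The unlexed text immediately after the opening """ found in the previous step.
    let after_open = "\n# This is not a comment! It's a multi-line string.\n\"\"\")\n  # This is a comment.";
    // Group 2 matched, so use its map entry (index 1) to find the string's end.
    if let RegexDelimType::String(string_end) = &map[1] {
        let end = string_end.find(after_open).expect("unterminated string").end();
        println!("string contents through the closing delimiter: {:?}", &after_open[..end]);
        println!("still unlexed: {:?}", &after_open[end..]);
    }
}
```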

Search for a token (second time)

Now the lexer is back to looking through code (as opposed to looking inside a string, comment, etc.). It uses the next_token regex as before to find the next token, #, and moves all the preceding characters from the unlexed source code to the current code block. The lexer's state is now:

print("""¶
# This is not a comment! It's a multi-line string.¶
"""
  # This is a comment.

classified_code = [
]
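
The same search, now starting from the unlexed text left over after the string, might look like this rough sketch (again, the names are illustrative):

```rust
use regex::Regex;

fn main() {
    // The source code still unlexed after the multi-line string was consumed.
    let unlexed = ")\n  # This is a comment.";
    let next_token = Regex::new(r#"(\#)|(""")|(''')|(")|(')"#).unwrap();

    let caps = next_token.captures(unlexed).unwrap();
    let token = caps.get(0).unwrap();
    // This time group 1 matches, which the map identifies as an inline comment;
    // the ")\n  " before it joins the current code block.
    let group = (1..=5).find(|&g| caps.get(g).is_some()).unwrap();
    println!("token {:?} from group {group}", token.as_str());
    println!("moved to the current code block: {:?}", &unlexed[..token.start()]);
}
```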

Inline comment lex

Based on the map, the lexer identifies this token as an inline comment. The inline comment lexer first finds the end of the comment (the next newline or, as in this case, the end of the file), putting the entire inline comment except for the opening delimiter # into full_comment. It then splits the current code block into two pieces: code_lines_before_comment (the lines in the current code block which come before the current line) and comment_line_prefix (the current line up to the start of the comment). After this step, the relevant pieces are:

print("""¶
# This is not a comment! It's a multi-line string.¶
"""
  # This is a comment.

classified_code = [
]
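
Here is a minimal sketch of that split, assuming it can be done with a simple search for the last newline in the current code block; the real implementation may differ.

```rust
fn main() {
    // The current code block after the previous step (everything before the #)
    // and the comment text after the # delimiter (which runs to the end of the file).
    let current_code_block = "print(\"\"\"\n# This is not a comment! It's a multi-line string.\n\"\"\")\n  ";
    let full_comment = " This is a comment.";

    // Split at the start of the current (last) line: the earlier lines become
    // code_lines_before_comment, the rest is the comment_line_prefix.
    let split_at = current_code_block.rfind('\n').map_or(0, |i| i + 1);
    let (code_lines_before_comment, comment_line_prefix) = current_code_block.split_at(split_at);

    println!("code_lines_before_comment: {code_lines_before_comment:?}");
    println!("comment_line_prefix: {comment_line_prefix:?}");
    println!("full_comment: {full_comment:?}");
}
```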

Code/doc block classification

Because comment_line_prefix contains only whitespace and full_comment begins with a space (the space following the comment delimiter), the lexer classifies this as a doc block. It adds code_lines_before_comment to classified_code as a code block, then adds the text of the comment as a doc block:

classified_code = [
  Item 0 = CodeDocBlock {
    indent: "", delimiter: "", contents = "print("""¶
# This is not a comment! It's a multi-line string.¶
""")¶
"},
  Item 1 = CodeDocBlock {
    indent: "  ", delimiter: "#", contents = "This is a comment"
  },
]
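
A sketch of the decision described above, assuming it reduces to the two checks named in the text (a whitespace-only prefix and a space after the delimiter):

```rust
// Illustrative only: a comment becomes a doc block when nothing but whitespace
// precedes it on its line and a space follows the comment delimiter.
fn is_doc_block(comment_line_prefix: &str, full_comment: &str) -> bool {
    comment_line_prefix.chars().all(char::is_whitespace) && full_comment.starts_with(' ')
}

fn main() {
    // The values from this walkthrough produce a doc block...
    assert!(is_doc_block("  ", " This is a comment."));
    // ...while a comment preceded by code on the same line stays in a code block.
    assert!(!is_doc_block("x = 1  ", " a trailing comment"));
    println!("classification checks passed");
}
```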

Done

After this, the unlexed source code is empty, since classifying the inline comment moved the remainder of the source into classified_code. The function exits.