Track tokens more accurately, and compactly. #898

Merged
merged 13 commits into master from pre-save-tokens on Sep 30, 2023

Conversation

@dvander (Member) commented on Sep 30, 2023

This is a big refactoring of how tokens are tracked throughout the compiler.

Rather than carry explicit file and column information on every token, we now track a 32-bit encoded location id. This id can be decoded back to a specific file, line, and column. It's much more compact, and it supports correct tracking of macro-expanded tokens (which can cross file boundaries). It can also track the complete #include lineage of a token.
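For illustration only (this is not spcomp's actual code, and every name below is hypothetical), an encoding like this can be built by giving each source file or macro expansion a contiguous range of the 32-bit id space, so a location id is just a stream's base id plus a byte offset:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of a 32-bit location id scheme. Each registered text
// stream (source file or macro expansion) owns a contiguous id range, so an
// id encodes "stream base + byte offset" and decodes by binary search.
class LocationIndex {
  public:
    // Register a stream of |length| bytes; returns its base id.
    uint32_t Register(uint32_t length) {
        bases_.push_back(next_);
        next_ += length;
        return bases_.back();
    }
    // Decode an id into (stream index, byte offset within that stream).
    // Assumes |id| is valid: non-zero and within a registered range.
    std::pair<size_t, uint32_t> Decode(uint32_t id) const {
        size_t lo = 0, hi = bases_.size();
        while (lo + 1 < hi) {
            size_t mid = (lo + hi) / 2;
            if (bases_[mid] <= id)
                lo = mid;
            else
                hi = mid;
        }
        return {lo, id - bases_[lo]};
    }
  private:
    std::vector<uint32_t> bases_;
    uint32_t next_ = 1;  // id 0 is reserved as "invalid location"
};
```

From the decoded (stream, offset) pair, the file, line, and column can then be recovered from per-stream data.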

Gradually, the error reporter and AST will migrate to the new location id. The lexer still needs token_pos_t, which tracks the decoded line number. Decoding line numbers from a location id is very expensive, and the hottest function in the lexer is peek_same_line(), which relies on this information.
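To make that cost concrete, here is a hedged sketch (hypothetical names, not the actual lexer code) of the kind of lookup a line decode implies: a binary search over a per-file table of newline offsets.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: decoding a byte offset to a 1-based line number by
// binary-searching a per-file table of newline positions. Even at O(log n)
// per lookup, paying this for every token is far slower than reading a
// line number the lexer already cached.
struct FileLines {
    std::vector<uint32_t> newline_offsets;  // offset of each '\n' in the text

    explicit FileLines(const std::string& text) {
        for (uint32_t i = 0; i < text.size(); i++) {
            if (text[i] == '\n')
                newline_offsets.push_back(i);
        }
    }
    uint32_t LineOf(uint32_t offset) const {
        auto it = std::upper_bound(newline_offsets.begin(),
                                   newline_offsets.end(), offset);
        return uint32_t(it - newline_offsets.begin()) + 1;
    }
};
```

Caching the decoded line in token_pos_t turns that per-token search into a plain integer compare, which is all peek_same_line() actually needs.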

The motivation for this is twofold. First, it makes tokens and the AST more compact, which means we can retain more in memory; the AST is enormous due to the size of #includes. Second, being able to save and replay tokens is an important feature for future work, and reducing the size of each token will reduce peak memory pressure.

The accuracy of location ids will also let us retain a rather weird feature of spcomp: tab/space linting. The way it's implemented now requires tight coupling between the lexer state and parser, which makes it hard to save and replay token streams. Now that location ids can provide indexes into the source text, we can lint no matter what state the lexer is in.
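As a rough illustration (a hypothetical helper, not the shipped linter), once a token's location decodes to a byte offset into its source text, a whitespace check needs nothing from the lexer:

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: given the decoded (text, offset) of a token, the
// linter can inspect the indentation of the token's line on its own, no
// matter when the token is replayed or what state the lexer is in.
bool HasMixedIndent(const std::string& text, uint32_t offset) {
    // Rewind to the start of the line containing the token.
    uint32_t begin = offset;
    while (begin > 0 && text[begin - 1] != '\n')
        begin--;
    // Scan the leading whitespace for both tabs and spaces.
    bool saw_tab = false, saw_space = false;
    for (uint32_t i = begin; i < text.size(); i++) {
        if (text[i] == '\t')
            saw_tab = true;
        else if (text[i] == ' ')
            saw_space = true;
        else
            break;
    }
    return saw_tab && saw_space;
}
```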

Finally, this paves the way for clang-style error messages. Not implemented here, but, you know, it's doable now.

This merge bumps the minimum C++ standard to C++17 and requires std::filesystem support. It also raises the macOS requirement to 10.15.

This might introduce warnings or errors in very obscure cases, but nothing obvious showed up in the corpus. We can fix it properly if something does come up. It's not clear to me that it's needed, because peek_same_line() does what tTERM intends to do.

…n, not fline. fline can be stale when tokens are cached.

This also fixes a bug where passing a custom tests dir did not work.

This shaves four bytes off each token.

This was not really used anywhere, and saves another 12 bytes of memory on each token.

This has long been deprecated in favor of report(), which is type-safe.

This simplifies tracking source positions on symbols by using the same machinery as the lexer.

This is completely unused, and will be derived from SourceLocations instead later.
@dvander force-pushed the pre-save-tokens branch 3 times, most recently from 627b32b to 3f6ff09 on September 30, 2023 03:10
This replaces the "file" member of token_pos_t with a SourceLocation. A
source location is an index plus offset into a list of text streams
representing source files and macro expansions.

In theory, SourceLocation could fully replace token_pos_t. However,
decoding a SourceLocation back to a line number is very expensive, and
we need the line number extremely frequently. During lexing, we must
retain the line number in addition to the source location.

There are a few major benefits of this approach, versus the old "file,
line, col" tuple. First, SourceLocations are tiny. We can use them in
way more places, which makes the AST easier to annotate. Second, they
are way more accurate. We can now trace a token back to the macro that
defined it, and where that macro was defined, and what line included
that file.
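A minimal sketch of that lineage tracking, assuming a hypothetical layout rather than the real data structures: each text stream remembers which stream introduced it and at what line, and the chain is recovered by walking parent links.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sketch: each text stream (a source file or macro expansion)
// records which stream introduced it, so the full #include / macro lineage
// of a token can be recovered by walking parent links.
struct TextStream {
    std::string name;  // file name or macro name
    int parent;        // index of the introducing stream, or -1 for the root
    int parent_line;   // line in the parent where the #include/expansion sits
};

void PrintLineage(const std::vector<TextStream>& streams, int index) {
    for (int i = index; streams[i].parent >= 0; i = streams[i].parent) {
        std::printf("  from %s:%d\n",
                    streams[streams[i].parent].name.c_str(),
                    streams[i].parent_line);
    }
}

int main() {
    std::vector<TextStream> streams = {
        {"main.sp", -1, 0},    // root file
        {"header.inc", 0, 3},  // included on line 3 of main.sp
        {"macro FOO", 1, 12},  // expanded on line 12 of header.inc
    };
    PrintLineage(streams, 2);  // lineage of a token produced by FOO
}
```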

But most importantly, this is a precursor to caching tokens. The
built-in lint checker needs to convert a token position back to the
source text stream, which was not easily possible in the old model.

NOTE: This bumps the minimum C++ standard to C++17, which means we now
require macOS 10.15 or higher, and Visual Studio 2017 or higher.
@dvander merged commit b1287bf into master on Sep 30, 2023
@dvander deleted the pre-save-tokens branch on September 30, 2023 03:43