Track tokens more accurately, and compactly. #898

Merged
merged 13 commits into master from pre-save-tokens on Sep 30, 2023

Conversation

@dvander (Member) commented on Sep 30, 2023

This is a big refactoring of how tokens are tracked throughout the compiler.

Rather than carry explicit file and column information on every token, we now track a 32-bit encoded location id. This id can be decoded back to a specific file, line, and column. It's much more compact, and it supports correct tracking of macro-expanded tokens (which can cross file boundaries). It can also track the complete #include lineage of a token.
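For illustration only (this is not spcomp's actual code, and every name below is hypothetical), an encoding like this can be built by giving each source file or macro expansion a contiguous range of the 32-bit id space, so a location id is just a stream's base id plus a byte offset:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of a 32-bit location id scheme. Each registered text
// stream (source file or macro expansion) owns a contiguous id range, so an
// id encodes "stream base + byte offset" and decodes by binary search.
class LocationIndex {
  public:
    // Register a stream of |length| bytes; returns its base id.
    uint32_t Register(uint32_t length) {
        bases_.push_back(next_);
        next_ += length;
        return bases_.back();
    }
    // Decode an id into (stream index, byte offset within that stream).
    // Assumes |id| is valid: non-zero and within a registered range.
    std::pair<size_t, uint32_t> Decode(uint32_t id) const {
        size_t lo = 0, hi = bases_.size();
        while (lo + 1 < hi) {
            size_t mid = (lo + hi) / 2;
            if (bases_[mid] <= id)
                lo = mid;
            else
                hi = mid;
        }
        return {lo, id - bases_[lo]};
    }
  private:
    std::vector<uint32_t> bases_;
    uint32_t next_ = 1;  // id 0 is reserved as "invalid location"
};
```

From the decoded (stream, offset) pair, the file, line, and column can then be recovered from per-stream data.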

Gradually, the error reporter and AST will migrate to the new location id. The lexer still needs token_pos_t, which tracks the decoded line number. Decoding line numbers from a location id is very expensive, and the hottest function in the lexer is peek_same_line(), which relies on this information.
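To make that cost concrete, here is a hedged sketch (hypothetical names, not the actual lexer code) of the kind of lookup a line decode implies: a binary search over a per-file table of newline offsets.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: decoding a byte offset to a 1-based line number by
// binary-searching a per-file table of newline positions. Even at O(log n)
// per lookup, paying this for every token is far slower than reading a
// line number the lexer already cached.
struct FileLines {
    std::vector<uint32_t> newline_offsets;  // offset of each '\n' in the text

    explicit FileLines(const std::string& text) {
        for (uint32_t i = 0; i < text.size(); i++) {
            if (text[i] == '\n')
                newline_offsets.push_back(i);
        }
    }
    uint32_t LineOf(uint32_t offset) const {
        auto it = std::upper_bound(newline_offsets.begin(),
                                   newline_offsets.end(), offset);
        return uint32_t(it - newline_offsets.begin()) + 1;
    }
};
```

Caching the decoded line in token_pos_t turns that per-token search into a plain integer compare, which is all peek_same_line() actually needs.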

The motivation for this is twofold. First, it makes tokens and the AST more compact, which means we can retain more in memory; the AST is enormous due to the size of #includes. Second, being able to save and replay tokens is an important feature for future work, and reducing the size of each token will reduce peak memory pressure.

The accuracy of location ids will also let us retain a rather weird feature of spcomp: tab/space linting. The way it's implemented now requires tight coupling between the lexer state and parser, which makes it hard to save and replay token streams. Now that location ids can provide indexes into the source text, we can lint no matter what state the lexer is in.
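As a rough illustration (a hypothetical helper, not the shipped linter), once a token's location decodes to a byte offset into its source text, a whitespace check needs nothing from the lexer:

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: given the decoded (text, offset) of a token, the
// linter can inspect the indentation of the token's line on its own, no
// matter when the token is replayed or what state the lexer is in.
bool HasMixedIndent(const std::string& text, uint32_t offset) {
    // Rewind to the start of the line containing the token.
    uint32_t begin = offset;
    while (begin > 0 && text[begin - 1] != '\n')
        begin--;
    // Scan the leading whitespace for both tabs and spaces.
    bool saw_tab = false, saw_space = false;
    for (uint32_t i = begin; i < text.size(); i++) {
        if (text[i] == '\t')
            saw_tab = true;
        else if (text[i] == ' ')
            saw_space = true;
        else
            break;
    }
    return saw_tab && saw_space;
}
```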

Finally, this paves the way for clang-style error messages. Not implemented here, but, you know, it's doable now.

This merge bumps the minimum C++ standard to C++17 and requires std::filesystem support. It also raises the macOS requirement to 10.15.

This might introduce warnings or errors in very obscure cases, but nothing obvious showed up in the corpus. We can fix it properly if something does come up. It's not clear to me that it's needed, because peek_same_line() does what tTERM intends to do.

…n, not fline. fline can be stale when tokens are cached.

This also fixes a bug where passing a custom tests dir did not work.

This shaves four bytes off each token.

This was not really used anywhere, and saves another 12 bytes of memory on each token.

This has long been deprecated in favor of report(), which is type-safe.

This simplifies tracking source positions on symbols by using the same machinery as the lexer.

This is completely unused, and will be derived from SourceLocations instead later.
@dvander force-pushed the pre-save-tokens branch 3 times, most recently from 627b32b to 3f6ff09 on September 30, 2023 03:10
This replaces the "file" member of token_pos_t with a SourceLocation. A
source location is an index plus offset into a list of text streams
representing source files and macro expansions.

In theory, SourceLocation could fully replace token_pos_t. However,
decoding a SourceLocation back to a line number is very expensive, and
we need the line number extremely frequently. During lexing, we must
retain the line number in addition to the source location.

There are a few major benefits of this approach, versus the old "file,
line, col" tuple. First, SourceLocations are tiny. We can use them in
way more places, which makes the AST easier to annotate. Second, they
are way more accurate. We can now trace a token back to the macro that
defined it, and where that macro was defined, and what line included
that file.
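A minimal sketch of that lineage tracking, assuming a hypothetical layout rather than the real data structures: each text stream remembers which stream introduced it and at what line, and the chain is recovered by walking parent links.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sketch: each text stream (a source file or macro expansion)
// records which stream introduced it, so the full #include / macro lineage
// of a token can be recovered by walking parent links.
struct TextStream {
    std::string name;  // file name or macro name
    int parent;        // index of the introducing stream, or -1 for the root
    int parent_line;   // line in the parent where the #include/expansion sits
};

void PrintLineage(const std::vector<TextStream>& streams, int index) {
    for (int i = index; streams[i].parent >= 0; i = streams[i].parent) {
        std::printf("  from %s:%d\n",
                    streams[streams[i].parent].name.c_str(),
                    streams[i].parent_line);
    }
}

int main() {
    std::vector<TextStream> streams = {
        {"main.sp", -1, 0},    // root file
        {"header.inc", 0, 3},  // included on line 3 of main.sp
        {"macro FOO", 1, 12},  // expanded on line 12 of header.inc
    };
    PrintLineage(streams, 2);  // lineage of a token produced by FOO
}
```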

But most importantly, this is a precursor to caching tokens. The
built-in lint checker needs to convert a token position back to the
source text stream, which was not easily possible in the old model.

NOTE: This bumps the minimum C++ standard to C++17, which means we now
require macOS 10.15 or higher, and Visual Studio 2017 or higher.
@dvander merged commit b1287bf into master on Sep 30, 2023
@dvander deleted the pre-save-tokens branch on September 30, 2023 03:43