Track tokens more accurately, and compactly. #898
Merged
Conversation
This might introduce warnings or errors in very obscure cases, but nothing obvious showed up in the corpus. We can fix it properly if something does come up. It's not clear to me that it's needed, because peek_same_line() does what tTERM intends to do.
…n, not fline. fline can be stale when tokens are cached.
This also fixes a bug where passing a custom tests dir did not work.
This shaves four bytes off each token.
This was not really used anywhere; removing it saves another 12 bytes of memory per token.
This has long been deprecated in favor of report(), which is type-safe.
This simplifies tracking source positions on symbols by using the same machinery as the lexer.
This is completely unused, and will be derived from SourceLocations instead later.
This replaces the "file" member of token_pos_t with a SourceLocation. A source location is an index plus offset into a list of text streams representing source files and macro expansions.

In theory, SourceLocation could fully replace token_pos_t. However, decoding a SourceLocation back to a line number is very expensive, and we need the line number extremely frequently. During lexing, we must retain the line number in addition to the source location.

There are a few major benefits of this approach versus the old "file, line, col" tuple. First, SourceLocations are tiny. We can use them in way more places, which makes the AST easier to annotate. Second, they are way more accurate. We can now trace a token back to the macro that defined it, and where that macro was defined, and what line included that file. But most importantly, this is a precursor to caching tokens. The built-in lint checker needs to convert a token position back to the source text stream, which was not easily possible in the old model.

NOTE: This bumps the minimum C++ standard to C++17, which means we now require macOS 10.15 or higher, and Visual Studio 2017 or higher.
This is a big refactoring of how tokens are tracked throughout the compiler.
Rather than carry explicit file and column information on every token, we now track a 32-bit encoded location id. This id can be decoded back to a specific file, column, and line. It's much more compact, and it supports correct tracking of macro expanded tokens (which can cross file boundaries). It can also track the complete #include lineage of a token.
Gradually, the error reporter and AST will migrate to the new location id. The lexer still needs token_pos_t, which tracks the decoded line number. Decoding line numbers from a location id is very expensive, and the hottest function in the lexer is peek_same_line() which relies on this information.
The motivation for this is twofold. First, it makes tokens and the AST more compact, which means we can retain more in memory; the AST is enormous due to the size of #includes, and shrinking each token will reduce peak memory pressure. Second, being able to save and replay tokens is an important feature for future work.
The accuracy of location ids will also let us retain a rather weird feature of spcomp: tab/space linting. The way it's implemented now requires tight coupling between the lexer state and parser, which makes it hard to save and replay token streams. Now that location ids can provide indexes into the source text, we can lint no matter what state the lexer is in.
Finally, this paves the way for clang style error messages. Not implemented here, but, you know, it's doable now.
This merge will bump the minimum C++ standard to C++17 and requires std::filesystem support. This also bumps macOS requirements up to 10.15.