Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: track whitespace #1135

Closed
wants to merge 4 commits into from
Closed

Conversation

arieldon
Copy link
Contributor

This set of commits addresses issues #282 and #1113 by tracking whitespace and adding a few new diagnostics. quick-lint will print warnings about mismatched curly brace indentation, if/else indentation, and misleading indentation after a braceless if.

Commit 2379dea modifies a couple existing tests by adding whitespace to them to avoid triggering Diag_Mismatched_Curly_Indentation.

@arieldon arieldon changed the title Track Whitespace feat: track whitespace Dec 26, 2023
Copy link
Collaborator

@strager strager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this issue. However, I am concerned about false positives (i.e. diagnostics on correct code). See my comments marked 'must fix'.

Char8 c = input[0];
unsigned char c0 = static_cast<unsigned char>(input[0]);
unsigned char c1 = static_cast<unsigned char>(input[1]);
unsigned char c2 = static_cast<unsigned char>(input[2]);
if (c == ' ' || c == '\t' || c == '\f' || c == '\v') {
this->indent_level_ += this->increasing_indent_;
Copy link
Collaborator

@strager strager Dec 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like tracking indent level like this for a few reasons:

  • I don't like introducing another addition to this hot loop. This is possibly a performance problem.
  • It is bug prone. For example, your patch does not treat non-ASCII spaces (such as Ogham Space Mark on line 1882) as indentation.
  • This does not account for differences between tabs and spaces.

I think a better approach would be to store a pointer to the beginning of the line. reset_indent_level would then be the only thing doing new work in the lexer. However, to determine the indentation level, we would need a more sophisticated algorithm; it would re-scan the indentation and track tabs and spaces separately. The parser would invoke this algorithm rarely, so the complexities of the indentation algorithm shouldn't hurt performance much.

Perhaps my idea is bad. I haven't tried it. But storing the beginning of the line feels right to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not tried to implement different ways to track whitespace. I figured the addition would be fairly cheap even in a hot loop, but I didn't measure. I also increment this->indent_level in the Ogham Space Mark branch of skip_whitespace now -- thanks for catching that.

As for the difference between tabs and spaces, I assume we should let the programmer configure this? I guess we can try to autodetect this too. I don't mix tabs and spaces myself though, so I don't understand the intent of doing so.

src/quick-lint-js/fe/lex.cpp Show resolved Hide resolved
Comment on lines 1909 to 1912
this->diag_reporter_->report(Diag_Mismatched_Curly_Indentation{
.left_curly = left_curly.span(),
.right_curly = this->peek().span(),
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must fix: I am very skeptical of this diagnostic. Wonky indentation is not necessarily a problem. For example, quick-lint-js should not complain about the following code:

if (x) {
    y();
 } else {
    z();
 }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this diagnostic completely in this new set of patches. I replaced it with Diag_Unclosed_Code_Block_V2, which I think matches your intention in #282 much more closely.

It will not report any warnings for the snippet of code above. But it will report an error for the snippet of code in #282 as well as others obviously. See the tests for more examples.

Note, this new diagnostic does not catch something like this, where more code trails an expected but missing closing curly:

function f() {
  z;
  if (false) {
    z;
    if (1) {
      z;
    }

  if (true) {
  }
}

It will report that the curly brace after if (false) is missing its matching curly. However, it will suggest the matching rightly curly go at the end of the file instead of before if (true).

Comment on lines 4447 to 4462
if (!allow_several_statements &&
this->peek().indent_level > keyword_token.indent_level) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that the logic for this diagnostic is so simple! 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit more complicated now to allow a snippet like this, where else is more indented than if.

if (x)
    y();
 else
    z();

@@ -4468,6 +4482,14 @@ void Parser::parse_and_visit_if(Parse_Visitor_Base &v) {

parse_maybe_else:
if (this->peek().type == Token_Type::kw_else) {
Token else_token = this->peek();
if (this->peek().indent_level != if_token.indent_level) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must fix: I think this logic is too strict. The following code is fine, for example, but your code would report annoying diagnostics:

if (x) {
    y();
 } else {
    z();
 }

if (x)
    y();
 else
    z();

I think we should only trigger a warning about else if there's an nested if prior (or something). We shouldn't warn about all elses.

test/test-parse-function.cpp Outdated Show resolved Hide resolved
[[qljs::diag("E0443", Diagnostic_Severity::warning)]] //
[[qljs::message("misleading indentation after 'if' body",
ARG(if_or_else_span))]] //
Source_Code_Span if_or_else_span;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the wrong spot. We should point at the following statement or at the indentation before the following statement. The location of the if or else is supplementary information.

  if (cond)
      f();
      g();
//^^^^ warning: misleading indentation after 'if' body

src/quick-lint-js/diag/diagnostic-types-2.h Outdated Show resolved Hide resolved
test/test-parse-statement.cpp Outdated Show resolved Hide resolved
Comment on lines +612 to +619
u8"if(x)\n\tif (y)\n\t\tz();\nelse\n\tw();"_sv, //
u8" ^^ Diag_Misleading_Braceless_If_Else_Indentation.if_span\n"_diag //
u8" ^^^^ .else_span"_diag);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, though I find \t a bit hard to read. Perhaps two spaces is a good substitute:

Suggested change
u8"if(x)\n\tif (y)\n\t\tz();\nelse\n\tw();"_sv, //
u8" ^^ Diag_Misleading_Braceless_If_Else_Indentation.if_span\n"_diag //
u8" ^^^^ .else_span"_diag);
u8"if(x)\n if (y)\n z();\nelse\n w();"_sv, //
u8" ^^ Diag_Misleading_Braceless_If_Else_Indentation.if_span\n"_diag //
u8" ^^^^ .else_span"_diag);

Eh, maybe not. We don't have great testing for diagnostics with multi-line code samples...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept \t for now. I agree it's difficult to read, but I don't find spaces easier. For now, I added an unraveled version of the string above each test.

@arieldon
Copy link
Contributor Author

arieldon commented Jan 3, 2024

I addressed your "must fix" comments. The diagnostics are more flexible now. They do not necessarily report wonky indentation, but they still use indentation to try to track where to place a closing right curly brace. They also still report misleading indentation for if and else statements.

I have not tried your suggestion about changing the way we track indentation in Lexer yet. Diag_Misleading_If_Or_Else_Body_Indentation also still reports the if and not the actual misleading indentation.

@arieldon arieldon requested a review from strager January 3, 2024 16:10
Copy link
Collaborator

@strager strager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should focus on only one diagnostic for now (probably either Diag_Unclosed_Code_Block or Diag_Misleading_If_Or_Else_Body_Indentation). You can add the other diagnostics later in a separate pull request.

Must fix: The following code reports a false positive ("indentation of 'if' does not match 'else' [E0442]"):

if (true) {
  if (true)
    console.log();
//      console.log();
  else
    console.log();
}

I think the comment is significant. Deleting the comment fixes the false positive. (Indentation level of the comment doesn't matter.)

Similarly, there's a false negative in the following code (expected "misleading indentation after 'if' body [E0443]", but got nothing):

if (true) {
  if (true)
    console.log();
  //      console.log();
  //else
    console.log();
}

Comment on lines +2846 to +2849
[[qljs::message("indentation of '{0}' does not match '{1}'", ARG(if_span),
ARG(else_span))]] //
[[qljs::message("indentation of '{0}' does not match '{1}'", ARG(else_span),
ARG(if_span))]] //
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the user, this will look like two different diagnostics.

Nit: I would inline 'else' and 'if' to make the messages more legible to maintainers.

I also think that it's more likely that else is misindented than if, so we should point to the else first.

I think this reads better:

Suggested change
[[qljs::message("indentation of '{0}' does not match '{1}'", ARG(if_span),
ARG(else_span))]] //
[[qljs::message("indentation of '{0}' does not match '{1}'", ARG(else_span),
ARG(if_span))]] //
[[qljs::message("indentation of 'else' does not match indentation of 'if'", ARG(else_span))]] //
[[qljs::message("matching 'if' here", ARG(if_span))]] //

}
this->diag_reporter_->report(Diag_Unclosed_Code_Block_V2{
.block_open = block_open.span(),
.expected_block_close = Source_Code_Span::unit(block_close.begin),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the inaccuracy of expected_block_close in common situations such as the following:

function f() {
  if (true) {   // <-- matching '{' here [E0441]
      body();
  foo();
}               // <-- missing '}' [E0441]

I want quick-lint-js to not mislead.

I think we should omit it and only mention the opening {. In other words, lets use Diag_Unclosed_Code_Block until we have a smart enough algorithm to point to the correct place for }.

.block_open = left_curly_span,
case Token_Type::end_of_file: {
Token block_open, block_close;
if (this->mismatched_curly_braces_.has_value()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must fix: Because we don't reset mismatched_curly_braces_, we might end up with duplicate diagnostics. For example:

function a() {
  function f() {
    if (true) {    // <-- matching '{' here [E0441]; matching '{' here [E0441]
        body();
    foo();
}
/*
sandbox/hello3.mjs:6:1: error: missing '}' [E0441]
sandbox/hello3.mjs:3:15: note: matching '{' here [E0441]
sandbox/hello3.mjs:6:1: error: missing '}' [E0441]
sandbox/hello3.mjs:3:15: note: matching '{' here [E0441]
*/

Adding this->mismatched_curly_braces_.reset(); after reporting the diagnostic fixes this issue somewhat:

function a() {     // <-- matching '{' here [E0441]
  function f() {
    if (true) {    // <-- matching '{' here [E0441]
        body();
    foo();
}
/*
sandbox/hello3.mjs:6:1: error: missing '}' [E0441]
sandbox/hello3.mjs:3:15: note: matching '{' here [E0441]
sandbox/hello3.mjs:7:1: error: missing '}' [E0441]
sandbox/hello3.mjs:1:14: note: matching '{' here [E0441]
*/

I think a better solution is to maintain a stack of curlies instead of either zero or one pairs of curlies.

Comment on lines +4426 to +4427
bool previous_in_if_statement = this->in_if_statement_;
this->in_if_statement_ = true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must fix: Use Bool_Guard. (It's a bit of boilerplate, but the safety is worth it.) See Switch_Guard for example usage.

this->peek().indent_level > keyword_token.indent_level) {
this->diag_reporter_->report(Diag_Misleading_If_Or_Else_Body_Indentation{
.if_or_else_span = keyword_token.span()});
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code reports diagnostics for the following code:

// misleading indentation after 'if' body [E0443]
if (true) console.log();
 console.log();

function f() {
 if (true) console.log();
 // misleading indentation after 'if' body [E0443]
 else console.log()
  console.log();
}

Was this intentional? If so, I think it deserves a test to codify the behavior.

@@ -4468,6 +4496,15 @@ void Parser::parse_and_visit_if(Parse_Visitor_Base &v) {

parse_maybe_else:
if (this->peek().type == Token_Type::kw_else) {
Token else_token = this->peek();
if (previous_in_if_statement &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must fix: This detection of whether we're in a braceless if or not is faulty. For example, the following code triggers the diagnostic, but if you remove the outer if (first and last lines) there is no diagnostic:

if (true) {
   // indentation of 'if' does not match 'else' [E0442]
   if (true) {}
  else console.log()

}

Another example of faulty if detection:

if (true) {
  a.map((o) => {
         if (true) return 1;
    else if (true) return 2;
    else return 3;
  });
}

Comment on lines +621 to +627
// if (x)
// if (y)
// z();
// else
// w();
test_parse_and_visit_statement(
u8"if(x)\n\tif (y)\n\t\tz();\n\telse\n\t\tw();"_sv, no_diags);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the no_diag tests, we can split the string onto multiple source lines:

Suggested change
// if (x)
// if (y)
// z();
// else
// w();
test_parse_and_visit_statement(
u8"if(x)\n\tif (y)\n\t\tz();\n\telse\n\t\tw();"_sv, no_diags);
test_parse_and_visit_statement(
u8"if(x)\n"_sv
u8"\tif (y)\n"_sv
u8"\t\tz();\n"_sv
u8"\telse\n"_sv
u8"\t\tw();"_sv,
no_diags);

@arieldon
Copy link
Contributor Author

arieldon commented Jan 8, 2024

I'm not able to figure out a heuristic and an implementation for it that doesn't feel complex and hacky. I'm going to close this PR for now. Thank you for your suggestions and help nonetheless.

@arieldon arieldon closed this Jan 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants