Add option to output comments in the parser #239

PeterTillema · 2021-10-28T09:34:07Z

It would be very nice if there was a command-line option to also include the comments in the parser, rather than skipping them entirely. This is useful for code generation later (then you might prefer to keep the comments as well).

PeterTillema · 2021-10-29T13:21:29Z

Update: what I wanted already exists, namely parse_docstrings in miss_hit_core/m_parse_utils.py. Now I only need to know if single comments can be added somehow...

florianschanda · 2021-10-30T14:06:12Z

Hey. So the lexer includes comments (this is indeed what I use to pretty-print the code later). The parser ignores them. See https://github.com/florianschanda/miss_hit/blob/master/miss_hit_core/m_parser.py#L168

And I think all parsers I know of do that; and you already found the one special utility that checks for docstrings.
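
To make the split concrete, here is a minimal Python sketch of a parser front end that drops comment tokens as it fetches them. The `Token` class and the token kinds are hypothetical and are not the MISS_HIT data structures; the only point is that once the fetch step skips a comment, nothing downstream ever sees it.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Token:
    kind: str  # hypothetical kinds: "IDENTIFIER", "OPERATOR", "COMMENT", ...
    text: str
    line: int


def significant_tokens(tokens: Iterable[Token]) -> Iterator[Token]:
    """Yield only the tokens a parser would act on, skipping comments."""
    for tok in tokens:
        if tok.kind == "COMMENT":
            continue  # still present in the lexer output, just not passed on
        yield tok


# Hand-written token stream for:
#   x = potato + ... This is a comment
#       42;
stream = [
    Token("IDENTIFIER", "x", 1),
    Token("ASSIGNMENT", "=", 1),
    Token("IDENTIFIER", "potato", 1),
    Token("OPERATOR", "+", 1),
    Token("COMMENT", "This is a comment", 1),
    Token("NUMBER", "42", 2),
    Token("SEMICOLON", ";", 2),
]

seen_by_parser = [t.text for t in significant_tokens(stream)]
```

A pretty-printer can work from `stream` directly, which is why it can keep comments while the parse tree cannot.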

What you want is the comments in the parse tree, but it is not obvious how to do this. Bear in mind a comment can appear at any point, including in the middle of statements. Consider for example:

x = potato + ... This is a comment
    42;

What should the parse tree look like here? Right now we have:

(assign x (+ potato 42))

But there is no real place to attach a comment. Should it be on the +? On the assign?

For traceability it would of course be good to keep them somehow, but I will most likely recover them with some post-processing similar to what I do with docstrings.

As an exercise, consider writing a BNF for a language where comments are part of the BNF grammar. It would be possible, but you quickly realise that the only comments you can support sensibly are the ones between statements.

florianschanda · 2021-10-30T16:02:03Z

I am going to close this ticket as I think the functionality to extract comments is there; but if you have a concrete proposal we can open it again.

florianschanda · 2021-10-30T16:03:35Z

One more bit of detail on code generation:

I do plan to do this, but it's very far away
First I need to get semantic analysis working
Then I need to write down an actual formal grammar for my version of MATLAB (let's call it MISS_HIT)
Only then can I start thinking about doing code generation

florianschanda · 2021-10-30T16:57:27Z

OK, you know what, I will re-open this (but do not expect anything anytime soon). There are still massive issues here.

Some bits are easy, consider:

x = 1;
% potato
x = 2;

Parse tree here could be:

Sequence_Of_Statements
   Assignment
   Comment
   Assignment

Inline comments could be attached as a special attribute, most likely on the top-level statement. For example:

x = 1;  % potato

Gives us something like:

Sequence_Of_Statements
   Assignment
      Comment: <potato>
      LHS: Reference
         Identifier: <x>
      RHS: Literal
         Value: <1>

But where it gets fun is when it becomes ambiguous:

x = 1; % ok, so this is a comment
       % and it goes on a while

I bet you can guess where this is going :) One statement? Or a statement + a comment node?

We could come up with some rules, like if there is no spacing, it gets merged, but if there is, it is made separate.

Since we're never going to do anything with these comments that semantically makes a difference, it doesn't matter too much. But it's just a massive can of worms.
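
One way the merge rule could look, as a rough Python sketch. It assumes "spacing" means a blank line (or any gap in line numbers) between comment lines, and that the comments passed in sit on or after the statement's line with no other statement in between. The `Comment` class and `split_trailing_comments` are made-up names for illustration, not anything in MISS_HIT.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Comment:
    line: int
    text: str


def split_trailing_comments(
    stmt_line: int, comments: List[Comment]
) -> Tuple[List[Comment], List[Comment]]:
    """Split comments into (merged into the statement, separate nodes).

    A comment is merged if it is inline on stmt_line, or if it directly
    continues an already merged comment on the next line. The first gap
    ends the chain; everything after it becomes a separate comment node.
    """
    merged: List[Comment] = []
    separate: List[Comment] = []
    expected = stmt_line  # line that would continue the chain; None once broken
    for c in sorted(comments, key=lambda c: c.line):
        if expected is not None and c.line == expected:
            merged.append(c)
            expected += 1
        else:
            separate.append(c)
            expected = None
    return merged, separate


# x = 1; % ok, so this is a comment     (line 1)
#        % and it goes on a while        (line 2)
#                                        (line 3, blank)
# % unrelated                            (line 4)
comments = [
    Comment(1, "ok, so this is a comment"),
    Comment(2, "and it goes on a while"),
    Comment(4, "unrelated"),
]
merged, separate = split_trailing_comments(1, comments)
```

Under this rule a comment on its own line right after a statement that has no inline comment stays a separate node, which matches the easy case above.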

Maybe you can share a bit what you had in mind, or if the docstring thing does fix all your problems?

PeterTillema · 2021-11-01T09:32:04Z

This is indeed a tough question, so let me dump my thoughts about it here.

> x = 1;
> % potato
> x = 2;
>
> Parse tree here could be:
>
> Sequence_Of_Statements
>    Assignment
>    Comment
>    Assignment

Yep, that's a relatively easy one, exactly what I had in mind.

> Inline comments could be attached as a special attribute, most likely on the top-level statement. For example:
>
> x = 1;  % potato
>
> Gives us something like:
>
> Sequence_Of_Statements
>    Assignment
>       Comment: <potato>
>       LHS: Reference
>          Identifier: <x>
>       RHS: Literal
>          Value: <1>

Basically what I had in mind too.

> But where it gets fun is when it becomes ambiguous:
>
> x = 1; % ok, so this is a comment
>        % and it goes on a while
>
> I bet you can guess where this is going :) One statement? Or a statement + a comment node?
>
> We could come up with some rules, like if there is no spacing, it gets merged, but if there is, it is made separate.
>
> Since we're never going to do anything with these comments that semantically makes a difference, it doesn't matter too much. But it's just a massive can of worms.

Hmm, this is the hard one. I thought it would be nice if you have a list of comments connected to the top-most statement of these lines. So, in this case, it is attached to the Simple_Assignment_Statement (or something). Since an expression might cover multiple lines, and thus also multiple comments, it should be a list.

Other example:

a = 2, b=3 % comment

Here the comment should be attached to the "=" from b = 3, i.e. the last statement on the line the comment occurs on.

Full example:

% comment 1
a = 2; % comment 2
b = a + % comment 3
      4 % comment 4
c = 6, d = 9; % comment 5

Here the parse tree should be something like this (I don't know the exact names, but you get the point):

Root
   Comment: comment 1
   Simple_Assignment_Statement
      Comments
         Comment: comment 2
      LHS: Identifier (a)
      RHS: Number_Literal (2)
   Simple_Assignment_Statement
      Comments
         Comment: comment 3
         Comment: comment 4
      LHS: Identifier (b)
      RHS: Binary_Operation (+)
         LHS: Identifier (a)
         RHS: Number_Literal (4)
   Simple_Assignment_Statement
      Comments
      LHS: Identifier (c)
      RHS: Number_Literal (6)
   Simple_Assignment_Statement
      Comments
         Comment: comment 5
      LHS: Identifier (d)
      RHS: Number_Literal (9)
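
A post-processing pass along these lines could build the tree above. This is only a sketch in Python: `Statement`, its line-span fields, and `attach_comments` are hypothetical names, and it assumes each statement already knows the first and last source line it covers. A comment goes to the last statement (in source order) whose span includes the comment's line; if no statement covers that line, the comment stays standalone.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Statement:
    name: str
    first_line: int
    last_line: int
    comments: List[str] = field(default_factory=list)


def attach_comments(
    statements: List[Statement], comments: List[Tuple[int, str]]
) -> List[str]:
    """Attach (line, text) comments to statements; return the standalone ones."""
    standalone: List[str] = []
    for line, text in comments:
        owner = None
        for stmt in statements:  # statements are in source order
            if stmt.first_line <= line <= stmt.last_line:
                owner = stmt  # keep overwriting so the last match wins
        if owner is None:
            standalone.append(text)
        else:
            owner.comments.append(text)
    return standalone


# Line spans for the five-line example above.
stmts = [
    Statement("a", 2, 2),
    Statement("b", 3, 4),
    Statement("c", 5, 5),
    Statement("d", 5, 5),
]
comments = [(n, f"comment {n}") for n in range(1, 6)]
standalone = attach_comments(stmts, comments)
```

The sketch does not record where a standalone comment sits relative to the statements; a real implementation would need that to place `comment 1` before the first assignment under Root.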

Note that docstrings might be caught twice in that case (if you also run the post-processing function), but I think it's up to the user to handle that properly.

florianschanda added the "invalid" label Oct 30, 2021

florianschanda closed this as completed Oct 30, 2021

florianschanda added the "component: core", "component: parser", and "difficulty: extreme" labels and removed the "invalid" label Oct 30, 2021

florianschanda reopened this Oct 30, 2021

florianschanda mentioned this issue Dec 22, 2021

Determine Array_Index or Function_Call in AST #246

Open
