-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add start and end properties to WikiText AST nodes for all elements. #7866
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
@Gk0Wk I have not reviewed the code yet but this is a long overdue improvement, thank you. I note that the tests are failing and need to be updated. |
Thanks @Gk0Wk this would indeed be a welcome improvement. Given the scale of the changes needed to the tests, I wonder if we might also think about a further modification that has been discussed in the past: adding a new parse tree node property that (for example) distinguishes a Without the ability to distinguish them, it would remain impossible to reconstruct the original wikitext from the parse tree. The property could be Would that be useful in the applications you have in mind? |
@Jermolene Yes, I completely agree. In fact, today I also wanted to do the same thing, that is, to consider the However, I'm not sure if this alone is sufficient to describe everything completely. What I mean is, can we reconstruct the original wiki string without any discrepancies solely based on these? Is it possible that we still need more context? Nevertheless, this would indeed be better than before. The previous AST was geared towards widget generation, or rather, it was oriented towards rendering, and not for describing all the information. I've also noticed that some rules do not generate any nodes, such as Meanwhile, plain text is treated as a paragraph element and does not correspond to any rule. |
I've looked at the error messages from the testing process, but I can't understand what they mean. For example:
Can someone help explain them to me? Is this because I need to further modify the expected results in |
We can do that in another PR, I think that is difficult. Also, I've been long waiting for macro call, distinguish from transclusion #7646 |
I think it would be more useful if the "rule" property contained a string. It's very useful to be able to save parse trees in JSON.
I think we would indeed still have some more work to do, but it would be a huge improvement.
Yes, exactly.
Yes, we would need to retain some ore information in the parse tree. That may well have performance implications of course.
I think a reasonable initial goal might be to be able to convert parse trees to a kind of normalised wikitext translation, where we don't have 100% fidelity of the source, but the rendered output is identical.
Yes, I am afraid so. It is a lot of work. It may well be that some significant refactoring might be needed. For example, the quickest way to get the existing tests running again might be to hack the parse function that it uses to remove the new attributes. Of course then we'd need another set of tests for the new attributes, but we could use the newer wikitext test format for those. |
I would be not happy with a test-hack. We do not have the best test coverage for a project that size and hacking tests IMO makes the unreliable. I did not run the tests with this PR yet, but I think basically all of the parser related tests fail now. -> Because they should. I'll have a closer look |
I also think so. IMO this PR should be merged as is after the tests have been fixed and in a second step we can add more info to the AST |
This is already done in https://github.com/tiddly-gittly/wikiast , should I PR it to the core? But it rely on unist library, and due to complicity of AST operation, I can't do this without TS or JSDoc. |
…add more range attributes
I have modified all the expected results in test-wikitext-parser.js and verified the correctness of the expected results; the tests have passed. |
@Jermolene @pmario New changes for html aprser: OverviewFixing a parsing issue within the WikiText HTML syntax parser. The original implementation incorrectly set the Changes Made
These modifications ensure that the AST node for HTML elements now correctly encapsulates the complete range of the HTML snippet, allowing for more precise manipulation and analysis. |
Great stuff thank you @Gk0Wk |
Tagging @flibbles – would these changes to the parse tree be helpful and/or problematic for Uglify/Relink? |
Tests all pass for this. It's additive rather than changing, so shouldn't be a problem. I actually considered proposing something like this myself a while back. It would have made it much easier for my plugins to figure things out instead of having to count out start and stop for themselves after-the-fact. |
Hi @Gk0Wk @Jermolene |
@Gk0Wk can we please add some tests for these cases? |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reviewed this PR, and the tests look sufficient however we should make sure that there is not any significant performance regression introduced here.
There are a few code style issues that need attention.
var styleStart = templateEnd + 2; | ||
var styleEnd = styleStart + (this.match[4] ? this.match[4].length : 0); | ||
var classesStart = styleEnd + 1; | ||
var classesEnd = classesStart + (this.match[5] ? this.match[5].length : 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always collapse multiple var statements into one:
var filterStart = this.parser.pos + 3,
filterEnd = .....
This applies to all changes throughout this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this apply while there is assignment?
Hope this can be done by eslint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would need to be resolved manually.
@@ -42,21 +52,21 @@ exports.parse = function() { | |||
var node = { | |||
type: "list", | |||
attributes: { | |||
filter: {type: "string", value: filter} | |||
filter: {type: "string", value: filter, start: filterStart, end: filterEnd}, |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean use {type: "string", value: filter, filterStart, filterEnd}
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@linonetwo I must have misread the code this morning, this code is fine and needs no further changes, and this comment can be ignored.
} | ||
// Treat it as a paragraph if we didn't find a block rule | ||
var start = this.pos; | ||
var children = this.parseInlineRun(terminatorRegExp); | ||
var end = this.pos; | ||
return [{type: "element", tag: "p", children: children, start: start, end: end }]; | ||
return [{type: "element", tag: "p", children: children, start: start, end: end, rule: null }]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jermolene flagging for your review whether there might be a more appropriate value than null for the rule property here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My instinct would be to omit the "rule" property entirely in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seem to be a bad idea now, because I need this info to restore \n\n in #8258
if (tag.closeTagStart < closeTagMinPos) { | ||
tag.closeTagStart = tag.end; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Gk0Wk @Jermolene we should make sure that this PR does not introduce any significant performance regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this won't cost much, because it just while
from >
to the <
in <div>
, won't step more than <blockquote>
But why would we need open/closeTagStart/End ? Can this feature be turn on when needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Gk0Wk much appreciated. There are a few coding style issues that need cleaning up, and then I am keen to merge this as soon as possible.
} | ||
// Treat it as a paragraph if we didn't find a block rule | ||
var start = this.pos; | ||
var children = this.parseInlineRun(terminatorRegExp); | ||
var end = this.pos; | ||
return [{type: "element", tag: "p", children: children, start: start, end: end }]; | ||
return [{type: "element", tag: "p", children: children, start: start, end: end, rule: null }]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My instinct would be to omit the "rule" property entirely in this case.
var start = this.pos; | ||
var subTree = nextMatch.rule.parse(); | ||
// Set the start and end positions of the first and last child if they're not already set | ||
if (subTree.length > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have whitespace between if
and (
// Set the start and end positions of the first and last child if they're not already set | ||
if (subTree.length > 0) { | ||
// Set the start and end positions of the first and last child if they're not already set | ||
if (subTree[0].start === undefined) subTree[0].start = start; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always use braces and a new line for the body of if statements.
tag.closeTagStart = -1; | ||
break; | ||
} | ||
if (char === '<') break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always use double quotes for string constants.
commit 5687d9f Author: Gk0Wk <[email protected]> Date: Wed Dec 6 11:33:43 2023 +0800 Fix for html parser commit df0a1b1 Author: Gk0Wk <[email protected]> Date: Wed Dec 6 02:47:47 2023 +0800 Fix HTML AST node boundary parsing in WikiText commit ac8dda0 Author: Gk0Wk <[email protected]> Date: Sat Dec 2 13:02:52 2023 +0800 update test-wikitext-parser.js, change for-const-of -to .utils.each, add more range attributes commit e2b9a4e Author: Gk0Wk <[email protected]> Date: Wed Nov 29 22:35:39 2023 +0800 Add more start-end range attributes for AST commit d3e62ec Author: Gk0Wk <[email protected]> Date: Wed Nov 29 20:45:00 2023 +0800 Add rule attribute for WikiText AST nodes commit 4200495 Author: Gk0Wk <[email protected]> Date: Wed Nov 29 15:48:38 2023 +0800 Add start and end properties to AST nodes for list, codeblock, and all other elements
Enhancements to WikiText Parser AST
Summary
This pull request introduces a significant enhancement to the WikiText parser by adding
start
andend
properties to the AST nodes that did not previously have them. This update includes nodes such as lists, codeblocks, and various other elements that were lacking these properties.Background
The ability to identify the exact substring range of a syntax subtree within the source file is crucial for advanced functionalities like section editing. While some progress had been made in the past by adding
start
andend
properties to paragraph elements, other nodes were still missing this information.Changes
With this PR, I've implemented a uniform check across all syntax rules during the AST generation process. Now, every node produced by the parser will have its corresponding
start
andend
properties, marking the range of the substring it represents in the source text.Benefits
Testing
Before:
After:
I look forward to your feedback on these enhancements!