gh-102555: Fix comment parsing in HTMLParser #135664

Merged (8 commits) on Jul 4, 2025
18 changes: 17 additions & 1 deletion Lib/html/parser.py
@@ -29,7 +29,8 @@
starttagopen = re.compile('<[a-zA-Z]')
endtagopen = re.compile('</[a-zA-Z]')
piclose = re.compile('>')
-commentclose = re.compile(r'--\s*>')
+commentclose = re.compile(r'--!?>')
+commentabruptclose = re.compile(r'-?>')
# Note:
# 1) if you change tagfind/attrfind remember to update locatetagend too;
# 2) if you change tagfind/attrfind and/or locatetagend the parser will
@@ -336,6 +337,21 @@ def parse_html_declaration(self, i):
         else:
             return self.parse_bogus_comment(i)

+    # Internal -- parse comment, return length or -1 if not terminated
+    # see https://html.spec.whatwg.org/multipage/parsing.html#comment-start-state
+    def parse_comment(self, i, report=True):
@Privat33r-dev (Contributor), Jun 25, 2025:

I wonder whether this change should be made in _markupbase instead.

cpython/Lib/_markupbase.py

Lines 165 to 175 in c2f2fd4

def parse_comment(self, i, report=1):
    rawdata = self.rawdata
    if rawdata[i:i+4] != '<!--':
        raise AssertionError('unexpected call to parse_comment()')
    match = _commentclose.search(rawdata, i+4)
    if not match:
        return -1
    if report:
        j = match.start(0)
        self.handle_comment(rawdata[i+4: j])
    return match.end(0)

If the method is overridden here, nothing else uses the original method, and it becomes dead code.
https://github.com/search?q=repo%3Apython%2Fcpython%20parse_comment&type=code

Member:

I think it's fine to override the method here. Since _markupbase was made internal/private in Python 3 and is only used by html.parser, it makes sense to me to add new code directly to html.parser (and possibly even merge _markupbase into html.parser eventually).

Regarding the (now) dead code, we could leave it as is, add a comment noting that the method is unused/overridden, or delete it. The first two options are less destructive, but since the module is private there shouldn't be much concern about breaking backward compatibility (and if anyone is relying on the original implementation, they are probably using it through HTMLParser anyway).

Member Author:

_markupbase may be used by third-party code (for example, if it is also the base for an SGML parser or other parsers), so it is better not to touch it in maintained versions. Changing it there could break such code if SGML has different rules for comments. In the development branch, however, we can remove it once all the other bug fixes are finished.

+        rawdata = self.rawdata
+        assert rawdata.startswith('<!--', i), 'unexpected call to parse_comment()'
+        match = commentclose.search(rawdata, i+4)
+        if not match:
+            match = commentabruptclose.match(rawdata, i+4)
+            if not match:
+                return -1
+        if report:
+            j = match.start()
+            self.handle_comment(rawdata[i+4: j])
+        return match.end()
+
     # Internal -- parse bogus comment, return length or -1 if not terminated
     # see https://html.spec.whatwg.org/multipage/parsing.html#bogus-comment-state
     def parse_bogus_comment(self, i, report=1):
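
The lookup order in the new parse_comment() can be tried outside the parser with a small standalone sketch. The two patterns are copied from the diff; the helper name `find_comment` and its return shape are assumptions made for illustration and are not part of html.parser.

```python
import re

# Patterns copied from the diff above.
commentclose = re.compile(r'--!?>')
commentabruptclose = re.compile(r'-?>')


def find_comment(rawdata, i=0):
    """Return (comment_text, end_index), or None if the comment is unterminated.

    Hypothetical helper that mirrors the control flow of the new
    parse_comment(); it is not part of html.parser.
    """
    if not rawdata.startswith('<!--', i):
        raise ValueError('not at the start of a comment')
    # Normal case: search the rest of the input for '-->' or '--!>'.
    match = commentclose.search(rawdata, i + 4)
    if not match:
        # Fallback: '>' or '->' directly after '<!--', i.e. the abruptly
        # closed empty comments '<!-->' and '<!--->'.
        match = commentabruptclose.match(rawdata, i + 4)
        if not match:
            return None
    return rawdata[i + 4:match.start()], match.end()


for sample in ('<!---->', '<!--->', '<!-->', '<!--a--!>', '<!---- >-->'):
    print(repr(sample), '->', find_comment(sample))
```

Each sample here is a single comment on its own. Because the first step is a search rather than an anchored match, the fallback is only tried when no `-->` or `--!>` occurs anywhere in the remaining input that is passed in.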
32 changes: 30 additions & 2 deletions Lib/test/test_htmlparser.py
@@ -367,17 +367,45 @@ def test_comments(self):
         html = ("<!-- I'm a valid comment -->"
                 '<!--me too!-->'
                 '<!------>'
+                '<!----->'
                 '<!---->'
+                # abrupt-closing-of-empty-comment
+                '<!--->'
+                '<!-->'
                 '<!----I have many hyphens---->'
                 '<!-- I have a > in the middle -->'
-                '<!-- and I have -- in the middle! -->')
+                '<!-- and I have -- in the middle! -->'
+                '<!--incorrectly-closed-comment--!>'
+                '<!----!>'
+                '<!----!-->'
+                '<!---- >-->'
+                '<!---!>-->'
+                '<!--!>-->'
+                # nested-comment
+                '<!-- <!-- nested --> -->'
+                '<!--<!-->'
+                '<!--<!--!>'
+                )
         expected = [('comment', " I'm a valid comment "),
                     ('comment', 'me too!'),
                     ('comment', '--'),
+                    ('comment', '-'),
                     ('comment', ''),
+                    ('comment', ''),
+                    ('comment', ''),
                     ('comment', '--I have many hyphens--'),
                     ('comment', ' I have a > in the middle '),
-                    ('comment', ' and I have -- in the middle! ')]
+                    ('comment', ' and I have -- in the middle! '),
+                    ('comment', 'incorrectly-closed-comment'),
+                    ('comment', ''),
+                    ('comment', '--!'),
+                    ('comment', '-- >'),
+                    ('comment', '-!>'),
+                    ('comment', '!>'),
+                    ('comment', ' <!-- nested '), ('data', ' -->'),
+                    ('comment', '<!'),
+                    ('comment', '<!'),
+                    ]
         self._run_check(html, expected)

     def test_condcoms(self):
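
To try inputs like the ones in this test by hand, a minimal event collector along the following lines may help. `CommentCollector` and `collect` are hypothetical names used for illustration; they are not the test suite's own helpers, which are not shown on this page. Results for the newly added cases (`--!>`, `<!-->`, `<!--->`) depend on whether the running Python includes this fix, so no output is stated for them.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: records comment and data events as tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(('comment', data))

    def handle_data(self, data):
        self.events.append(('data', data))


def collect(html):
    """Feed one string to a fresh parser and return the recorded events."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.events


# Well-formed comments are reported the same way before and after the fix.
print(collect("<!-- I'm a valid comment --><!--me too!-->"))

# Version-dependent: the result differs between interpreters with and
# without this change.
print(collect('<!--incorrectly-closed-comment--!>'))
```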
@@ -0,0 +1,3 @@
Fix comment parsing in :class:`html.parser.HTMLParser` according to the
HTML5 standard. ``--!>`` now ends the comment. ``-- >`` no longer ends the
comment. Support abnormally ended empty comments ``<!-->`` and ``<!--->``.
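
The change in which closers are accepted can be checked directly against the old and new patterns from the parser.py diff. This is a small sketch; the variable names `old_commentclose` and `new_commentclose` are chosen here for illustration only.

```python
import re

old_commentclose = re.compile(r'--\s*>')  # pattern removed by this PR
new_commentclose = re.compile(r'--!?>')   # pattern added by this PR

# Compare which candidate closers each pattern accepts in full.
for closer in ('-->', '--!>', '-- >'):
    old = bool(old_commentclose.fullmatch(closer))
    new = bool(new_commentclose.fullmatch(closer))
    print(f'{closer!r}: old={old} new={new}')
```

`-->` is accepted by both patterns, `--!>` only by the new one, and `-- >` only by the old one.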