Fuzzing reveals a number of parse errors #568

Open
leonardr opened this issue Mar 20, 2023 · 2 comments

@leonardr

I'm the lead developer of Beautiful Soup, which has html5lib as an optional dependency. Over the past couple of years I've gotten a number of notifications from Google's oss-fuzz project about unhandled exceptions that actually turned out to be problems in html5lib. There wasn't much I could do with these errors, but now that it looks like html5lib maintenance is picking up, I can pass them on to you. (Sorry. 😿)

I've incorporated the fuzz reports into the Beautiful Soup test suite, and the test cases themselves are here, but here's a general picture of what problems I see. In each case, I believe just parsing the bad markup is enough to trigger the error.
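A minimal sketch for checking that claim, assuming html5lib is installed and that `html5lib.parse` with its default tree builder is a fair stand-in for the way Beautiful Soup drives the parser. The helper name `collect_failures` and the `FUZZ_CASES` dictionary are made up for illustration; the markup is copied from the two ASCII-only reports below. The `ñ` case is left out because the raw bytes behind that character are not clear from the report. Whether a given sample still fails will depend on the html5lib version and tree builder in use.

```python
# Markup copied verbatim from the reports in this issue (ASCII-only cases).
FUZZ_CASES = {
    "clusterfuzz-testcase-minimized-bs4_fuzzer-4999465949331456":
        b")<a><math><TR><a><mI><a><p><a>",
    "clusterfuzz-testcase-minimized-bs4_fuzzer-5843991618256896":
        b"-<math><sElect><mi><sElect><sElect>",
}


def collect_failures(cases, parse=None):
    """Parse each sample and return {case name: "ExceptionType: message"}.

    `parse` is any callable that accepts bytes. If it is omitted,
    html5lib.parse is used (assumption: html5lib is installed).
    """
    if parse is None:
        import html5lib  # imported lazily so a different parser can be swapped in
        parse = html5lib.parse

    failures = {}
    for name, markup in cases.items():
        try:
            parse(markup)
        except Exception as exc:  # AssertionError is a subclass of Exception
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Typical use, with html5lib available:
#     for name, error in collect_failures(FUZZ_CASES).items():
#         print(name, "->", error)
```

Passing the parser in as a callable keeps the loop reusable: the same cases can be run through Beautiful Soup or a patched html5lib build by supplying a different `parse` function.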

clusterfuzz-testcase-minimized-bs4_fuzzer-4999465949331456

Markup: b')<a><math><TR><a><mI><a><p><a>'

Error:

self = <html>, node = <p>, refNode = None

    def insertBefore(self, node, refNode):
>       index = self.element.index(refNode.element)
E       AttributeError: 'NoneType' object has no attribute 'element'

clusterfuzz-testcase-minimized-bs4_fuzzer-5843991618256896

Markup: b'-<math><sElect><mi><sElect><sElect>'

Error:

    def resetInsertionMode(self):
    ...
            # Check for conditions that should only happen in the innerHTML
            # case
            if nodeName in ("select", "colgroup", "head", "html"):
>               assert self.innerHTML
E               AssertionError

clusterfuzz-testcase-minimized-bs4_fuzzer-6241471367348224

Markup: b'ñ<table><svg><html>'

Error:

self = <html5lib.html5parser.getPhases.<locals>.InTablePhase object at 0x7f8f405ad440>

    def processEOF(self):
        if self.tree.openElements[-1].name != "html":
            self.parser.parseError("eof-in-table")
        else:
>           assert self.parser.innerHTML
E           AssertionError

clusterfuzz-testcase-minimized-bs4_fuzzer-6600557255327744

Markup: b'\t<TABLE><<!>;<!><<!>.<lec><th>i><a><mat\x00\x01<mi\x00a><math>><th><mI>chardeta\xff\xff\xff\xff<><th><mI><||||||||A<select><>qu?\xbemath><th><mie>qu'

Error:

self = <html5lib.html5parser.getPhases.<locals>.InTableBodyPhase object at 0x7f8f4184ce00>

    def clearStackToTableBodyContext(self):
        while self.tree.openElements[-1].name not in ("tbody", "tfoot",
                                                      "thead", "html"):
            # self.parser.parseError("unexpected-implied-end-tag-in-table",
            #  {"name": self.tree.openElements[-1].name})
            self.tree.openElements.pop()
        if self.tree.openElements[-1].name == "html":
>           assert self.parser.innerHTML
E           AssertionError

I was also recently sent a report of the problem that is already filed here as issue #557.

@leonardr

Another such error: clusterfuzz-testcase-minimized-bs4_fuzzer-6401239223762944

Markup: <math>\x10<select><mi><select><select>t

Same assert self.parser.innerHTML AssertionError as seen before. Going forward I'll probably only mention issues that look new.
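One rough way to decide whether a fuzz report "looks new" is to group failures by exception type and the function in which the exception was raised. The sketch below is only an illustration of that idea, not how oss-fuzz or Beautiful Soup triage reports; `failure_signature` and `group_by_signature` are hypothetical names, and `parse` is any callable that accepts the markup (for example `html5lib.parse`, if html5lib is installed).

```python
import traceback


def failure_signature(exc):
    """Return (exception type name, name of the function that raised it)."""
    frames = traceback.extract_tb(exc.__traceback__)
    innermost = frames[-1].name if frames else "<unknown>"
    return (type(exc).__name__, innermost)


def group_by_signature(cases, parse):
    """Map each failure signature to the list of case names that produce it."""
    groups = {}
    for name, markup in cases.items():
        try:
            parse(markup)
        except Exception as exc:
            groups.setdefault(failure_signature(exc), []).append(name)
    return groups
```

With this grouping, the two `<select>`-inside-`<math>` samples would be expected to share one signature if they both fail on the same assertion, while a failure in a different function would get a group of its own.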

@leonardr

leonardr commented Jun 16, 2023

This one is different from the rest:

Markup: b'y<framesetboheadrb$al>t<table><><t><th><math><th>u<\x0ch><mi><thx><TR>ind><<meta><i<isind<i\xff\xff\xff\xffex><select><<tr>i=ut\x00\x007>'

Raises an IndexError:

  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 284, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 133, in _parse
    self.mainLoop()
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 240, in mainLoop
    new_token = phase.processStartTag(new_token)
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 469, in processStartTag
    return func(token)
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 2232, in startTagTableOther
    self.closeCell()
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 2220, in closeCell
    self.endTagTableCell(impliedTagToken("th"))
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 2254, in endTagTableCell
    self.tree.clearActiveFormattingElements()
  File "/usr/lib/python3/dist-packages/html5lib/treebuilders/base.py", line 265, in clearActiveFormattingElements
    entry = self.activeFormattingElements.pop()
IndexError: pop from empty list
