Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfctr(html): replace html parser #3218

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

scanny
Copy link
Collaborator

@scanny scanny commented Jun 17, 2024

Summary
Replace legacy HTML parser with recursive version that captures all content and provides flexibility to add new metadata. It's also substantially faster although that's just a happy side-effect.

Additional Context
The prior HTML parsing algorithm that makes up the core of HTML partitioning was buggy and very difficult to reason about because it did not conform to the inherently recursive structure of HTML. The new version retains lxml as the performant and reliable base library but uses lxml's custom element classes to efficiently classify HTML elements by their behaviors (block-item and inline (phrasing) primarily) and give those elements the desired partitioning behaviors.

This solves a host of existing problems with content being skipped and elements (paragraphs) being divided improperly, but also provides a clear domain model for reasoning about its behavior and reliably adjusting it to suit our existing and future purposes.

The parser's operation is recursive, closely modeling the recursive structure of HTML itself. It's behaviors are based on the HTML Standard and reliably produce proper and explainable results even for novel cases.

Fixes #2325
Fixes #2562
Fixes #2675
Fixes #3168
Fixes #3227
Fixes #3228
Fixes #3230
Fixes #3237
Fixes #3245
Fixes #3247
Fixes #3255
Fixes #3309

BEHAVIOR DIFFERENCES

emphasized_text_tags encoding is changed:

  • <strong> is encoded as "b" rather than "strong".
  • <em> is encoded as "i" rather than "em".
  • <span> is no longer recorded in emphasized_text_tags (because without the CSS we can't tell whether it's used for emphasis or if so what kind).
  • nested emphasis (e.g. bold+italic) is encoded as multiple characters ("bi").
  • emphasized_text_contents is broken on emphasis-change boundaries, like:
     `<p>foo <b>bar <i>baz</i> bada</b> bing</p>`
    produces:
    {
      "emphasized_text_contents": ["bar", "baz", "bada"],
      "emphasized_text_tags": ["b", "bi", "b"]
    }
    whereas previously it would have produced:
    {
      "emphasized_text_contents": ["bar baz bada", "baz"],
      "emphasized_text_tags": ["b", "i"]
    }

<pre> text is preserved as it appears in the html

Except that a leading newline is removed if present (has to be in position 0 of text). Also, a trailing newline is stripped but only if it appears in the very last position ([-1]) of the <pre> text. Old parser stripped all leading and trailing whitespace.

Result is that:

<pre>
foo
bar
baz
</pre>

parses to "foo\nbar\nbaz" which is the same result produced for:

<pre>foo
bar
baz</pre>

This equivalence is the same behavior exhibited by a browser, which is why we did the extra work to make it this way.

Whitespace normalization

Leading and trailing whitespace are removed from element text, just as it is removed in the browser. Runs of whitespace within the element text are reduced to a single space character (like in the browser). Note this means that \t, \n, and &nbsp; are replaced with a regular space character. All text derived from elements is whitespace normalized except the text within a <pre> tag. Any leading or trailing newline is trimmed from <pre> element text; all other whitespace is preserved just as it appeared in the HTML source.

link_start_indexes metadata is no longer captured. Rationale:

  • It was frequently wrong, often -1.
  • It was deprecated but then added back in a community PR.
  • Maintaining it across any possible downstream transformations (e.g. chunking) would be expensive and almost certainly lead to wrong values as distant code evolves.
  • It is complex to compute and recompute when whitespace is normalized, adding substantial complexity to the code and reducing readability and maintainability

<br/> element is replaced with a single newline ("\n")

but that is usually replaced with a space in Element.text when it is normalized. The newline is preserved within a <pre> element.

  • Related: No paragraph-break on <br/><br/>

Empty h1..h6 elements are dropped.

HTML heading elements (<h1..h6>) are "skipped" (do not generate a Title element) when they contain no text or contain only whitespace.

@scanny scanny force-pushed the scanny/replace-html-parser branch from 105fa52 to 16ade6e Compare June 17, 2024 05:26
@scanny scanny force-pushed the scanny/replace-html-parser branch from 16ade6e to df885d6 Compare June 17, 2024 07:22
@scanny scanny force-pushed the scanny/replace-html-parser branch from df885d6 to 8e136e7 Compare June 17, 2024 07:26
@scanny scanny force-pushed the scanny/replace-html-parser branch from 8e136e7 to fde32a2 Compare June 17, 2024 17:40
@scanny scanny force-pushed the scanny/replace-html-parser branch from fde32a2 to 80f3808 Compare June 17, 2024 19:43
@scanny scanny force-pushed the scanny/replace-html-parser branch from 80f3808 to fb6eb31 Compare June 17, 2024 19:45
@scanny scanny force-pushed the scanny/replace-html-parser branch from 1567c91 to fa5dba3 Compare June 17, 2024 21:09
@scanny scanny force-pushed the scanny/replace-html-parser branch from fa5dba3 to 51d5519 Compare June 17, 2024 22:10
@scanny scanny force-pushed the scanny/replace-html-parser branch 2 times, most recently from e2ca119 to 5c85c6d Compare June 17, 2024 22:39
Adjust unit tests that relied on behaviors changed by the new HTML
parser.
These differences have all been accounted for by bug-fixes and desired
behavior changes of the new HTML parser.
@@ -79,15 +79,15 @@ re endowed with reason and conscience and should act towards one another in=
ty and rights. They are endowed with reason and conscience and should act t=
owards one another in a spirit of brotherhood. All human beings are born fr=
ee and equal in dignity and rights. They are endowed with reason and consci=
ence and should act towards one another in a spirit of brotherhood.<br><br>=
ence and should act towards one another in a spirit of brotherhood. <p> =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the example doc here supposed to change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This email is used by the test that verifies the detect_language_per_element behaviors for EML. That test depends on the body of this email resolving to distinct paragraphs so it can verify each Element (paragraph) was assigned a language separately.

The original HTML text of this email is not separate paragraphs, it just "looks" like separate paragraphs. So that test started failing.

It's a legitimate question whether we want to "double-parse" <div> elements like this one and consider <br/><br/> to be a paragraph boundary. But that's not the behavior this email's test is verifying so I went with the element change.

Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added inline comments for an initial review.

@amanda103 or @cragwolfe - do you have any input on the behavior changes in the PR description?

" <div>Get excited for our first annual family day!</div>\n"
' <div>Best.<br clear=3D"all">\n'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this supposed to have changed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, definitely didn't need to change. I don't know why it did. =3D is a synonym for = and is only used in email as far as I know. I could add it back if you want. Otherwise it's a random test that both variants work :)

"element_id": "568c824acda361cfc270a75e2eca7a23",
"text": "Weather.gov >",
"element_id": "6ef25d028f4c859b5d955c777c4c597e",
"text": "Weather.gov > News Around NOAA > Are You Weather-Ready for the Spring?",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good change to me, with link texts captured below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the HTML behind this looks like this:

<div>
  <a href="https://www.weather.gov">Weather.gov</a>
  &gt; <a href="https://www.weather.gov/news">News Around NOAA</a>
  &gt; Are You Weather-Ready for the Spring?
</div>

All the content of the <div> is phrasing (inline) content so it's all considered the same paragraph and therefore Element.

The old parser would break the paragraph/element when there were two <a> elements in succession.

That's sometimes the right choice, but only when the <a> element is itself surrounding a block element like <div> or <p>.

@@ -335,7 +335,7 @@
"eng"
]
},
"text": "\\uD83D\\uDD17 Reference materials",
"text": "\\uD83D\\uDD17 Reference materials",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just different whitespace?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old one is a non-breaking space and the new one is a plain space.

This is an aspect of the whitespace-normalization difference. Each run of whitespace in the text for an element is reduced to a single space (after leading and trailing whitespace are stripped from the paragraph).

&nbsp; is included in whitespace as are "\n", "\t", and several others.

@@ -37,7 +37,7 @@
"Win"
],
"emphasized_text_tags": [
"strong"
"b"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

],
"filetype": "text/html",
"languages": [
"eng"
]
},
"text": "Select Insert",
"type": "ListItem"
"type": "Title"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scanny - Do you know what caused these types to flip?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an edge case we probably need to noodle a little bit. The HTML looks like this:

<li>
  <p>Select <strong>Insert</strong></p>
</li>

Current behavior

  • <li> is a block element. That means it starts a new paragraph->Element, as opposed to phrasing/inline content which just extends the paragraph it belongs to.
  • <p> is also a block element.
  • A block element (like <li> in this case) generates one element for the text it contains and also one for each block-item child it contains.
  • <li> contains 0 text and 1 block child so it generates the Element formed from the <p> element which in this case resolves to Title.

Complication

  • It would be reasonable to say that because the <p> is "inside" the <li> that the element generated should be considered a ListItem instead of whatever type its text would classify to (Title in this case but could be NarrativeText etc.).
  • This idea is given further credence by the fact this nested <p> text displays exactly that way in the browser; there is no separate paragraph "container", the <p> text just appears as the list item.
  • The complication arises when the <li> contains multiple block elements or when it contains text as well as one or more block elements:
<ul>
  <li>
    <p>Lorem</p>
    <p>Ipsum</p>
    <p>Dolor</p>
  </li>
  <li>Consectetur</li>
  <li>Adipiscing</li>
</ul>
  • In the "contains multiple block-items" case, the browser displays the first block item as the list item and the remaining block items as "list-continuation" paragraphs as in this screenshot:
    Screenshot 2024-06-25 at 12 58 15 PM
  • However, there is no ListContinuation Element-type, so it's not immediately clear how to handle this conundrum.

Conclusion
Not having a good ready answer to the problem I went with the simplest thing that could possibly work and that was to leave the behavior as it defaults.

We could noodle it more or wait for feedback where it seems to matter for someone and think more about how to handle this case in the context of a concrete example.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, the old parser got this particular one right, but it was at the cost of squashing the entire contents of the <li> element into a single element, including when it contained a nested list: #3245

That's the behavior that would squash a long list of nested list-items into a single ListItem element, like this example: https://github.com/Unstructured-IO/unstructured/pull/3218/files?show-deleted-files=true&show-viewed-files=true&file-filters%5B%5D=#diff-78e86d31815f4af1fe1cddeafa40432b9f5331fcd0385696ca81e1636cbdabf8L127

Original HTML:

<ul id="subMenuNav">
  <li>
    <div class="subMenuNavList section-link">
      <a href="http://www.weather.gov/safetycampaign">Weather Safety</a>
    </div>

    <div class="sub">
      <ul>
        <li><a href="https://www.weather.gov/safety/airquality">Air Quality</a></li>
        <li><a href="https://www.weather.gov/safety/beachhazards"> Beach Hazards</a></li>
        <li><a href="https://www.weather.gov/safety/cold">Cold</a></li>
        <li><a href="https://www.weather.gov/safety/coldwater">Cold Water</a></li>
        <li><a href="https://www.weather.gov/safety/drought">Drought</a></li>
        <li><a href="https://www.weather.gov/safety/flood">Floods</a></li>
        <li><a href="https://www.weather.gov/safety/fog">Fog</a></li>
        <li><a href="https://www.weather.gov/safety/heat">Heat</a></li>
        <li><a href="https://www.weather.gov/safety/hurricane">Hurricanes</a></li>
        <li><a href="https://www.weather.gov/safety/lightning">Lightning Safety</a></li>
        <li><a href="https://www.weather.gov/safety/ripcurrent"> Rip Currents</a></li>
        <li><a href="https://www.weather.gov/safety/safeboating"> Safe Boating</a></li>
        <li><a href="https://www.weather.gov/safety/space">Space Weather</a></li>
        <li><a href="https://www.weather.gov/safety/heat-uv">Sun (Ultraviolet Radiation)</a></li>
        <li><a href="https://www.weather.gov/safety/thunderstorm"> Thunderstorms &amp; Tornadoes</a></li>
        <li><a href="https://www.weather.gov/safety/tornado">Tornado</a></li>
        <li><a href="https://www.weather.gov/safety/tsunami">Tsunami</a></li>
        <li><a href="https://www.weather.gov/safety/wildfire">Wildfire</a></li>
        <li><a href="https://www.weather.gov/safety/wind">Wind</a></li>
        <li><a href="https://www.weather.gov/safety/winter">Winter</a></li>
      </ul>
    </div>
  </li>
  ...

@@ -1,8 +1,9 @@
## 0.14.8-dev3
## 0.14.8-dev4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's bump to 0.15.0 when we release this since it's a major change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, will do.

Copy link

@sebastianSRS12 sebastianSRS12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the <br</br> for <p>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment