rfctr(html): replace html parser #3218

scanny · 2024-06-17T04:19:56Z

Summary
Replace legacy HTML parser with recursive version that captures all content and provides flexibility to add new metadata. It's also substantially faster although that's just a happy side-effect.

Additional Context
The prior HTML parsing algorithm that makes up the core of HTML partitioning was buggy and very difficult to reason about because it did not conform to the inherently recursive structure of HTML. The new version retains lxml as the performant and reliable base library but uses lxml's custom element classes to efficiently classify HTML elements by their behaviors (block-item and inline (phrasing) primarily) and give those elements the desired partitioning behaviors.

This solves a host of existing problems with content being skipped and elements (paragraphs) being divided improperly, but also provides a clear domain model for reasoning about its behavior and reliably adjusting it to suit our existing and future purposes.

The parser's operation is recursive, closely modeling the recursive structure of HTML itself. It's behaviors are based on the HTML Standard and reliably produce proper and explainable results even for novel cases.

Fixes #2325
Fixes #2562
Fixes #2675
Fixes #3168
Fixes #3227
Fixes #3228
Fixes #3230
Fixes #3237
Fixes #3245
Fixes #3247
Fixes #3255
Fixes #3309

BEHAVIOR DIFFERENCES

`emphasized_text_tags` encoding is changed:

<strong> is encoded as "b" rather than "strong".
<em> is encoded as "i" rather than "em".
<span> is no longer recorded in emphasized_text_tags (because without the CSS we can't tell whether it's used for emphasis or if so what kind).
nested emphasis (e.g. bold+italic) is encoded as multiple characters ("bi").

emphasized_text_contents is broken on emphasis-change boundaries, like:

 `<p>foo <b>bar <i>baz</i> bada</b> bing</p>`

produces:

{
  "emphasized_text_contents": ["bar", "baz", "bada"],
  "emphasized_text_tags": ["b", "bi", "b"]
}

whereas previously it would have produced:

{
  "emphasized_text_contents": ["bar baz bada", "baz"],
  "emphasized_text_tags": ["b", "i"]
}

`<pre>` text is preserved as it appears in the html

Except that a leading newline is removed if present (has to be in position 0 of text). Also, a trailing newline is stripped but only if it appears in the very last position ([-1]) of the <pre> text. Old parser stripped all leading and trailing whitespace.

Result is that:

<pre>
foo
bar
baz
</pre>

parses to "foo\nbar\nbaz" which is the same result produced for:

<pre>foo
bar
baz</pre>

This equivalence is the same behavior exhibited by a browser, which is why we did the extra work to make it this way.

Whitespace normalization

Leading and trailing whitespace are removed from element text, just as it is removed in the browser. Runs of whitespace within the element text are reduced to a single space character (like in the browser). Note this means that \t, \n, and   are replaced with a regular space character. All text derived from elements is whitespace normalized except the text within a <pre> tag. Any leading or trailing newline is trimmed from <pre> element text; all other whitespace is preserved just as it appeared in the HTML source.

`link_start_indexes` metadata is no longer captured. Rationale:

It was frequently wrong, often -1.
It was deprecated but then added back in a community PR.
Maintaining it across any possible downstream transformations (e.g. chunking) would be expensive and almost certainly lead to wrong values as distant code evolves.
It is complex to compute and recompute when whitespace is normalized, adding substantial complexity to the code and reducing readability and maintainability

`<br/>` element is replaced with a single newline (`"\n"`)

but that is usually replaced with a space in Element.text when it is normalized. The newline is preserved within a <pre> element.

Related: No paragraph-break on <br/><br/>

Empty `h1..h6` elements are dropped.

HTML heading elements (<h1..h6>) are "skipped" (do not generate a Title element) when they contain no text or contain only whitespace.

Adjust unit tests that relied on behaviors changed by the new HTML parser.

These differences have all been accounted for by bug-fixes and desired behavior changes of the new HTML parser.

MthwRobinson · 2024-06-25T17:57:54Z

example-docs/language-docs/eng_spa_mult.eml

@@ -79,15 +79,15 @@ re endowed with reason and conscience and should act towards one another in=
 ty and rights. They are endowed with reason and conscience and should act t=
 owards one another in a spirit of brotherhood. All human beings are born fr=
 ee and equal in dignity and rights. They are endowed with reason and consci=
-ence and should act towards one another in a spirit of brotherhood.<br><br>=
+ence and should act towards one another in a spirit of brotherhood. <p>    =


Was the example doc here supposed to change?

This email is used by the test that verifies the detect_language_per_element behaviors for EML. That test depends on the body of this email resolving to distinct paragraphs so it can verify each Element (paragraph) was assigned a language separately.

The original HTML text of this email is not separate paragraphs, it just "looks" like separate paragraphs. So that test started failing.

It's a legitimate question whether we want to "double-parse" <div> elements like this one and consider <br/><br/> to be a paragraph boundary. But that's not the behavior this email's test is verifying so I went with the element change.

MthwRobinson

Added inline comments for an initial review.

@amanda103 or @cragwolfe - do you have any input on the behavior changes in the PR description?

test_unstructured/partition/html/test_partition.py

MthwRobinson · 2024-06-25T18:00:12Z

test_unstructured/partition/html/test_partition.py

        "  <div>Get excited for our first annual family day!</div>\n"
-        '  <div>Best.<br clear=3D"all">\n'


Was this supposed to have changed?

Hmm, definitely didn't need to change. I don't know why it did. =3D is a synonym for = and is only used in email as far as I know. I could add it back if you want. Otherwise it's a random test that both variants work :)

MthwRobinson · 2024-06-25T18:01:02Z

test_unstructured_ingest/expected-structured-output/azure/spring-weather.html.json

-    "element_id": "568c824acda361cfc270a75e2eca7a23",
-    "text": "Weather.gov >",
+    "element_id": "6ef25d028f4c859b5d955c777c4c597e",
+    "text": "Weather.gov > News Around NOAA > Are You Weather-Ready for the Spring?",


This looks like a good change to me, with link texts captured below

Yeah, the HTML behind this looks like this:

<div> <a href="https://www.weather.gov">Weather.gov</a> > <a href="https://www.weather.gov/news">News Around NOAA</a> > Are You Weather-Ready for the Spring? </div>

All the content of the <div> is phrasing (inline) content so it's all considered the same paragraph and therefore Element.

The old parser would break the paragraph/element when there were two <a> elements in succession.

That's sometimes the right choice, but only when the <a> element is itself surrounding a block element like <div> or <p>.

MthwRobinson · 2024-06-25T18:01:29Z

test_unstructured_ingest/expected-structured-output/confluence-diff/MFS/1540126.json

@@ -335,7 +335,7 @@
        "eng"
      ]
    },
-    "text": "\\uD83D\\uDD17 Reference materials",
+    "text": "\\uD83D\\uDD17 Reference materials",


Is this just different whitespace?

The old one is a non-breaking space and the new one is a plain space.

This is an aspect of the whitespace-normalization difference. Each run of whitespace in the text for an element is reduced to a single space (after leading and trailing whitespace are stripped from the paragraph).

  is included in whitespace as are "\n", "\t", and several others.

MthwRobinson · 2024-06-25T18:01:48Z

test_unstructured_ingest/expected-structured-output/confluence-diff/MFS/1605942.json

@@ -37,7 +37,7 @@
        "Win"
      ],
      "emphasized_text_tags": [
-        "strong"
+        "b"


MthwRobinson · 2024-06-25T18:02:24Z

test_unstructured_ingest/expected-structured-output/confluence-diff/MFS/1605956.json

      ],
      "filetype": "text/html",
      "languages": [
        "eng"
      ]
    },
    "text": "Select Insert",
-    "type": "ListItem"
+    "type": "Title"


@scanny - Do you know what caused these types to flip?

This is an edge case we probably need to noodle a little bit. The HTML looks like this:

<li> <p>Select <strong>Insert</strong></p> </li>

Current behavior

<li> is a block element. That means it starts a new paragraph->Element, as opposed to phrasing/inline content which just extends the paragraph it belongs to.

<p> is also a block element.

A block element (like <li> in this case) generates one element for the text it contains and also one for each block-item child it contains.

<li> contains 0 text and 1 block child so it generates the Element formed from the <p> element which in this case resolves to Title.

Complication

It would be reasonable to say that because the <p> is "inside" the <li> that the element generated should be considered a ListItem instead of whatever type its text would classify to (Title in this case but could be NarrativeText etc.).

This idea is given further credence by the fact this nested <p> text displays exactly that way in the browser; there is no separate paragraph "container", the <p> text just appears as the list item.

The complication arises when the <li> contains multiple block elements or when it contains text as well as one or more block elements:

<ul> <li> <p>Lorem</p> <p>Ipsum</p> <p>Dolor</p> </li> <li>Consectetur</li> <li>Adipiscing</li> </ul>

In the "contains multiple block-items" case, the browser displays the first block item as the list item and the remaining block items as "list-continuation" paragraphs as in this screenshot:

However, there is no ListContinuation Element-type, so it's not immediately clear how to handle this conundrum.

Conclusion
Not having a good ready answer to the problem I went with the simplest thing that could possibly work and that was to leave the behavior as it defaults.

We could noodle it more or wait for feedback where it seems to matter for someone and think more about how to handle this case in the context of a concrete example.

Btw, the old parser got this particular one right, but it was at the cost of squashing the entire contents of the <li> element into a single element, including when it contained a nested list: #3245

That's the behavior that would squash a long list of nested list-items into a single ListItem element, like this example: https://github.com/Unstructured-IO/unstructured/pull/3218/files?show-deleted-files=true&show-viewed-files=true&file-filters%5B%5D=#diff-78e86d31815f4af1fe1cddeafa40432b9f5331fcd0385696ca81e1636cbdabf8L127

Original HTML:

<ul id="subMenuNav"> <li> <div class="subMenuNavList section-link"> <a href="http://www.weather.gov/safetycampaign">Weather Safety</a> </div> <div class="sub"> <ul> <li><a href="https://www.weather.gov/safety/airquality">Air Quality</a></li> <li><a href="https://www.weather.gov/safety/beachhazards"> Beach Hazards</a></li> <li><a href="https://www.weather.gov/safety/cold">Cold</a></li> <li><a href="https://www.weather.gov/safety/coldwater">Cold Water</a></li> <li><a href="https://www.weather.gov/safety/drought">Drought</a></li> <li><a href="https://www.weather.gov/safety/flood">Floods</a></li> <li><a href="https://www.weather.gov/safety/fog">Fog</a></li> <li><a href="https://www.weather.gov/safety/heat">Heat</a></li> <li><a href="https://www.weather.gov/safety/hurricane">Hurricanes</a></li> <li><a href="https://www.weather.gov/safety/lightning">Lightning Safety</a></li> <li><a href="https://www.weather.gov/safety/ripcurrent"> Rip Currents</a></li> <li><a href="https://www.weather.gov/safety/safeboating"> Safe Boating</a></li> <li><a href="https://www.weather.gov/safety/space">Space Weather</a></li> <li><a href="https://www.weather.gov/safety/heat-uv">Sun (Ultraviolet Radiation)</a></li> <li><a href="https://www.weather.gov/safety/thunderstorm"> Thunderstorms & Tornadoes</a></li> <li><a href="https://www.weather.gov/safety/tornado">Tornado</a></li> <li><a href="https://www.weather.gov/safety/tsunami">Tsunami</a></li> <li><a href="https://www.weather.gov/safety/wildfire">Wildfire</a></li> <li><a href="https://www.weather.gov/safety/wind">Wind</a></li> <li><a href="https://www.weather.gov/safety/winter">Winter</a></li> </ul> </div> </li> ...

MthwRobinson · 2024-06-25T18:04:36Z

CHANGELOG.md

@@ -1,8 +1,9 @@
-## 0.14.8-dev3
+## 0.14.8-dev4


Let's bump to 0.15.0 when we release this since it's a major change.

yep, will do.

sebastianSRS12

Change the <br</br> for <p>

scanny temporarily deployed to ci June 17, 2024 04:30 — with GitHub Actions Inactive

scanny force-pushed the scanny/replace-html-parser branch from 105fa52 to 16ade6e Compare June 17, 2024 05:26

scanny temporarily deployed to ci June 17, 2024 05:28 — with GitHub Actions Inactive

scanny force-pushed the scanny/replace-html-parser branch from 16ade6e to df885d6 Compare June 17, 2024 07:22

scanny temporarily deployed to ci June 17, 2024 07:24 — with GitHub Actions Inactive

scanny had a problem deploying to ci June 17, 2024 07:24 — with GitHub Actions Failure

scanny force-pushed the scanny/replace-html-parser branch from df885d6 to 8e136e7 Compare June 17, 2024 07:26

scanny temporarily deployed to ci June 17, 2024 07:28 — with GitHub Actions Inactive

scanny force-pushed the scanny/replace-html-parser branch from 8e136e7 to fde32a2 Compare June 17, 2024 17:40

scanny temporarily deployed to ci June 17, 2024 17:41 — with GitHub Actions Inactive

scanny force-pushed the scanny/replace-html-parser branch from fde32a2 to 80f3808 Compare June 17, 2024 19:43

scanny temporarily deployed to ci June 17, 2024 19:45 — with GitHub Actions Inactive

scanny had a problem deploying to ci June 17, 2024 19:45 — with GitHub Actions Failure

scanny force-pushed the scanny/replace-html-parser branch from 80f3808 to fb6eb31 Compare June 17, 2024 19:45

scanny temporarily deployed to ci June 17, 2024 19:46 — with GitHub Actions Inactive

scanny temporarily deployed to ci June 17, 2024 20:13 — with GitHub Actions Inactive

scanny had a problem deploying to ci June 17, 2024 20:13 — with GitHub Actions Failure

scanny force-pushed the scanny/replace-html-parser branch from 1567c91 to fa5dba3 Compare June 17, 2024 21:09

scanny temporarily deployed to ci June 17, 2024 21:20 — with GitHub Actions Inactive

scanny force-pushed the scanny/replace-html-parser branch from fa5dba3 to 51d5519 Compare June 17, 2024 22:10

scanny temporarily deployed to ci June 17, 2024 22:12 — with GitHub Actions Inactive

scanny force-pushed the scanny/replace-html-parser branch 2 times, most recently from e2ca119 to 5c85c6d Compare June 17, 2024 22:39

scanny marked this pull request as ready for review June 22, 2024 03:41

scanny temporarily deployed to ci June 22, 2024 03:43 — with GitHub Actions Inactive

scanny had a problem deploying to ci June 22, 2024 03:43 — with GitHub Actions Failure

scanny temporarily deployed to ci June 22, 2024 03:43 — with GitHub Actions Inactive

scanny force-pushed the scanny/replace-html-parser branch from 26817cb to e3a3270 Compare June 22, 2024 04:00

scanny temporarily deployed to ci June 22, 2024 04:02 — with GitHub Actions Inactive

scanny had a problem deploying to ci June 22, 2024 04:02 — with GitHub Actions Failure

scanny temporarily deployed to ci June 22, 2024 04:02 — with GitHub Actions Inactive

scanny added 4 commits June 22, 2024 13:54

rfctr(html): replace parser

f3bfd5a

rfctr(html): adjust unit tests

19f43f8

Adjust unit tests that relied on behaviors changed by the new HTML parser.

rfctr(html): adjust ingest-test expected output

390a634

These differences have all been accounted for by bug-fixes and desired behavior changes of the new HTML parser.

chore: bump CHANGELOG + __version__

ce09754

scanny force-pushed the scanny/replace-html-parser branch from e3a3270 to ce09754 Compare June 22, 2024 20:56

scanny temporarily deployed to ci June 22, 2024 21:06 — with GitHub Actions Inactive

MthwRobinson reviewed Jun 25, 2024

View reviewed changes

scanny mentioned this pull request Jun 27, 2024

bug/quotes from markdown are stripped out #3309

Open

sebastianSRS12 reviewed Jun 27, 2024

View reviewed changes

scanny mentioned this pull request Jun 30, 2024

No 'category_depth' metadata returned for markdown partitioning? #2675

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rfctr(html): replace html parser #3218

rfctr(html): replace html parser #3218

scanny commented Jun 17, 2024 •

edited

Loading

MthwRobinson Jun 25, 2024

scanny Jun 25, 2024

MthwRobinson left a comment

MthwRobinson Jun 25, 2024

scanny Jun 25, 2024

MthwRobinson Jun 25, 2024

scanny Jun 25, 2024

MthwRobinson Jun 25, 2024

scanny Jun 25, 2024

MthwRobinson Jun 25, 2024

MthwRobinson Jun 25, 2024

scanny Jun 25, 2024

scanny Jun 25, 2024

MthwRobinson Jun 25, 2024

scanny Jun 25, 2024

sebastianSRS12 left a comment •

edited by scanny

Loading

		" <div>Get excited for our first annual family day!</div>\n"
		' <div>Best.<br clear=3D"all">\n'

rfctr(html): replace html parser #3218

Are you sure you want to change the base?

rfctr(html): replace html parser #3218

Conversation

scanny commented Jun 17, 2024 • edited Loading

BEHAVIOR DIFFERENCES

emphasized_text_tags encoding is changed:

<pre> text is preserved as it appears in the html

Whitespace normalization

link_start_indexes metadata is no longer captured. Rationale:

<br/> element is replaced with a single newline ("\n")

Empty h1..h6 elements are dropped.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MthwRobinson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sebastianSRS12 left a comment • edited by scanny Loading

Choose a reason for hiding this comment

scanny commented Jun 17, 2024 •

edited

Loading

`emphasized_text_tags` encoding is changed:

`<pre>` text is preserved as it appears in the html

`link_start_indexes` metadata is no longer captured. Rationale:

`<br/>` element is replaced with a single newline (`"\n"`)

Empty `h1..h6` elements are dropped.

sebastianSRS12 left a comment •

edited by scanny

Loading