bug(html): distinct paragraphs within <li> are squashed into single element #3245

scanny · 2024-06-18T23:53:48Z

Summary
Block items nested within an <li> element are squashed into a single ListItem element. Also, formatting whitespace is not normalized in the resulting text.

To Reproduce

from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

html_text = """
<ul>
  <li>
    <p>One of the <b>things</b> Ford Prefect had always found.</p>
    Hardest to <i>understand</i> about humans was.
    <p>Their habit of continually <b>stating</b> and <b>repeating</b> the.</p>
    very <i>very</i> obvious.
  </li>
</ul>
"""

elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))

Expected:

[
  {
    "element_id": "a0d3c85c1ac52e3097090f947dd0ba4f",
    "metadata": {
      "emphasized_text_contents": ["things"],
      "emphasized_text_tags": ["b"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "One of the things Ford Prefect had always found.",
    "type": "NarrativeText"
  },
  {
    "element_id": "18da4b100dbb92e55b91f35fc27aa23c",
    "metadata": {
      "emphasized_text_contents": ["understand"],
      "emphasized_text_tags": ["i"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "Hardest to understand about humans was.",
    "type": "NarrativeText"
  },
  {
    "element_id": "5bcd5901935365c374f3170479beffdf",
    "metadata": {
      "emphasized_text_contents": [
        "stating",
        "repeating"
      ],
      "emphasized_text_tags": ["b", "b"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "Their habit of continually stating and repeating the.",
    "type": "NarrativeText"
  },
  {
    "element_id": "90c24bc473cb71ec531c5543af409270",
    "metadata": {
      "category_depth": 0,
      "emphasized_text_contents": ["very"],
      "emphasized_text_tags": ["i"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "very very obvious.",
    "type": "Title"
  }
]

Actual:

[
  {
    "element_id": "cb8c2c5a83e9e6aa308d078d4510205b",
    "metadata": {
      "category_depth": 1,
      "emphasized_text_contents": [
        "things",
        "understand",
        "stating",
        "repeating",
        "very"
      ],
      "emphasized_text_tags": [
        "b",
        "i",
        "b",
        "b",
        "i"
      ],
      "filetype": "text/html",
      "languages": [
        "eng"
      ]
    },
    "text": "One of the things Ford Prefect had always found.\n    Hardest to understand about humans was.\n    Their habit of continually stating and repeating the.\n    very very obvious.",
    "type": "ListItem"
  }
]

Additional context
Fixed by #3218. Recorded here to explain ingest test output changes and to inform CHANGELOG.
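
To make the expected element boundaries concrete, here is a minimal sketch using only the standard library's `html.parser`. It is not the `lxml`-based parser from #3218. `BLOCK_TAGS`, `BlockSplitter`, and `split_blocks` are hypothetical names chosen for illustration, and the block-tag set is deliberately abbreviated. The idea it shows is that every block-level start or end tag closes the run of inline text collected so far, so loose text sitting between two `<p>` elements inside an `<li>` becomes its own element instead of being merged with its neighbors.

```python
from html.parser import HTMLParser

# Assumption: an abbreviated set of block-level tags, enough for this example.
# The HTML Standard defines many more.
BLOCK_TAGS = {"ul", "ol", "li", "p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockSplitter(HTMLParser):
    """Collect one text string per run of inline content between block boundaries."""

    def __init__(self):
        super().__init__()
        self.texts = []  # finished element texts, in document order
        self._buf = []   # text fragments of the run currently being collected

    def _flush(self):
        # Collapse whitespace runs to one space and strip both ends.
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if text:
            self.texts.append(text)

    def handle_starttag(self, tag, attrs):
        # A block-level start tag ends whatever inline run preceded it.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        # A block-level end tag ends the inline run inside that block.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <b> and <i> are ignored, so their text joins the run.
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def split_blocks(html_text):
    """Return the whitespace-normalized text of each block-delimited run."""
    parser = BlockSplitter()
    parser.feed(html_text)
    parser.close()
    return parser.texts
```

Run against the `html_text` from the reproduction above, `split_blocks(html_text)` returns the four `text` values listed under Expected. It does not attempt element classification (`NarrativeText`, `Title`, `ListItem`) or any metadata.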


**Summary**

Replace the legacy HTML parser with a recursive version that captures all content and provides flexibility to add new metadata. It's also substantially faster, although that's just a happy side effect.

**Additional Context**

The prior HTML parsing algorithm that makes up the core of HTML partitioning was buggy and very difficult to reason about because it did not conform to the inherently recursive structure of HTML. The new version retains `lxml` as the performant and reliable base library but uses `lxml`'s custom element classes to efficiently classify HTML elements by their behaviors (primarily block-item and inline (phrasing)) and give those elements the desired partitioning behaviors.

This solves a host of existing problems with content being skipped and elements (paragraphs) being divided improperly, and it also provides a clear domain model for reasoning about the parser's behavior and reliably adjusting it to suit our existing and future purposes.

The parser's operation is recursive, closely modeling the recursive structure of HTML itself. Its behaviors are based on the HTML Standard and reliably produce proper and explainable results even for novel cases.

Fixes #2325
Fixes #2562
Fixes #2675
Fixes #3168
Fixes #3227
Fixes #3228
Fixes #3230
Fixes #3237
Fixes #3245
Fixes #3247
Fixes #3255
Fixes #3309

### BEHAVIOR DIFFERENCES

#### `emphasized_text_tags` encoding is changed:

- `<strong>` is encoded as `"b"` rather than `"strong"`.
- `<em>` is encoded as `"i"` rather than `"em"`.
- `<span>` is no longer recorded in `emphasized_text_tags` (because without the CSS we can't tell whether it's used for emphasis or, if so, what kind).
- nested emphasis (e.g. bold+italic) is encoded as multiple characters (`"bi"`).
- `emphasized_text_contents` is broken on emphasis-change boundaries, like:

  ```html
  foo <b>bar <i>baz</i> bada</b> bing
  ```

  produces:

  ```json
  {
    "emphasized_text_contents": ["bar", "baz", "bada"],
    "emphasized_text_tags": ["b", "bi", "b"]
  }
  ```

  whereas previously it would have produced:

  ```json
  {
    "emphasized_text_contents": ["bar baz bada", "baz"],
    "emphasized_text_tags": ["b", "i"]
  }
  ```

#### `<pre>` text is preserved as it appears in the html

Except that a leading newline is removed if present (it has to be in position 0 of the text). Also, a trailing newline is stripped, but only if it appears in the very last position (`[-1]`) of the `<pre>` text. The old parser stripped all leading and trailing whitespace. The result is that:

```html
<pre>
foo
bar
baz
</pre>
```

parses to `"foo\nbar\nbaz"`, which is the same result produced for:

```html
<pre>foo
bar
baz</pre>
```

This equivalence is the same behavior exhibited by a browser, which is why we did the extra work to make it this way.

#### Whitespace normalization

Leading and trailing whitespace are removed from element text, just as they are removed in the browser. Runs of whitespace within the element text are reduced to a single space character (like in the browser). Note this means that `\t`, `\n`, and `&nbsp;` are replaced with a regular space character. All text derived from elements is whitespace-normalized except the text within a `<pre>` tag. Any leading or trailing newline is trimmed from `<pre>` element text; all other whitespace is preserved just as it appeared in the HTML source.

#### `link_start_indexes` metadata is no longer captured.

Rationale:

- It was frequently wrong, often `-1`.
- It was deprecated but then added back in a community PR.
- Maintaining it across any possible downstream transformations (e.g. chunking) would be expensive and almost certainly lead to wrong values as distant code evolves.
- It is complex to compute and recompute when whitespace is normalized, adding substantial complexity to the code and reducing readability and maintainability.

#### `<br/>` element is replaced with a single newline (`"\n"`)

That newline is usually replaced with a space in `Element.text` when the text is normalized. The newline is preserved within a `<pre>` element.

- Related: _No paragraph-break on `<br/>`_

#### Empty `h1..h6` elements are dropped.

HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a `Title` element) when they contain no text or contain only whitespace.

---------

Co-authored-by: scanny
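
The emphasis-boundary rule described above can be sketched with the standard library's `html.parser`. This is an illustration under stated assumptions, not the `lxml`-based code from the PR: `EMPHASIS_CODES`, `EmphasisRuns`, and `emphasis_runs` are hypothetical names, and ordering the codes alphabetically (so bold+italic is always `"bi"`) is an assumption about how nested codes are combined. Each text run that has at least one emphasis tag open is recorded separately, together with the combined code of every emphasis tag open at that point.

```python
from html.parser import HTMLParser

# Assumption: only these four tags count as emphasis; <span> is ignored.
EMPHASIS_CODES = {"b": "b", "strong": "b", "i": "i", "em": "i"}


class EmphasisRuns(HTMLParser):
    """Record each emphasized text run with the codes of all open emphasis tags."""

    def __init__(self):
        super().__init__()
        self.contents = []  # emphasized text, one entry per run
        self.tags = []      # matching codes, e.g. "b", "i", "bi"
        self._open = []     # codes of currently open emphasis tags

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_CODES:
            self._open.append(EMPHASIS_CODES[tag])

    def handle_endtag(self, tag):
        code = EMPHASIS_CODES.get(tag)
        if code is None:
            return
        # Close the innermost open tag that carries this code, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i] == code:
                del self._open[i]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # whitespace-normalize the run
        if text and self._open:
            self.contents.append(text)
            # De-duplicate and sort so nested bold+italic always reads "bi".
            self.tags.append("".join(sorted(set(self._open))))


def emphasis_runs(html_text):
    """Return (emphasized_text_contents, emphasized_text_tags) for html_text."""
    parser = EmphasisRuns()
    parser.feed(html_text)
    parser.close()
    return parser.contents, parser.tags
```

For the example fragment `foo <b>bar <i>baz</i> bada</b> bing`, this returns `(["bar", "baz", "bada"], ["b", "bi", "b"])`, matching the new encoding shown above. The run changes every time an emphasis tag opens or closes, which is why `"bar baz bada"` is no longer reported as a single entry.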

scanny added the bug and html labels Jun 18, 2024

scanny self-assigned this Jun 18, 2024

scanny mentioned this issue Jun 21, 2024

rfctr(html): replace html parser #3218

Merged

scanny closed this as completed in #3218 Jul 11, 2024
