bug(html): empty <li> element produces ListItem with no text #3237

Open
scanny opened this issue Jun 18, 2024 · 0 comments · May be fixed by #3218
scanny commented Jun 18, 2024

Summary
partition_html() produces a ListItem element with no text for an empty <li> element or one that contains only whitespace.
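
To make the triggering condition concrete, the sketch below uses only the standard library's html.parser to flag <li> elements whose text is empty or whitespace-only. It is an illustration of the condition, not how unstructured detects or handles it; the class name `BlankListItemFinder` and the attribute `blank_flags` are hypothetical.

```python
from html.parser import HTMLParser


class BlankListItemFinder(HTMLParser):
    """Record, for each <li>, whether its text is empty or whitespace-only."""

    def __init__(self):
        super().__init__()
        self._open_items = []  # one list of text fragments per currently open <li>
        self.blank_flags = []  # one bool per closed <li>, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._open_items.append([])

    def handle_data(self, data):
        # Text belongs to every <li> that is currently open (handles nesting).
        for fragments in self._open_items:
            fragments.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._open_items:
            text = "".join(self._open_items.pop())
            self.blank_flags.append(not text.strip())


finder = BlankListItemFinder()
finder.feed("<ul><li></li><li>  \n  \t  </li><li>item</li></ul>")
finder.close()
print(finder.blank_flags)  # [True, True, False]
```

Both kinds of <li> named in the summary are flagged as blank, while an item with real text is not.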

To Reproduce

from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

html_text = """
<ul>
  <li></li>
  <li>  \n  \t  </li>
</ul>
"""

elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))

Expected:

[]

Actual:

[
  {
    "element_id": "5336294a19f32ff03ef80066fbc3e0f7",
    "metadata": {
      "category_depth": 1,
      "filetype": "text/html"
    },
    "text": "",
    "type": "ListItem"
  },
  {
    "element_id": "c91476816a43e6f9216a68b58d92076a",
    "metadata": {
      "category_depth": 1,
      "filetype": "text/html"
    },
    "text": "",
    "type": "ListItem"
  }
]
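
For versions without the fix, one possible workaround is to drop blank items after partitioning. The sketch below is not part of unstructured: `drop_blank_elements` is a hypothetical helper that works on the element dicts obtained by parsing the JSON output shown above, and it assumes an element counts as blank when its "text" is missing, empty, or whitespace-only.

```python
import json


def drop_blank_elements(elements):
    """Return only the element dicts whose "text" has non-whitespace content."""
    return [e for e in elements if (e.get("text") or "").strip()]


# The "Actual" output from this report, parsed back into dicts.
actual = json.loads("""
[
  {"element_id": "5336294a19f32ff03ef80066fbc3e0f7",
   "metadata": {"category_depth": 1, "filetype": "text/html"},
   "text": "", "type": "ListItem"},
  {"element_id": "c91476816a43e6f9216a68b58d92076a",
   "metadata": {"category_depth": 1, "filetype": "text/html"},
   "text": "", "type": "ListItem"}
]
""")

print(drop_blank_elements(actual))  # []
```

Filtering the reported output this way yields the empty list given under "Expected".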

Additional context
Fixed by #3218. Recorded here to explain ingest test output changes and to inform CHANGELOG.

scanny added the bug and html labels Jun 18, 2024
scanny self-assigned this Jun 18, 2024
scanny linked a pull request Jun 21, 2024 that will close this issue