bug(html): invisible links are reported in metadata #3255

Open
scanny opened this issue Jun 19, 2024 · 0 comments · May be fixed by #3218
scanny commented Jun 19, 2024

Summary
partition_html() adds an empty link_texts entry (and a matching link_urls entry) for an <a> element that has no text or only whitespace text. Such a link is not displayed or clickable in a browser, so it should not be reported in element metadata.

To Reproduce

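from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

# Single-quote the outer string so the double-quoted href attribute does not end it early.
html_text = '<p>Time is an illusion. <a href="http://seo.com"></a>Lunchtime doubly so.</p>'

elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))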

Expected:

[
  {
    "element_id": "5fe7349cfd2346fed5d638c216d7506e",
    "metadata": {
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "Time is an illusion. Lunchtime doubly so.",
    "type": "NarrativeText"
  }
]

Actual:

[
  {
    "element_id": "5fe7349cfd2346fed5d638c216d7506e",
    "metadata": {
      "filetype": "text/html",
      "languages": ["eng"],
      "link_texts": [""],
      "link_urls": ["http://seo.com"]
    },
    "text": "Time is an illusion. Lunchtime doubly so.",
    "type": "NarrativeText"
  }
]
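
Until the fix is available, output like the "Actual" JSON above could be cleaned up after the fact. The following is only a sketch of such a workaround, not part of the library: `drop_invisible_links` is a hypothetical helper, and it assumes `link_texts` and `link_urls` are parallel lists, as they appear to be in the output shown.

```python
def drop_invisible_links(metadata):
    """Return a copy of an element's metadata dict without blank-text links.

    Hypothetical helper; assumes link_texts[i] belongs to link_urls[i].
    """
    texts = metadata.get("link_texts") or []
    urls = metadata.get("link_urls") or []

    # Keep only the pairs whose link text has something other than whitespace.
    kept = [(t, u) for t, u in zip(texts, urls) if t and t.strip()]

    cleaned = {
        key: value
        for key, value in metadata.items()
        if key not in ("link_texts", "link_urls")
    }
    # Re-add the link fields only when at least one visible link remains.
    if kept:
        cleaned["link_texts"] = [t for t, _ in kept]
        cleaned["link_urls"] = [u for _, u in kept]
    return cleaned
```

Applied to the metadata in the "Actual" output, this would leave only the filetype and languages fields, matching the "Expected" output.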

Additional context
Fixed by #3218. Recorded here to explain ingest test output changes and to inform CHANGELOG.
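
For readers who want to see the intended rule in isolation, here is a minimal sketch using only the Python standard library. It is not the change made in #3218 and does not use partition_html(); `VisibleLinkCollector` and `visible_links` are hypothetical names chosen for illustration. It collects links from HTML and skips any <a> whose text is empty or whitespace-only.

```python
from html.parser import HTMLParser


class VisibleLinkCollector(HTMLParser):
    """Collect (text, url) pairs, skipping anchors with no visible text."""

    def __init__(self):
        super().__init__()
        self.links = []        # visible links found so far
        self._href = None      # href of the <a> currently open, if any
        self._text_parts = []  # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only accumulate text while inside an <a> that has an href.
        if self._href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text_parts).strip()
            if text:  # blank anchors are not rendered, so they are dropped
                self.links.append((text, self._href))
            self._href = None


def visible_links(html_text):
    """Return the list of (text, url) pairs for links a reader could see."""
    parser = VisibleLinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

With the HTML from the reproduction steps, `visible_links(html_text)` returns an empty list, because the only anchor has no text.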

@scanny scanny added bug Something isn't working html labels Jun 19, 2024
@scanny scanny self-assigned this Jun 19, 2024
@scanny scanny linked a pull request Jun 21, 2024 that will close this issue