Image inside SPAN is discarded #180

Open
sglebs opened this issue Jan 23, 2025 · 3 comments

Comments

@sglebs

sglebs commented Jan 23, 2025

If you run it on the HTML from https://softplan7189029973484399.freshdesk.com/support/solutions/articles/153000199818 you will see that the image in section "1-" and "2-" are discarded. Not all images are discarded. The common factor is the usage of SPAN. When P is used, it works.

@chrispy-snps
Collaborator

@sglebs - this page has invalid HTML: <h3> tags are being used as content containers that wrap entire sections. (You can see this with your browser's "Inspect" feature.) Because Markdownify flattens the contents of heading elements to plain text, images inside those <h3> tags are lost.

Markdownify requires reasonably correct HTML to function. Browsers use very complex heuristics to handle bad HTML, and it is computationally impractical for Markdownify to replicate them.

The HTML parser that Markdownify relies on can tolerate some malformed HTML syntax and structure, but <h3> tags used as containers is not one of the cases it handles.

As a simple brute-force workaround, the following code unwraps all <h3> tags and allows their contents to be rendered normally:

import bs4
import markdownify
import requests

# fetch the HTML (note: verify=False disables TLS certificate verification)
url = "https://softplan7189029973484399.freshdesk.com/support/solutions/articles/153000199818"
html = requests.get(url, verify=False).text

# read HTML into Beautiful Soup
soup = bs4.BeautifulSoup(html, "lxml")

# unwrap all <h3> tags
for heading in list(soup.find_all("h3")):
    heading.unwrap()

# convert to Markdown
print(markdownify.MarkdownConverter().convert_soup(soup))

@sglebs
Author

sglebs commented Feb 3, 2025

@chrispy-snps Yes. This was content from a Freshdesk knowledge base. It is amazing that the tool allows for that kind of bad markup.

Thanks for sharing the tip on the unwrap. I guess one would need to do it for h1, h2, h3, h4, h5, h6... Anything else comes to mind?

@chrispy-snps
Collaborator

@sglebs - yes, I was pretty horrified to see that markup. :)

I looked at a few related articles on that site. All of them used explicit font sizes and styles for their headings; none used real HTML heading tags. If you explore more articles and find other heading levels being misused as containers, pass the full list of heading tags to find_all(), like this:

find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

You can also use regex patterns, like this:

find_all(re.compile(r"^h\d$"))
