Image inside SPAN is discarded #180
Comments
@sglebs - this page has invalid HTML. Markdownify requires reasonably correct HTML to function. Browsers have very complex heuristics to handle bad HTML, and it is computationally impractical for Markdownify to replicate them. The HTML parser used to parse the HTML can handle some bad HTML syntaxes/structures, but not all of them. As a simple brute-force workaround, the following code unwraps all <h3> tags:
import bs4
import markdownify
import requests
# get HTML
url = "https://softplan7189029973484399.freshdesk.com/support/solutions/articles/153000199818"
html = requests.get(url, verify=False).text
# read HTML into Beautiful Soup
soup = bs4.BeautifulSoup(html, "lxml")
# unwrap all <h3> tags
for heading in list(soup.find_all("h3")):
    heading.unwrap()
# convert to Markdown
print(markdownify.MarkdownConverter().convert_soup(soup))
@chrispy-snps Yes. This was content from a Freshdesk knowledge base. It is amazing that the tool allows that kind of bad markup. Thanks for sharing the tip on unwrap. I guess one would need to do it for h1, h2, h3, h4, h5, h6... Does anything else come to mind?
@sglebs - yes, I was pretty horrified to see that markup. :) I looked at a few related articles on that site. All seemed to use explicit font size and style for headings; none used real HTML heading tags. If you explore more articles and find headings being used as containers instead of headings, then specify that list of headings to find_all():
find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
You can also use regex patterns, like this:
find_all(re.compile(r"^h\d$"))
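Putting the two suggestions together, a minimal sketch of the workaround extended to every heading level might look like the following. The helper name unwrap_headings, the HEADING_TAG pattern, and the sample HTML are made up for illustration and do not come from the thread. The pattern is slightly stricter than ^h\d$ so that it only matches h1 through h6, and the built-in "html.parser" is used here instead of "lxml" only to avoid an extra dependency.

```python
import re

import bs4

# Hypothetical pattern: matches the tag names h1 through h6 only.
HEADING_TAG = re.compile(r"^h[1-6]$")


def unwrap_headings(html: str, parser: str = "html.parser") -> bs4.BeautifulSoup:
    """Parse html and unwrap every <h1>..<h6> tag, keeping its children."""
    soup = bs4.BeautifulSoup(html, parser)
    # Copy the matches into a list first, because unwrap() mutates the tree.
    for heading in list(soup.find_all(HEADING_TAG)):
        heading.unwrap()
    return soup


# Made-up sample: a heading used as a container around a span and an image.
sample = '<h3><span><img src="a.png" alt="shot"></span></h3><p>Body text</p>'
soup = unwrap_headings(sample)
```

The returned soup can then be passed to markdownify.MarkdownConverter().convert_soup(soup), as in the snippet earlier in the thread. Note that this removes all headings from the output, so it only makes sense for pages where heading tags are not used as real headings.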
If you run it on the HTML from https://softplan7189029973484399.freshdesk.com/support/solutions/articles/153000199818 you will see that the images in sections "1-" and "2-" are discarded. Not all images are discarded. The common factor is the use of SPAN: when the image is inside a SPAN it is dropped, but when P is used, it works.