Handling links for headers #30

Open
Silverblix opened this issue Jan 12, 2021 · 3 comments

Comments

@Silverblix

Silverblix commented Jan 12, 2021

The following code....

from markdownify import markdownify, ATX

clean_html = '''
<a href="https://www.google.com">
<h4>Google Search</h4>
</a>
'''

md_content = markdownify(clean_html, heading_style=ATX)

print(md_content)

generates...

  [#### Google Search](https://www.google.com) 

Shouldn't it generate the following Markdown instead?

#### [Google Search](https://www.google.com)

Also, the output starts with two unnecessary spaces, which confuses otherwise reliable Markdown readers.

Removing the unnecessary leading whitespace before ATX-style headers and handling the <a> tag better would produce much higher-quality Markdown output for basic HTML tag structures.
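Until the converter handles this case, one possible stopgap is to post-process the generated Markdown. The sketch below is only an illustration: `hoist_heading_markers` is a hypothetical helper, not part of markdownify, and it assumes the heading and its link end up on a single line, as in the output quoted above.

```python
import re

# Matches a whole line that is a link whose text begins with ATX heading
# markers, e.g. "  [#### Title](url) ". Leading/trailing blanks are consumed.
_HEADING_IN_LINK = re.compile(
    r"^[ \t]*\[(#{1,6})[ \t]+(.+?)\]\(([^)\s]+)\)[ \t]*$",
    re.MULTILINE,
)


def hoist_heading_markers(markdown: str) -> str:
    """Rewrite '[#### Title](url)' lines as '#### [Title](url)'.

    Hypothetical helper for illustration; lines that do not match the
    pattern are returned unchanged.
    """
    return _HEADING_IN_LINK.sub(r"\1 [\2](\3)", markdown)


# The string below is the output quoted in this issue, including the two
# leading spaces and the trailing space.
raw = "  [#### Google Search](https://www.google.com) \n"
print(hoist_heading_markers(raw))
```

This only patches the symptom in the output text; it does not change how the HTML tree is converted.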

Thanks!

A reference I find useful: Markdown Reference Guide

@AlexVonB
Collaborator

Hi!

The whitespace issue is handled in #17, which will be closed soon. The other issue is interesting. Headings are block elements and links are inline. It is best practice not to embed block elements in inline elements, so anchors should not contain headings. I can look into how simple it would be to produce the output the other way around, but I cannot promise a good solution.

Best! Alex

@Silverblix
Author

Hi! Thank you for the quick reply.

Which release will include the fix for the whitespace issue (#17)?

My observation is that web browsers properly render headers (h1, h2, etc.) wrapped in links (<a><h1>My Header</h1></a>). I am also finding modern websites that wrap headers in links that are not in-page anchors (#...).

@AlexVonB
Collaborator

Hi, the fix for #17 was released in 0.6.3.

Regarding the a/h1 issue: we would have to do what browsers do. They read the HTML and apply a multitude of error corrections so that non-standard HTML gets rendered to the best of their abilities. The parser used in this project tries to do this too, but it seems that this exact case is not addressed. So we would either have to change the DOM tree for all inline elements that contain a block element, or hardcode an exception for a/h1.
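To make the "hardcoded exception" option concrete, here is a minimal sketch of the tree rewrite, under stated assumptions. It uses the standard library's `xml.etree.ElementTree` rather than the parser this project uses, so it only works on well-formed, XML-like fragments, and `hoist_headings_out_of_links` is a hypothetical name. It swaps a link that wraps exactly one heading into a heading that wraps the link.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def hoist_headings_out_of_links(fragment: str) -> str:
    """Turn <a><hN>text</hN></a> into <hN><a>text</a></hN>.

    Hypothetical pre-processing step; assumes `fragment` is well-formed.
    """
    # Wrap in a dummy root so fragments with several top-level nodes parse.
    root = ET.fromstring(f"<root>{fragment}</root>")
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            # Only handle links whose single child element is a heading.
            if child.tag != "a" or len(child) != 1 or child[0].tag not in HEADINGS:
                continue
            heading = child[0]
            # Leave links alone if they also carry text outside the heading.
            if (child.text or "").strip() or (heading.tail or "").strip():
                continue
            new_heading = ET.Element(heading.tag, heading.attrib)
            new_link = ET.SubElement(new_heading, "a", child.attrib)
            new_link.text = heading.text
            new_link.extend(list(heading))  # keep nested inline children
            new_heading.tail = child.tail
            parent[index] = new_heading
    # Serialize the children of the dummy root, not the root itself.
    return (root.text or "") + "".join(
        ET.tostring(node, encoding="unicode") for node in root
    )


html = '<a href="https://www.google.com"><h4>Google Search</h4></a>'
print(hoist_headings_out_of_links(html))
```

The rewritten markup could then be passed to the converter as usual. Generalizing this to every inline element that contains a block element would need a broader rule than the single a/heading check shown here.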

2 participants