Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retain authored HTML for empty elements #95

Open
kartikprabhu opened this issue Mar 5, 2018 · 8 comments
Open

retain authored HTML for empty elements #95

kartikprabhu opened this issue Mar 5, 2018 · 8 comments

Comments

@kartikprabhu
Copy link
Member

Currently mf2py due to using BeautifulSoup closes empty HTML tags. e.g. <br> gets converted to <br/> and <hr> gets converted into <hr/>. This makes the e-content[html] different from the authored one.

This does not seem to be an issue in actual use but will be for any tests. So I am documenting this here.

Details

html5lib by default does not do this see: https://github.com/html5lib/html5lib-python/blob/5e6b61b4630165dd4765fff41d0f855534d5e2fe/html5lib/serializer.py#L114

The relevant lines in BeautifulSoup which explicitly do this are https://github.com/waylan/beautifulsoup/blob/480367ce8c8a4d1ada3012a95f0b5c2cce4cf497/bs4/element.py#L1106-L1107 (Note that this is not the canonial source for BS4)

@Zegnat
Copy link
Member

Zegnat commented Mar 8, 2018

Note that “retain” is actually the wrong behaviour. You should normalise to having no closing trailing solidus. The microformats parsing specification for e- says to fill the html property using HTML’s serialising algorithm, which never includes it.

@sknebel
Copy link
Member

sknebel commented Oct 1, 2018

BS4 recently added an "html5" formatter that apparently does this:
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/NEWS.txt#L14

I propose switching output to that, since we do not make any other promise about the HTML output AFAIK?

sknebel added a commit to sknebel-forks/mf2py that referenced this issue Oct 2, 2018
Beautifulsoup 4.6.2 introduced the ability to control the slashes in void elements through formatters.
This adds a formatter that does this, but otherwise only does minimal encoding (as the previous, default, formatter did)
@sknebel
Copy link
Member

sknebel commented Oct 5, 2018

@snarfed While investigating another issue I discovered today that granary appears to rely on mf2py to produce somewhat XHTML-compatible output, which this would break. We could expose the HTML formatter in the API to allow granary to force the old behavior?

EDIT: alternatively, some variation of #97 that allows granary to force the serialization downstream, which might give it more options to generate proper XML.

@snarfed
Copy link
Member

snarfed commented Oct 8, 2018

thanks for the heads up! i'm not entirely sure how this would affect granary yet, but i can always update it. let me know when you have an mf2py PR you want me to test!

@sknebel
Copy link
Member

sknebel commented Oct 8, 2018

#136 is said PR. Granary produces Atom feeds with the content declared to be XHTML and unless I missed something taken straight from mf2 parsing when turning an mf2 html feed into Atom. Not closing void elements isn't allowed in XHTML as far as I know, and would make those feeds invalid?

@snarfed
Copy link
Member

snarfed commented Oct 8, 2018

thanks! yeah, i got that part, i just don't remember the exact transformation steps in granary. no matter, i'll try it and see.

@snarfed
Copy link
Member

snarfed commented Oct 8, 2018

@sknebel you're right. thanks for thinking of granary! it does pass the HTML content pretty much straight through to Atom.

ideally yes, i'd love a flag or exposed HTML formatter in mf2py so i can control this in granary. i tested just now though, and if i change the Atom to <content type="html">, and fully escape it, HTML5 content validates ok and renders ok in readers. so i can handle whatever you all end up choosing.

@kevinmarks
Copy link
Member

If you use html5lib you can reserialize with options turned on - I think use_trailing_solidus is the one to autoclose null elements; I don't know that it can guarantee full XHTML compliance though, which is what you need to inline them in Atom.
Reparsing and serializing each html chunk might not be optimal though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants