Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserving HTML entities #26

Closed
imalsogreg opened this issue Feb 13, 2017 · 9 comments
Closed

Preserving HTML entities #26

imalsogreg opened this issue Feb 13, 2017 · 9 comments

Comments

@imalsogreg
Copy link
Contributor

parseHTML parses " " into a text node with the unicode code point for nonbreaking space: [TextNode "\160"]

Is it possible and safe to add a parser that leaves HTML entities like   as they are in text nodes? I have a project where xmlhtml is the first stage in a text pipeline spread across different languages/libraries/environments, where it would be more reliable to pass   all the way through than to count on the text encoding of the different components.

Preserving HTML entities could make more sense in contexts like heist as well, where the mental model is that heist is stitching together different html fragments. snapframework/heist#37 for instance.

@mightybyte
Copy link
Member

Hmmm, I'm not sure how this would work. TextNode exists specifically for parsed text, so if we want to support this use case I think we'd have to add another constructor UnparsedText or something like that. @cdsmith can you weigh in on this?

@cdsmith
Copy link
Member

cdsmith commented Feb 13, 2017

My opinion is that, as tempting as it may be to try to work around weird corner cases, if you're transmitting text with an unclear encoding between tools, that is already broken. This seems to be asking to make xmlhtml less correct, in the hopes that the incorrect result will make it through a broken pipeline safely, and then (again incorrectly) not be properly re-encoded when converting back to HTML at the other end? That seems like a clearly bad idea in the general case, however convenient it might be for this combination of parts.

The referenced issue, snapframework/heist#37, is somewhat more compelling. In that case, there are actually two valid options, since it's equally correct to emit either the entity or the plain character. So there, we're comparing practical problems with one (output sometimes fails with an incorrect HTML validator) versus the other (output is misparsed in attributes by incorrect HTML parsers). That's a harder choice.

@cdsmith
Copy link
Member

cdsmith commented Feb 13, 2017

Incidentally, I don't think it's a good idea to think of Heist as stitching together fragments of HTML-encoded text, as suggested. One of the more important jobs done by templating systems is to let higher-level code work with the semantic model (which in this case is not HTML-encoded) and handle the translation to encoded form correctly. That's part of how you avoid injection attacks and the like.

@imalsogreg
Copy link
Contributor Author

@cdsmith I was worried when I wrote "stitching together" and mentioned the unclear encoding issues that these would become the focus of my issue here :) I agree with the sentiment that having the encoding straight is a better solution than working around corner cases.

Still, it really seems that render . parse should be id, or else there is simply no way to parse a fragment with   in it and receive an   out the other end (or is there?). Reframing the issue around that, what would the proper solution be? @mightybyte's UnparsedText node? Maybe an ISO-8859-1 constructor for the Encoding type? Or maybe a data HTMLEntity = HtmlEntity Text | UtfEntity Text to track how entities were parsed, so that we can render them back out the same way?

@cdsmith
Copy link
Member

cdsmith commented Feb 13, 2017

Yes, I would say that an ISO 8859-1 (or ASCII) constructor for the encoding type is likely the right answer. Implementing the HTML algorithm for sniffing document encoding would also be a positive change. Unfortunately, I don't have the ability to work on these at the moment.

It's possible to design a parser and renderer for which render . parse == id, but it would be a much more complex beast than this library intends to be. You'd need to annotate elements of the parse tree with all the choices that can be made in encoding that don't affect the result. There are a bunch, particularly in HTML! A few to get you started: comments, whitespace inside of tags, whitespace outside of the top-level elements, case of element and attribute names, implicit end tags that may be omitted, and a bunch of details of the doctype decl. If you go this way, it would become practically too painful to work with the node type as an algebraic data type any more, which would change the character of the library enough that you probably just want to write a new library instead.

@mightybyte
Copy link
Member

mightybyte commented Feb 13, 2017

I totally sympathize with this aversion to stitching together fragments of HTML-encoded text. I've resisted adding an escape hatch constructor for raw HTML for exactly this reason. But on the other hand, the people over at Position Dev created a completely new template system called larceny because they felt that Heist's lack of raw HTML support was too problematic for them. I take that as a pretty compelling argument that we should have some kind of support for this use case.

@cdsmith
Copy link
Member

cdsmith commented Feb 14, 2017

Fair enough, and it's entirely your call, of course! I haven't touched this library for years, and wouldn't want to block changes that users want, anyway. My contribution is just: let's make sure that really is the right answer to the problems people have, before making things more complex. I'd hate to lose the simplicity of the API, in exchange for something working partly by accident.

I do think implementing more encodings would be a great first step here. That was a great suggestion, and if the core requirement here is to be able to output entity references for non-ASCII characters, that will do the job.

For the larceny use case, sure, an UnparsedContent constructor in Node seems reasonable. But it wouldn't help, here. You wouldn't ever parse to UnparsedContent, right? It would just be for rendering generated content.

@imalsogreg
Copy link
Contributor Author

Thanks for the tips! Here's a PR for discussion #27

@mightybyte
Copy link
Member

Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants