Preserving HTML entities #26

imalsogreg · 2017-02-13T20:05:36Z

parseHTML parses " " into a text node with the unicode code point for nonbreaking space: [TextNode "\160"]

Is it possible and safe to add a parser that leaves HTML entities like   as they are in text nodes? I have a project where xmlhtml is the first stage in a text pipeline spread across different languages/libraries/environments, where it would be more reliable to pass   all the way through than to count on the text encoding of the different components.

Preserving HTML entities could make more sense in contexts like heist as well, where the mental model is that heist is stitching together different html fragments. snapframework/heist#37 for instance.

The text was updated successfully, but these errors were encountered:

mightybyte · 2017-02-13T21:18:04Z

Hmmm, I'm not sure how this would work. TextNode exists specifically for parsed text, so if we want to support this use case I think we'd have to add another constructor UnparsedText or something like that. @cdsmith can you weigh in on this?

cdsmith · 2017-02-13T21:43:46Z

My opinion is that, as tempting as it may be to try to work around weird corner cases, if you're transmitting text with an unclear encoding between tools, that is already broken. This seems to be asking to make xmlhtml less correct, in the hopes that the incorrect result will make it through a broken pipeline safely, and then (again incorrectly) not be properly re-encoded when converting back to HTML at the other end? That seems like a clearly bad idea in the general case, however convenient it might be for this combination of parts.

The referenced issue, snapframework/heist#37, is somewhat more compelling. In that case, there are actually two valid options, since it's equally correct to emit either the entity or the plain character. So there, we're comparing practical problems with one (output sometimes fails with an incorrect HTML validator) versus the other (output is misparsed in attributes by incorrect HTML parsers). That's a harder choice.

cdsmith · 2017-02-13T21:51:35Z

Incidentally, I don't think it's a good idea to think of Heist as stitching together fragments of HTML-encoded text, as suggested. One of the more important jobs done by templating systems is to let higher-level code work with the semantic model (which in this case is not HTML-encoded) and handle the translation to encoded form correctly. That's part of how you avoid injection attacks and the like.

imalsogreg · 2017-02-13T22:04:48Z

@cdsmith I was worried when I wrote "stitching together" and mentioned the unclear encoding issues that these would become the focus of my issue here :) I agree with the sentiment that having the encoding straight is a better solution than working around corner cases.

Still, it really seems that render . parse should be id, or else there is simply no way to parse a fragment with   in it and receive an   out the other end (or is there?). Reframing the issue around that, what would the proper solution be? @mightybyte's UnparsedText node? Maybe an ISO-8859-1 constructor for the Encoding type? Or maybe a data HTMLEntity = HtmlEntity Text | UtfEntity Text to track how entities were parsed, so that we can render them back out the same way?

cdsmith · 2017-02-13T22:22:35Z

Yes, I would say that an ISO 8859-1 (or ASCII) constructor for the encoding type is likely the right answer. Implementing the HTML algorithm for sniffing document encoding would also be a positive change. Unfortunately, I don't have the ability to work on these at the moment.

It's possible to design a parser and renderer for which render . parse == id, but it would be a much more complex beast than this library intends to be. You'd need to annotate elements of the parse tree with all the choices that can be made in encoding that don't affect the result. There are a bunch, particularly in HTML! A few to get you started: comments, whitespace inside of tags, whitespace outside of the top-level elements, case of element and attribute names, implicit end tags that may be omitted, and a bunch of details of the doctype decl. If you go this way, it would become practically too painful to work with the node type as an algebraic data type any more, which would change the character of the library enough that you probably just want to write a new library instead.

mightybyte · 2017-02-13T22:44:20Z

I totally sympathize with this aversion to stitching together fragments of HTML-encoded text. I've resisted adding an escape hatch constructor for raw HTML for exactly this reason. But on the other hand, the people over at Position Dev created a completely new template system called larceny because they felt that Heist's lack of raw HTML support was too problematic for them. I take that as a pretty compelling argument that we should have some kind of support for this use case.

cdsmith · 2017-02-14T01:50:11Z

Fair enough, and it's entirely your call, of course! I haven't touched this library for years, and wouldn't want to block changes that users want, anyway. My contribution is just: let's make sure that really is the right answer to the problems people have, before making things more complex. I'd hate to lose the simplicity of the API, in exchange for something working partly by accident.

I do think implementing more encodings would be a great first step here. That was a great suggestion, and if the core requirement here is to be able to output entity references for non-ASCII characters, that will do the job.

For the larceny use case, sure, an UnparsedContent constructor in Node seems reasonable. But it wouldn't help, here. You wouldn't ever parse to UnparsedContent, right? It would just be for rendering generated content.

imalsogreg · 2017-02-14T16:38:19Z

Thanks for the tips! Here's a PR for discussion #27

mightybyte · 2017-03-14T02:51:27Z

Merged.

mightybyte closed this as completed Mar 14, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preserving HTML entities #26

Preserving HTML entities #26

imalsogreg commented Feb 13, 2017

mightybyte commented Feb 13, 2017

cdsmith commented Feb 13, 2017

cdsmith commented Feb 13, 2017 •

edited

Loading

imalsogreg commented Feb 13, 2017

cdsmith commented Feb 13, 2017 •

edited

Loading

mightybyte commented Feb 13, 2017 •

edited

Loading

cdsmith commented Feb 14, 2017

imalsogreg commented Feb 14, 2017

mightybyte commented Mar 14, 2017

Preserving HTML entities #26

Preserving HTML entities #26

Comments

imalsogreg commented Feb 13, 2017

mightybyte commented Feb 13, 2017

cdsmith commented Feb 13, 2017

cdsmith commented Feb 13, 2017 • edited Loading

imalsogreg commented Feb 13, 2017

cdsmith commented Feb 13, 2017 • edited Loading

mightybyte commented Feb 13, 2017 • edited Loading

cdsmith commented Feb 14, 2017

imalsogreg commented Feb 14, 2017

mightybyte commented Mar 14, 2017

cdsmith commented Feb 13, 2017 •

edited

Loading

cdsmith commented Feb 13, 2017 •

edited

Loading

mightybyte commented Feb 13, 2017 •

edited

Loading