-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserving HTML entities #26
Comments
Hmmm, I'm not sure how this would work. |
My opinion is that, as tempting as it may be to try to work around weird corner cases, if you're transmitting text with an unclear encoding between tools, that is already broken. This seems to be asking to make xmlhtml less correct, in the hopes that the incorrect result will make it through a broken pipeline safely, and then (again incorrectly) not be properly re-encoded when converting back to HTML at the other end? That seems like a clearly bad idea in the general case, however convenient it might be for this combination of parts. The referenced issue, snapframework/heist#37, is somewhat more compelling. In that case, there are actually two valid options, since it's equally correct to emit either the entity or the plain character. So there, we're comparing practical problems with one (output sometimes fails with an incorrect HTML validator) versus the other (output is misparsed in attributes by incorrect HTML parsers). That's a harder choice. |
Incidentally, I don't think it's a good idea to think of Heist as stitching together fragments of HTML-encoded text, as suggested. One of the more important jobs done by templating systems is to let higher-level code work with the semantic model (which in this case is not HTML-encoded) and handle the translation to encoded form correctly. That's part of how you avoid injection attacks and the like. |
@cdsmith I was worried when I wrote "stitching together" and mentioned the unclear encoding issues that these would become the focus of my issue here :) I agree with the sentiment that having the encoding straight is a better solution than working around corner cases. Still, it really seems that |
Yes, I would say that an ISO 8859-1 (or ASCII) constructor for the encoding type is likely the right answer. Implementing the HTML algorithm for sniffing document encoding would also be a positive change. Unfortunately, I don't have the ability to work on these at the moment. It's possible to design a parser and renderer for which |
I totally sympathize with this aversion to stitching together fragments of HTML-encoded text. I've resisted adding an escape hatch constructor for raw HTML for exactly this reason. But on the other hand, the people over at Position Dev created a completely new template system called larceny because they felt that Heist's lack of raw HTML support was too problematic for them. I take that as a pretty compelling argument that we should have some kind of support for this use case. |
Fair enough, and it's entirely your call, of course! I haven't touched this library for years, and wouldn't want to block changes that users want, anyway. My contribution is just: let's make sure that really is the right answer to the problems people have, before making things more complex. I'd hate to lose the simplicity of the API, in exchange for something working partly by accident. I do think implementing more encodings would be a great first step here. That was a great suggestion, and if the core requirement here is to be able to output entity references for non-ASCII characters, that will do the job. For the larceny use case, sure, an |
Thanks for the tips! Here's a PR for discussion #27 |
Merged. |
parseHTML
parses" "
into a text node with the unicode code point for nonbreaking space:[TextNode "\160"]
Is it possible and safe to add a parser that leaves HTML entities like
as they are in text nodes? I have a project where xmlhtml is the first stage in a text pipeline spread across different languages/libraries/environments, where it would be more reliable to pass
all the way through than to count on the text encoding of the different components.Preserving HTML entities could make more sense in contexts like
heist
as well, where the mental model is thatheist
is stitching together different html fragments. snapframework/heist#37 for instance.The text was updated successfully, but these errors were encountered: