Problems with Newline Characters in Attributes #177

Open
jaunruh opened this issue May 22, 2022 · 4 comments

Comments

jaunruh commented May 22, 2022

I am using xml-conduit to parse xml, adjust it and render it again.

I don't know if this is a bug or expected behaviour, but when parsing newlines encoded as &#xA; they are stored as newlines \n in memory. When rendered as xml they do not seem to be replaced by &#xA;.

So the following element goes from:

<someElement id="1" name="Picture 1" descr="A picture containing window, indoor&#xA;&#xA;Description automatically generated"/>

to:

<someElement id="1" name="Picture 1" descr="A picture containing window, indoor

Description automatically generated"/>

Is that expected behaviour? I have found the ParsingSetting: DecodeIllegalCharacters but the output type is just a single character. So I am not sure how exactly I would use this to prevent such behaviour. I also didn't find any RenderSettings that could be used.

Collaborator

k0ral commented May 22, 2022

I believe this is the expected behaviour.

Indeed, character references are parsed into Chars and the fact that they were originally represented as references is lost in the process. The rendering logic only cares about ensuring that the output XML is valid, and thus only encodes delimiter characters that already have an XML semantic.

In other words, the parsing logic is known to be lossy (as is the rendering logic), and in general, xml-conduit doesn't guarantee that roundtripping works (render ∘ parse ≠ identity).

To support your specific use case, I guess we could add a rsEncodeCharacters :: Set Char in RenderSettings to let users choose which characters should be hex-encoded.
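As a rough illustration of what such a setting could drive, here is a minimal sketch of attribute escaping with a user-supplied set of characters to hex-encode. It is not xml-conduit code: `escapeAttr` and `charRef` are hypothetical names, the sketch works on `String` rather than `Text`, and it depends only on base and containers. It may also be worth noting that under the XML 1.0 attribute-value normalization rules, a conforming parser reads a literal line break inside an attribute value as a space, while &#xA; is read back as a newline.

```haskell
import Data.Char (ord, toUpper)
import qualified Data.Set as Set
import Numeric (showHex)

-- | Hypothetical helper (not part of xml-conduit): escape an attribute value.
-- Characters in the given set are written as hexadecimal character
-- references; the usual XML delimiters are written as named entities.
escapeAttr :: Set.Set Char -> String -> String
escapeAttr encodeSet = concatMap escapeChar
  where
    escapeChar c
      | c `Set.member` encodeSet = charRef c
      | otherwise = case c of
          '&' -> "&amp;"
          '<' -> "&lt;"
          '>' -> "&gt;"
          '"' -> "&quot;"
          _   -> [c]

-- | Write a character as an uppercase hexadecimal character reference.
charRef :: Char -> String
charRef c = "&#x" ++ map toUpper (showHex (ord c) "") ++ ";"
```

With this sketch, `escapeAttr (Set.fromList "\n") "indoor\n\nDescription"` gives `indoor&#xA;&#xA;Description`, while an empty set leaves newlines untouched and only escapes the delimiters.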

Note: DecodeIllegalCharacters is only useful at parsing time, to map custom hexadecimal sequences to arbitrary Chars, and won't help in your case; see the original use case for this setting.

Author

jaunruh commented May 22, 2022

A rsEncodeCharacters :: Set Char option in RenderSettings would be much appreciated from my side.

Why do roundtrips not work? Does this have to do with the general complexity of XML? Are other XML libraries (maybe also written in other languages) roundtrip safe, or is roundtrip safety not feasible from an implementation point of view?

Collaborator

k0ral commented May 23, 2022

Why do roundtrips not work?

The following is just my 2 cents, feel free to challenge what I'm about to say :) .

The XML standard is such that for a given abstract piece of information, there are multiple valid XML documents to express it (e.g. newlines may, or may not, be hex-encoded). xml-conduit cares about preserving the semantics of the information encoded by an XML document, not so much about preserving the XML representation of that information.
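To make the "multiple representations, one value" point concrete, here is a toy decoder for hexadecimal character references. It is only a sketch under simplifying assumptions: `decodeHexRefs` is a hypothetical name, it handles only the &#x...; form, and it is not how xml-conduit parses.

```haskell
import Data.Char (chr, isHexDigit)
import Numeric (readHex)

-- | Hypothetical toy decoder: replace each well-formed hexadecimal character
-- reference (such as the one for U+000A) with the character it denotes.
-- Anything else is copied through unchanged.
decodeHexRefs :: String -> String
decodeHexRefs ('&' : '#' : 'x' : rest)
  | (digits, ';' : after) <- span isHexDigit rest
  , not (null digits)
  , length digits <= 6              -- keeps the value well inside Int range
  , [(code, "")] <- readHex digits
  , code <= 0x10FFFF                -- largest valid Unicode code point
  = chr code : decodeHexRefs after
decodeHexRefs (c : cs) = c : decodeHexRefs cs
decodeHexRefs [] = []
```

Both `decodeHexRefs "indoor&#xA;Description"` and `decodeHexRefs "indoor\nDescription"` produce the same in-memory string, so once decoding has happened there is nothing left to tell a renderer which spelling the input used.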

This relaxed contract allows simplifications/shortcuts to optimize performance, memory usage, and incidentally to have a more maintainable codebase that doesn't inherit all peculiarities of the XML standard (for the rendering part, at least).

If you care about enforcing a particular XML representation of your data, then I guess you're looking for an XML formatter, which xml-conduit is not :) .

Author

jaunruh commented Jul 3, 2022

@k0ral I have made a suggestion on fixing this using the ParseSettings instead of the RenderSettings. See: #178
