Provide option to canonicalize the internal subset #6
Comments
Thanks for the suggestion, @dwcramer - this should be perfectly possible, and I'll build it into the implementation of #2. An extension of this would be to report out the internal subset itself as XML, in the manner of https://github.com/AndrewSales/dtd2xml. For example, if I run that tool on this document:
it produces (assuming
which may be useful information to have as XML per se, or a downstream process could format these to suit and supply the input envisaged in #5.
That would be wonderful! I think you'd want an
Note that entities can contain other unexpanded entities as well as XML. I feel like there will be some subtle issues to consider with nested entities, namespaces, and whitespace.
See also https://en.wikipedia.org/wiki/Billion_laughs_attack
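To make the nested-entity concern concrete, here is a rough Python sketch (not part of doctype-tool; the names `ENTITY_DECL`, `ENTITY_REF` and `expanded_sizes` are made up for illustration). It assumes the internal subset is available as a string, looks only at internal general entities declared with a quoted literal, and estimates how large each one would be after full expansion without actually expanding it. Parameter entities and external (SYSTEM/PUBLIC) entities are ignored.

```python
import re

# Internal general entities only: <!ENTITY name "value"> or <!ENTITY name 'value'>.
# Parameter entities (<!ENTITY % ...>) and external entities are not matched.
ENTITY_DECL = re.compile(r"""<!ENTITY\s+([^\s%]+)\s+(["'])(.*?)\2\s*>""", re.S)
# General entity references such as &foo; (character references like &#38; are skipped).
ENTITY_REF = re.compile(r"&([^;&\s#]+);")


def expanded_sizes(subset):
    """Map each declared entity name to its length after full expansion.

    Raises ValueError if an entity refers to itself, directly or indirectly.
    """
    decls = {name: value for name, _quote, value in ENTITY_DECL.findall(subset)}
    cache = {}

    def size(name, stack=()):
        if name in stack:
            raise ValueError("recursive entity reference: " + name)
        if name not in decls:
            # Undeclared here (e.g. &lt;): count the reference text itself.
            return len(name) + 2
        if name in cache:
            return cache[name]
        value = decls[name]
        # Literal text, plus the expanded size of every nested reference.
        total = len(ENTITY_REF.sub("", value))
        for ref in ENTITY_REF.findall(value):
            total += size(ref, stack + (name,))
        cache[name] = total
        return total

    return {name: size(name) for name in decls}
```

A tool could compare these sizes against a limit before expanding anything, which is one way to avoid the blow-up described in the linked article.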
And of course once I have
Good points all, @dwcramer - thanks.
Enhancement: One challenge with operating on internal subsets via sed or grep is that spacing can be all over the place across a set of XML files:
is the same as:
It would be useful if doctype-tool could format all these in a canonical way (e.g. 'one line per declaration with non-meaningful whitespace normalized') so that you can then write simple grep or sed commands to search or modify them across a tree of files. Both of the above examples when normalized would become:
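Purely as an illustration of the normalization requested here, a minimal Python sketch follows. It is not doctype-tool's behaviour, and the function name `canonicalize_internal_subset` is hypothetical. It assumes the text between `[` and `]` of the DOCTYPE has already been extracted as a string. Each declaration is put on its own line and whitespace outside quoted literals is collapsed, while text inside quotes and comments is left untouched. Processing instructions that contain `>` are not handled.

```python
def canonicalize_internal_subset(subset: str) -> str:
    """Return the subset with one declaration per line.

    Whitespace outside quoted literals is collapsed to a single space;
    quoted literals and comments are copied verbatim.
    """
    lines = []
    i, n = 0, len(subset)
    while i < n:
        ch = subset[i]
        if ch.isspace():
            # Whitespace between declarations is not meaningful.
            i += 1
        elif subset.startswith("<!--", i):
            # Copy a comment unchanged, up to and including '-->'.
            end = subset.find("-->", i + 4)
            end = n if end == -1 else end + 3
            lines.append(subset[i:end])
            i = end
        elif ch == "<":
            # Markup declaration: scan to the closing '>' outside quotes.
            out = []
            quote = None
            while i < n:
                c = subset[i]
                i += 1
                if quote:
                    out.append(c)
                    if c == quote:
                        quote = None
                elif c in "\"'":
                    quote = c
                    out.append(c)
                elif c.isspace():
                    # Collapse a run of whitespace into one space.
                    if out and out[-1] != " ":
                        out.append(" ")
                elif c == ">":
                    # Drop a space left just before the closing '>'.
                    if out and out[-1] == " ":
                        out.pop()
                    out.append(">")
                    break
                else:
                    out.append(c)
            lines.append("".join(out))
        else:
            # Anything else, e.g. a parameter-entity reference such as %name;
            end = i
            while end < n and not subset[end].isspace() and subset[end] != "<":
                end += 1
            lines.append(subset[i:end])
            i = end
    return "\n".join(lines)
```

With output in this shape, a plain `grep '<!ENTITY foo '` or a one-line `sed` substitution behaves the same way across every file in a tree, regardless of how the original declarations were spaced.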