Provide option to canonicalize the internal subset #6

Open
dwcramer opened this issue Aug 1, 2022 · 4 comments

Comments

@dwcramer

dwcramer commented Aug 1, 2022

Enhancement: One challenge with operating on internal subsets via sed or grep is that whitespace and line breaks inside declarations can differ from file to file across a set of XML files:

<!DOCTYPE html "about:legacy-compat" [
       <!ENTITY % foo 
           SYSTEM "path/to/bar.ent"> %foo;
]>

is the same as:

<!DOCTYPE html "about:legacy-compat" [
  <!ENTITY 
   % foo 
  SYSTEM 
  "path/to/bar.ent">
 %foo;
]>

It would be useful if doctype-tool could format all of these in a canonical way (e.g. 'one line per declaration, with insignificant whitespace normalized') so that you can then write simple grep or sed commands to search or modify them across a tree of files. Both of the above examples, when normalized, would become:

<!DOCTYPE html "about:legacy-compat" [
       <!ENTITY % foo SYSTEM "path/to/bar.ent">
       %foo;
]>
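
As a rough illustration of the "one line per declaration" idea, here is a minimal Python sketch. It is not doctype-tool's implementation: the name `normalize_internal_subset` is made up for this example, and it assumes the text between `[` and `]` has already been extracted and is well-formed. Whitespace inside quoted literals is left untouched, since it may be significant.

```python
import re

# One top-level item of an internal subset: a comment, a processing
# instruction, a markup declaration, or a parameter-entity reference.
# Quoted literals inside a declaration may contain '>' characters.
ITEM = re.compile(r'''
    <!--.*?-->
  | <\?.*?\?>
  | <!(?:"[^"]*"|'[^']*'|[^>"'])*>
  | %[^\s;%]+;
''', re.VERBOSE | re.DOTALL)

# Either a quoted literal (captured, kept verbatim) or a run of whitespace.
LITERAL_OR_SPACE = re.compile(r'''("[^"]*"|'[^']*')|\s+''')


def _collapse(decl):
    """Collapse whitespace outside quoted literals to single spaces."""
    out = LITERAL_OR_SPACE.sub(lambda m: m.group(1) or " ", decl)
    # Drop a space left directly before the closing '>'.
    return out[:-2] + ">" if out.endswith(" >") else out


def normalize_internal_subset(subset):
    """Return the items of the subset, one string per line of output.

    Hypothetical helper: comments and processing instructions are
    passed through unchanged; only markup declarations are collapsed.
    """
    lines = []
    for match in ITEM.finditer(subset):
        item = match.group(0)
        if item.startswith("<!") and not item.startswith("<!--"):
            item = _collapse(item)
        lines.append(item)
    return lines
```

Joining the returned list with newlines gives text in which each declaration sits on a single line, which is the form that simple grep or sed patterns can match.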
@AndrewSales AndrewSales self-assigned this Aug 1, 2022
@AndrewSales AndrewSales added the enhancement New feature or request label Aug 1, 2022
@AndrewSales
Owner

Thanks for the suggestion, @dwcramer - this should be perfectly possible, and I'll build it into the implementation of #2.

An extension of this would be to report out the internal subset itself as XML, in the manner of https://github.com/AndrewSales/dtd2xml.

For example, if I run that tool on this document:

<!DOCTYPE html [
  <!ENTITY 
  % foo 
  SYSTEM
  "c:/temp/bar.ent">
 %foo;
]>
<html/>

it produces the following (assuming bar.ent contains just <!ELEMENT foo EMPTY>):

<dtd>
<element id="foo" name="foo">
<model>EMPTY</model>
<parents></parents>
</element>
<externalEntity name="foo" systemId="c:/temp/bar.ent"/>
</dtd>

which may be useful information to have as XML per se, or a downstream process could format these to suit and supply the input envisaged in #5.
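
For illustration only, a report of this shape could be assembled in a few lines of Python. The sketch below mimics just the `externalEntity` element from the output above. It is not how dtd2xml works internally, it only recognizes external parameter entities declared with SYSTEM, it does not read the referenced file (so no `element` entries are produced), and `report_external_entities` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

# Matches declarations such as: <!ENTITY % foo SYSTEM "c:/temp/bar.ent">
# Group 1 is the entity name, group 2 the quoted system identifier.
EXTERNAL_PE = re.compile(
    r'''<!ENTITY\s+%\s+(\S+)\s+SYSTEM\s+("[^"]*"|'[^']*')\s*>'''
)


def report_external_entities(subset):
    """Return a <dtd> report (as a string) for the given internal subset."""
    dtd = ET.Element("dtd")
    for name, literal in EXTERNAL_PE.findall(subset):
        # Strip the surrounding quotes from the system literal.
        ET.SubElement(dtd, "externalEntity", name=name, systemId=literal[1:-1])
    return ET.tostring(dtd, encoding="unicode")


subset = '''
  <!ENTITY
  % foo
  SYSTEM
  "c:/temp/bar.ent">
 %foo;
'''
print(report_external_entities(subset))
```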

@dwcramer
Author

dwcramer commented Aug 1, 2022

That would be wonderful! I think you'd want an xml:base attribute on that dtd element so you know where each report came from when doing downstream processing. I'm imagining a use case where you generate a report from a tree of files, modify the report via XSLT, then use doctype-tool to reapply the doctypes to the files.

Note that entities can contain other unexpanded entities as well as XML. I feel like there will be some subtle issues to consider with nested entities, namespaces, and whitespace.

    <!ENTITY foo "<p>this is a
    para with whitespace that we can't mess with. And a random other &bar; entity
    for good measure.
    </p>">
    <!ENTITY bar "<mml:math xmlns:mml='http://www.w3.org/1998/Math/MathML'/>">

See also https://en.wikipedia.org/wiki/Billion_laughs_attack
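
To make the nested-entity concern concrete, here is a small Python sketch of expanding general entity references with a size cap and a cycle check. It is an assumption-laden illustration, not anything doctype-tool does: `expand`, `ExpansionLimitExceeded`, and the `max_chars` default are invented for the example, and references to undeclared or predefined entities are simply left as written.

```python
import re

# A general entity reference such as &bar; (character references like
# &#10; do not match because '#' cannot start a name).
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")


class ExpansionLimitExceeded(Exception):
    """Raised when expanded text grows past the configured cap."""


def expand(text, entities, max_chars=10_000, _stack=()):
    """Expand references in text using the entities mapping.

    Whitespace in replacement text is preserved exactly. Raises
    ValueError on a self-referencing entity, and
    ExpansionLimitExceeded when the result would exceed max_chars.
    """
    parts, total, pos = [], 0, 0

    def add(chunk):
        nonlocal total
        total += len(chunk)
        if total > max_chars:
            raise ExpansionLimitExceeded(f"expansion exceeds {max_chars} characters")
        parts.append(chunk)

    for match in ENTITY_REF.finditer(text):
        add(text[pos:match.start()])
        name = match.group(1)
        if name in entities:
            if name in _stack:
                raise ValueError(f"entity {name!r} refers to itself")
            add(expand(entities[name], entities, max_chars, _stack + (name,)))
        else:
            add(match.group(0))  # leave unknown or predefined references alone
        pos = match.end()
    add(text[pos:])
    return "".join(parts)
```

Because the cap is checked as each piece is appended, a "billion laughs" style input fails early instead of building the full expansion in memory.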

@dwcramer
Author

dwcramer commented Aug 1, 2022

And of course once I have <externalEntity name="foo" systemId="c:/temp/bar.ent"/> I'll also want to be able to turn the contents of c:/temp/bar.ent into a report, but that means you would have to take catalog files into account...
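
As a sketch of what taking catalogs into account might involve, the Python below reads `system` entries from an OASIS XML catalog and maps a system identifier to its replacement URI. It deliberately ignores everything else a real resolver handles (`public`, `rewriteSystem`, `nextCatalog`, `xml:base`, and so on); the function names and the catalog content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of OASIS XML Catalogs.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def load_system_map(catalog_xml):
    """Return {systemId: uri} for every <system> entry in the catalog."""
    root = ET.fromstring(catalog_xml)
    return {
        entry.get("systemId"): entry.get("uri")
        for entry in root.iter(f"{{{CATALOG_NS}}}system")
    }


def resolve(system_id, system_map):
    """Return the catalog's replacement URI, or the identifier unchanged."""
    return system_map.get(system_id, system_id)


# Hypothetical catalog, for illustration only.
catalog = """
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <system systemId="c:/temp/bar.ent" uri="file:///opt/entities/bar.ent"/>
</catalog>
"""
system_map = load_system_map(catalog)
print(resolve("c:/temp/bar.ent", system_map))
```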

@AndrewSales
Owner

Good points all, @dwcramer - thanks.
I'll set to work on this and see where it leads.


2 participants