Provide option to canonicalize the internal subset #6

dwcramer · 2022-08-01T14:40:39Z

Enhancement: One challenge with operating on internal subsets via sed or grep is that spacing can be all over the place across a set of XML files:

<!DOCTYPE html SYSTEM "about:legacy-compat" [
       <!ENTITY % foo 
           SYSTEM "path/to/bar.ent"> %foo;
]>

is the same as:

<!DOCTYPE html SYSTEM "about:legacy-compat" [
  <!ENTITY 
   % foo 
  SYSTEM 
  "path/to/bar.ent">
 %foo;
]>

It would be useful if doctype-tool could format all of these in a canonical way (e.g. 'one line per declaration with non-meaningful whitespace normalized') so that you can then write simple grep or sed commands to search or modify them across a tree of files. Both of the above examples, when normalized, would become:

<!DOCTYPE html SYSTEM "about:legacy-compat" [
       <!ENTITY % foo SYSTEM "path/to/bar.ent">
       %foo;
]>
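
To make the 'one line per declaration' idea concrete, here is a minimal Python sketch of the normalization. It is not how doctype-tool does or will do it; the function name `canonicalize_internal_subset` and the two-space indent are assumptions for illustration. It takes only the text between `[` and `]`, collapses whitespace outside quoted literals, and leaves literals, comments and processing instructions untouched.

```python
import re

# One item of an internal subset. Comments and PIs are tried first so that
# quotes or '>' inside them are not mistaken for declaration syntax.
ITEM = re.compile(
    r"<!--.*?-->"                            # comment, kept verbatim
    r"|<\?.*?\?>"                            # processing instruction, kept verbatim
    r"|<!(?:\"[^\"]*\"|'[^']*'|[^>\"'])*>"   # markup declaration
    r"|%[^\s;]+;",                           # parameter-entity reference
    re.DOTALL,
)

# Either a quoted literal (captured, so it can be kept as-is) or a whitespace run.
LITERAL_OR_SPACE = re.compile(r"(\"[^\"]*\"|'[^']*')|\s+")


def canonicalize_internal_subset(subset, indent="  "):
    """Return the subset with one declaration or PE reference per line."""
    lines = []
    for match in ITEM.finditer(subset):
        item = match.group(0)
        if item.startswith("<!") and not item.startswith("<!--"):
            # Collapse whitespace outside literals; literals pass through unchanged.
            item = LITERAL_OR_SPACE.sub(lambda m: m.group(1) or " ", item)
            if item.endswith(" >"):
                item = item[:-2] + ">"
        lines.append(indent + item)
    return "\n".join(lines)


first = '\n       <!ENTITY % foo \n           SYSTEM "path/to/bar.ent"> %foo;\n'
second = '\n  <!ENTITY \n   % foo \n  SYSTEM \n  "path/to/bar.ent">\n %foo;\n'
print(canonicalize_internal_subset(first))
print(canonicalize_internal_subset(first) == canonicalize_internal_subset(second))
```

Both inputs above should come out identical, which is what makes a later grep or sed pass predictable. A real implementation would presumably work from the parser's view of the subset rather than from regular expressions.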


AndrewSales · 2022-08-01T17:53:30Z

Thanks for the suggestion, @dwcramer - this should be perfectly possible, and I'll build it into the implementation of #2.

An extension of this would be to report out the internal subset itself as XML, in the manner of https://github.com/AndrewSales/dtd2xml.

For example, if I run that tool on this document:

<!DOCTYPE html [
  <!ENTITY 
  % foo 
  SYSTEM
  "c:/temp/bar.ent">
 %foo;
]>
<html/>

it produces the following (assuming bar.ent contains just <!ELEMENT foo EMPTY>):

<dtd>
<element id="foo" name="foo">
<model>EMPTY</model>
<parents></parents>
</element>
<externalEntity name="foo" systemId="c:/temp/bar.ent"/>
</dtd>

which may be useful information to have as XML per se, or a downstream process could format these to suit and supply the input envisaged in #5.

dwcramer · 2022-08-01T20:59:19Z

That would be wonderful! I think you'd want an xml:base attribute on that dtd element so you can know where each report came from when doing downstream processing. I'm imagining a use case where you generate a report from a tree of files, modify the report via XSLT, then use doctype-tool to reapply the doctypes to the files.
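
As a rough sketch of what I mean by the xml:base attribute, here is how a wrapper script could stamp a report with its source using Python's standard `xml.etree.ElementTree`. The function name `tag_report_with_source` and the example URI are made up for illustration; the report shape is the dtd2xml output shown above.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of xml:base; ElementTree serializes this with the
# reserved "xml" prefix and emits no extra namespace declaration for it.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def tag_report_with_source(report_xml, source_uri):
    """Return the report with xml:base on its root set to source_uri."""
    root = ET.fromstring(report_xml)
    root.set(XML_BASE, source_uri)
    return ET.tostring(root, encoding="unicode")


report = '<dtd><externalEntity name="foo" systemId="c:/temp/bar.ent"/></dtd>'
# The URI below is a hypothetical source document location.
print(tag_report_with_source(report, "file:///docs/chapter1.xml"))
```

With that in place, an XSLT step over a merged set of reports can still tell which file each dtd element describes.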

Note that entities can contain other unexpanded entities as well as XML. I feel like there will be some subtle issues to consider with nested entities, namespaces, and whitespace.

    <!ENTITY foo "<p>this is a
    para with whitespace that we can't mess with. And a random other &bar; entity
    for good measure.
    </p>">
    <!ENTITY bar "<mml:math xmlns:mml='http://www.w3.org/1998/Math/MathML'/>">

See also https://en.wikipedia.org/wiki/Billion_laughs_attack
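
One way to handle the nested case without touching the literal text is to report which entities each internal entity refers to, and never expand anything. A minimal Python sketch follows, assuming a hypothetical helper named `nested_references`. It only looks at internal general entities with quoted values, so external and parameter entities are skipped.

```python
import re

# Internal general entity: <!ENTITY name "value"> or <!ENTITY name 'value'>.
# Declarations starting with '%' (parameter entities) do not match.
ENTITY_DECL = re.compile(r"<!ENTITY\s+([^\s%]\S*)\s+(\"[^\"]*\"|'[^']*')\s*>")

# General entity reference such as &bar; (character references like &#10; excluded).
GENERAL_REF = re.compile(r"&([^\s&;#]+);")


def nested_references(subset):
    """Map each internal general entity name to the names its value references."""
    refs = {}
    for name, literal in ENTITY_DECL.findall(subset):
        # Strip only the surrounding quotes; the value itself is read, not modified.
        refs[name] = GENERAL_REF.findall(literal[1:-1])
    return refs


subset = """
    <!ENTITY foo "<p>this is a
    para with whitespace that we can't mess with. And a random other &bar; entity
    for good measure.
    </p>">
    <!ENTITY bar "<mml:math xmlns:mml='http://www.w3.org/1998/Math/MathML'/>">
"""
print(nested_references(subset))
```

Because nothing is expanded, a report built this way cannot blow up in the billion-laughs manner, and a downstream step can decide how deep to follow the references.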

dwcramer · 2022-08-01T21:59:25Z

And of course once I have <externalEntity name="foo" systemId="c:/temp/bar.ent"/> I'll also want to be able to turn the contents of c:/temp/bar.ent into a report, but that means you would have to take catalog files into account...
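
For the catalog part, the simplest case is an exact `system` entry in an OASIS XML catalog. Here is a minimal Python sketch, assuming a hypothetical helper named `resolve_system_id`. It deliberately ignores `rewriteSystem`, `nextCatalog`, `xml:base` on the catalog, and relative-URI resolution, all of which a real resolver would need.

```python
import xml.etree.ElementTree as ET

# Namespace of OASIS XML Catalogs.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def resolve_system_id(catalog_xml, system_id):
    """Return the catalog's uri for an exact systemId match, else system_id."""
    root = ET.fromstring(catalog_xml)
    for entry in root.iter("{%s}system" % CATALOG_NS):
        if entry.get("systemId") == system_id:
            return entry.get("uri", system_id)
    return system_id


# Hypothetical catalog mapping the system identifier from the report above.
catalog = (
    '<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">'
    '<system systemId="c:/temp/bar.ent" uri="entities/bar.ent"/>'
    "</catalog>"
)
print(resolve_system_id(catalog, "c:/temp/bar.ent"))
```

The resolved location would then be the thing to read when turning the external entity's contents into a report of its own.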

AndrewSales · 2022-08-02T15:41:50Z

Good points all, @dwcramer - thanks.
I'll set to work on this and see where it leads.

AndrewSales self-assigned this Aug 1, 2022

AndrewSales added the "enhancement" (New feature or request) label Aug 1, 2022
