Normalize DOCTYPE? #858

Closed
JakeQZ opened this issue Apr 7, 2020 · 2 comments · Fixed by #866

@JakeQZ
Contributor

JakeQZ commented Apr 7, 2020

Noted in #831 is that masterminds/html5 will always output the HTML5 DOCTYPE as <!DOCTYPE html>, i.e. uppercase DOCTYPE and lowercase html.

This is consistent with the Polyglot Markup recommendation for producers to maximize support.

For #831, we would, at least, need to change a test to expect the above DOCTYPE form in the output where the input is <!DOCTYPE HTML> (with uppercase HTML).

But should we, in any case, normalize the DOCTYPE as above? I.e. always output DOCTYPE in uppercase (which we do anyway), and html in lowercase? If so, should this be done when serializing (rendering) the DOM, parsing the HTML (when the DTD name is read into the DOMDocumentType::$name property), or both (or even possibly in fromDomDocument and/or getDomDocument)?

@oliverklee
Contributor

Yes, let's normalize the DOCTYPE to <!DOCTYPE html>.

Let's do it in the place where it is simplest. Officially, we do not support getting DOMDocuments from other sources anyway, so we can do it anywhere.

@JakeQZ
Contributor Author

JakeQZ commented Apr 8, 2020

The simplest approach, I imagine, would be to lowercase the DOMDocumentType::$name property. Doing this at render() time would create a (possibly undesirable) side effect, so we should do it at fromHtml() time. That would also be consistent with the fact that some other processing may occur at that stage, such as adding a Content-Type.

JakeQZ self-assigned this Apr 20, 2020
JakeQZ added this to the 4.0.0 milestone Apr 20, 2020
JakeQZ added a commit that referenced this issue Apr 23, 2020
Ensure that the DOCTYPE declaration consists of uppercase `DOCTYPE` and
lowercase root element name (`html`).

This is done when the `DOMDocument` is created from an HTML source.  Once the
`DOMDocument` has been created, the `DOMDocumentType` cannot be changed, so the
document type declaration must be manipulated (if necessary) in the HTML
beforehand.  (Since only HTML documents are supported, the declaration is only
normalized when the root element name is HTML, in whatever case - the precise
specification for any element name involves lists of various Unicode character
ranges which it would be superfluous to allow for and try to match.  PHP's
`DOMDocument`/`libxml` itself will output the `DOCTYPE` keyword in uppercase in
any case.)

This normalization is consistent with the relevant part of the
[polyglot markup specification](
  https://dev.w3.org/html5/html-polyglot/html-polyglot.html#doctype
).
While polyglot markup is primarily intended for serialization of HTML as XML
(we don't actually support outputting as XHTML), it is also recommended for
maximum interoperability and robustness when rendering HTML.

This also makes the output consistent with that of `Masterminds/html5-php` and
would eliminate the need to change associated tests specifically for #831.

Closes #858.
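
The approach described in the commit message above (rewriting the declaration in the HTML string before it is loaded into a `DOMDocument`) could be sketched roughly as follows. This is only an illustration, not the code from the referenced commit: the function name `normalizeDocumentType` and the exact regular expression are assumptions made for this example.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not the project's actual implementation): rewrites the
 * first document type declaration whose root element name is `html` (in any
 * case) so that it starts with `<!DOCTYPE html`.
 */
function normalizeDocumentType(string $html): string
{
    // Match `<!DOCTYPE`, whitespace, then the name `html`, all case-insensitively.
    // The lookahead requires the name to end there, so `<!DOCTYPE htmlx>` is left alone.
    // Anything following the name (e.g. a PUBLIC identifier) is kept unchanged.
    $normalized = \preg_replace('/<!DOCTYPE\s++html(?=[\s>])/i', '<!DOCTYPE html', $html, 1);

    // preg_replace() returns null on error; fall back to the unmodified input.
    return $normalized ?? $html;
}

echo normalizeDocumentType('<!DOCTYPE HTML><html></html>');
```

Note that this sketch also collapses the whitespace between `DOCTYPE` and the root element name to a single space, and it only touches the first match, so it would need hardening (for example, against a matching string inside a comment) before real use.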
oliverklee pushed a commit that referenced this issue Apr 23, 2020