Invalid parsing result when head/body tag is missing #166
Comments
Can you post the references to the Mozilla documentation about this?
https://html5.validator.nu/ says it is not valid.
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/html I tried https://validator.w3.org; it also returns an error, but the error looks strange to me: "Element head is missing a required instance of child element title", while there is no head at all.
Official HTML5.2 documentation says:
Can you please post the exact references to the documentation instead of the main links? It is really hard to find the sentences you are referring to.
Look for "Tag omission".
The document you have posted refers to the latest HTML 5.2 specs. This library implements most of the 5.0 specs.
The old HTML5 documentation is the same in this context: https://www.w3.org/TR/2014/REC-html5-20141028/semantics.html#the-html-element
Also, don't miss the fact that DOMDocument parses these correctly.
Good to know.
Well, DOMDocument does not follow the HTML5 logic that closely; internally it is just a relaxed XML parser.
Yeah, the main reason we switched from DOMDocument to this lib was to get better results. In many cases the result is better, but this case obviously looks like a bug. Such "dummy" HTML code is not that uncommon in the email world.
Parsing such chunks of HTML would also be useful when scraping the web and dealing with AJAX responses that contain partials.
Hi, adding another test case to this issue.

Using the native DOMDocument:

    $doc = new DOMDocument();
    $doc->loadHTML('<title>Foo');
    echo $doc->saveHTML();

Output:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><head><title>Foo</title></head></html>

Using this library's implementation:

    $parser = new HTML(['disable_html_ns' => true]);
    $doc = $parser->loadHTML('<title>Foo');
    echo $doc->saveHTML();

Output:

    <html><title>Foo</title></html>
Also, citing the spec for tag omission:
This implies that:
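Until the parser itself inserts the implied elements, a caller could normalize the parsed tree afterwards. The following is only a minimal sketch of that idea, not the library's behavior: `ensureHeadAndBody()` and its list of metadata element names are hypothetical, it assumes a document without the HTML namespace (as with `disable_html_ns` in the test case above), and it is simpler than the spec's tree construction rules, since it moves every metadata element into the head regardless of where it appeared. `loadXML()` stands in here for the parser's output.

```php
<?php
// Hypothetical post-processing helper (not part of the library):
// wraps the children of <html> in an implied <head> and <body>.
function ensureHeadAndBody(DOMDocument $doc): void
{
    $html = $doc->documentElement;
    if ($html === null || strtolower($html->nodeName) !== 'html') {
        return; // no <html> root, nothing to normalize
    }

    // Look for an existing <head> / <body> among the direct children.
    $head = null;
    $body = null;
    foreach ($html->childNodes as $child) {
        $name = strtolower($child->nodeName);
        if ($name === 'head') {
            $head = $child;
        } elseif ($name === 'body') {
            $body = $child;
        }
    }

    // Create the missing wrappers: <head> first, <body> last.
    if ($head === null) {
        $head = $doc->createElement('head');
        $html->insertBefore($head, $html->firstChild);
    }
    if ($body === null) {
        $body = $doc->createElement('body');
        $html->appendChild($body);
    }

    // Assumption: these element names are treated as head content.
    $metadata = ['title', 'meta', 'link', 'style', 'base'];

    // Copy the live node list before moving nodes out of it.
    foreach (iterator_to_array($html->childNodes) as $child) {
        if ($child->isSameNode($head) || $child->isSameNode($body)) {
            continue;
        }
        if (in_array(strtolower($child->nodeName), $metadata, true)) {
            $head->appendChild($child);
        } else {
            $body->appendChild($child);
        }
    }
}

// Stand-in for the tree shape reported in this issue.
$doc = new DOMDocument();
$doc->loadXML('<html><title>Foo</title><p>Hi</p></html>');
ensureHeadAndBody($doc);
echo $doc->saveXML($doc->documentElement);
// prints <html><head><title>Foo</title></head><body><p>Hi</p></body></html>
```

This changes the serialized output as well, which is the trade-off raised elsewhere in this thread: elements added to the tree are also emitted when the document is saved.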
@ju1ius that is a valid point; see my comment #182 (comment) for a possible solution.
The changes to achieve this are difficult and break several existing tests. Adding those elements means that they will also be output - as far as I'm aware it's not possible to parse but not output them... For starters, the document ends after
Consider this:
Imo, this is valid HTML, and it is also parsed correctly by DOMDocument. However, the HTML5 parser ignores the first line of text. We're using the loadHTML() method.
Even this one works with DOMDocument:
According to Mozilla documentation:
Reference: roundcube/roundcubemail#6713 (comment)