Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confused/missing tags in HTML output #208

Open
cxj opened this issue Sep 8, 2018 · 16 comments
Open

Confused/missing tags in HTML output #208

cxj opened this issue Sep 8, 2018 · 16 comments

Comments

@cxj
Copy link
Contributor

cxj commented Sep 8, 2018

This is so strange, it might even be a PHP bug:

<?php
require "vendor/autoload.php";

$xml = <<<XML
<div>
<script>
    var msgs += "<div>";
    msgs += "<button>X</button>";
    msgs += "</div>";
</script>
</div>
XML;

$tpl = new Transphporm\Builder($xml, '');
echo $tpl->output()->body . PHP_EOL;

Produces:

<div>
<script>
	var msgs += "<div>";
	msgs += "<button>X";
	msgs += "</script>
</div>";

Notice how the </button> got swallowed and the closing </div> got included into the Javascript literal and the final </div> went missing as well.

@garrettw
Copy link
Contributor

garrettw commented Sep 10, 2018

Even though it's not being rendered as XML, I wonder if wrapping it in a CDATA thing would make any difference.

Apparently issues with the <script> tag are well-known. www.php.net/manual/en/domdocument.savehtml.php#usernotes

@TomBZombie - It might be good to add some code resembling this comment somewhere between the load and save calls, though I know it only addresses externally-sourced scripts.

@cxj
Copy link
Contributor Author

cxj commented Sep 10, 2018

CDATA wrapping the Javascript does not seem to help. The output I get then becomes:

<div>
<script>
<![CDATA[
	var msgs += "<div>";
	msgs += "<button>X";
	msgs += "</script>
</div>";
]]&gt;

@TRPB
Copy link
Member

TRPB commented Sep 10, 2018

Here's a minimal test script:

<?php

$xml = <<<XML
<div>
<script>
    var msgs += "<div>";
    msgs += "<button>X</button>";
    msgs += "</div>";
</script>
</div>
XML;

$document = new \DomDocument();
$document->loadHtml('<' . '?xml encoding="UTF-8">' . $xml,  LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

echo $document->saveHtml();

For some reason this gives me the output:

PHP Warning:  DOMDocument::loadHTML(): Unexpected end tag : button in Entity, line: 4 in /home/tom/tmp.php on line 14
PHP Warning:  DOMDocument::loadHTML(): Unexpected end tag : script in Entity, line: 6 in /home/tom/tmp.php on line 14
PHP Warning:  DOMDocument::loadHTML(): Unexpected end tag : div in Entity, line: 7 in /home/tom/tmp.php on line 14
<?xml encoding="UTF-8"><div>
<script>
    var msgs += "<div>";
    msgs += "<button>X";
    msgs += "</script></div><p>";

</p>

I have no idea where the <p> tags are coming from!

I think this will have to be reported upstream as I don't think it's something I can fix.

@garrettw
Copy link
Contributor

Ultimately what seems clear to me is that DOMDocument is not able to understand that HTML tags inside of a <script> tag are to be ignored -- and furthermore, HTML tags inside of JS quotes. Fixing this could involve a lot of work upstream, so even if they were to decide to solve it, it may not be any time soon. So maybe coming up with a workaround should be our priority. Obviously making all scripts externally-sourced is not the solution; I wonder if the DOM API provides some way to handle it correctly.

@TRPB
Copy link
Member

TRPB commented Sep 10, 2018

It looks like this is something PHP can't fix: https://bugs.php.net/bug.php?id=74858

http://www.flightlab.com/~joe/sgml/cdata.html dated but still mostly accurate
#5 is what is not supported in HTML 4.

Look. Whether you consider the current behavior to be a bug ("libxml does not parse using current rules") or a request ("libxml should add support for HTML 5"), there is nothing for PHP to do here because this is not a PHP issue. It is a libxml issue. If you want to keep pushing for this then head over to their project.
http://xmlsoft.org/

I'd recommend following best practice and storing your javascript in external files. There are too many permutations for this fix to be applied inside Transphporm.

The bug report on php.net suggests this should work:

$xml = <<<XML
<div>
<script>
    var msgs += "<div>";
    msgs += "<button>X<\/button>";
    msgs += "<\/div>";
</script>
</div>
XML;

But it doesn't, the backslashes are included in the final output.

So maybe coming up with a workaround should be our priority.

Unfortunately I don't think there is a simple solution short of writing a new DOM parser.

As a hack it might be possible to regex /<script>(.*)<\/script>/, replace $1 with a placeholder, parse to DomDocument, then replace the placeholder with the stored value after saveHtml().

A better solution is using pure XML as it's handled correctly there.

<?php

$xml = <<<XML
<?xml version="1.0"?>
<div>
<script>
    var msgs += "<div>";
    msgs += "<button>X</button>";
    msgs += "</div>";
</script>
</div>
XML;

$document = new \DomDocument();
$document->loadXml($xml);

echo $document->saveXml();

outputs:

<?xml version="1.0"?>
<div>
<script>
    var msgs += "<div>";
    msgs += "<button>X</button>";
    msgs += "</div>";
</script>
</div>

tl;dr put <?xml version="1.0"?> at the top of your template and it should just work.

edit: I don't know why using loadXML/saveXML works. I'd expected you to need cdata tags for it to work as intended.

@garrettw
Copy link
Contributor

garrettw commented Sep 10, 2018

That's a shame.
Your hack idea is similar to what I was thinking, though. Any downside to going that route for non-XML documents?

@TRPB
Copy link
Member

TRPB commented Sep 10, 2018

Your hack idea is similar to what I was thinking, though. Any downside to going that route?

It would have to be applied every time output is called so would bypass the cache and create a regex search on every single template load which would be slow. It could be mitigated by if (strpos($xml, '<script> though. I'm guessing most templates that represent a complete web page would have a <script> tag so the regex search would be run pretty much every time. And 99% of the time there is a <script> tag, there won't be any HTML inside it. Is the overhead of running a regex search/replace on pretty much every parse and serialize worth it?

You'd also have to write some fairly complex logic because cases like this would break it:

<div>
<script>
	document.write('<script>some code</script>');
</script>
</div>

Comments <!-- </script> --> will also cause problems. Then you have an issue if someone has done something like foo="<script>". Admittedly a very unlikely scenario but anywhere <script> appears inside a string would need to be detected and accounted for.

It looks like libXML already fixed this in loadXML but for some reason not loadHTML.

We could do something like:

<script transphporm-libxml-bug-hack="true">

and strpos transphporm-libxml-bug-hack to determine whether to apply the fix, but it's still a fair amount of logic for something that can be bypassed by using <?xml version="1.0"?> at the top of the template.

At minimum, we do need to put a warning somewhere in the documentation though. Ideally this will be fixed in PHP and/or libxml.

@garrettw
Copy link
Contributor

In the case of document.write('<script>some code</script>'); one could do something I've seen before: document.write('<scr' + 'ipt>some code</scr' + 'ipt>');

But I agree that given all of that, the overhead isn't worth it. (My idea for attribute sniffing was to use something that's actually valid, such as class="inline-js" or to follow your example, maybe data-transphporm-inline-js="true".)

@cxj
Copy link
Contributor Author

cxj commented Sep 10, 2018

Wow. So if I'm composing my final page out of multiple templates, each template file needs to be proper XML and contain the <?xml version="1.0"?> introductory line?

That certainly seems preferable to placing all of my Javascript in external files. I do place all shared JS in separate files, but some pages need their own few lines of special JS glue to put it all together.

@TRPB
Copy link
Member

TRPB commented Sep 10, 2018

Wow. So if I'm composing my final page out of multiple templates, each template file needs to be proper XML and contain the introductory line?

I wonder if we can be a bit smarter. We had issues when importing HTML into XML but I'd have thought the other way should be ok. I'll need to test it, but it might be possible to just add the XML tag to the template with the script tag.

really though, it'd be better if libxml fixed this as it's clearly a bug their end.

@TRPB
Copy link
Member

TRPB commented Sep 10, 2018

Apparently document.write('<script>some code<\/script>'); is valid HTML or javascript as it works correctly. Maybe javascript sees the \ as an escape character and ignores it so that's another solution:

$xml = <<<XML
<?xml version="1.0"?>
<div>
<script>
    var msgs += "<div>";
    msgs += "<button>X<\/button>";
    msgs += "<\/div>";
</script>
</div>
XML;

$document = new \DomDocument();
$document->loadHtml($xml, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

echo $document->saveHtml();

@cxj
Copy link
Contributor Author

cxj commented Sep 10, 2018

@garrettw Whoops, did I goof you up by deleting my comment? I noticed that Tom had directly addressed most of my concerns earlier, so I removed it as redundant.

The following seems to work for me so far. Is this the proper way to do this? In particular, the repeated uses of <?xml version="1.0"?>, the DOCTYPE declaration, and the CDATA usage?

Main content in a partial, e.g. content.xml:

<?xml version="1.0"?>
<div>
    <!-- bunch of XHTML -->

    <script>
         <![CDATA[
            // Javascript unique to this content/page only.
           ]]>
    </script>
</div>

Main content TSS, e.g. content.tss:

main {
    content: template("content.xml");
}

Some default page layout, e.g. default.xml:

<?xml version="1.0"?>
<!DOCTYPE html>
<html xml:lang="en" lang="en">

<head>
    <!-- The usual JS libs, CSS libs and other head elements -->
</head>
<body>
<main>
    <!-- main content gets inserted here -->
</main>
</body>
</html>

Called with something like:

$tpl = new Transphporm\Builder("default.xml", "content.tss");

@TRPB
Copy link
Member

TRPB commented Sep 10, 2018

You shouldn't need CDATA tags anywhere, libxml seems to handle <script> correctly without it as long as it's loading XML and not HTML. You may also find it easier to use the workaround I posted above

<div>
<script>
   var msgs += "<div>";
   msgs += "<button>X<\/button>";
   msgs += "<\/div>";
</script>
</div>

edit: Which works with XML or HTML.

@cxj
Copy link
Contributor Author

cxj commented Sep 10, 2018

Transphporm throws an exception if I remove the CDATA in my real project.

[Mon Sep 10 17:17:05 2018] code: 8, msg: Trying to get property of non-object
[Mon Sep 10 17:17:05 2018] line 27 in file /staging/vendor/level-2/transphporm/src/Template.php

@TRPB
Copy link
Member

TRPB commented Sep 11, 2018

If you're using HTML everywhere (no <?xml tags anywhere) then this works for content.xml:

<div>
    <!-- bunch of XHTML -->

 <script>
    var msgs += "<div>";
    msgs += "<button>X<\/button>";
    msgs += "<\/div>";
</script>


</div>

For xml you don't need the slashes:

<?xml version="1.0"?>
<div>
    <!-- bunch of XHTML -->

 <script>
    var msgs += "<div>";
    msgs += "<button>X</button>";
    msgs += "</div>";
</script>


</div>

and works as-is in your example.

I can't replicate the error you had, I'm guessing there must be a libxml version difference.

Can you run

php --info | grep libxml

I get

libxml Version => 2.9.8
libxml
libxml2 Version => 2.9.8

@TRPB
Copy link
Member

TRPB commented Sep 11, 2018

Ok I think I can see what's happening here. XML works because it is well formed and libxml does not do anything special with the script tag.

using the XML

 <script>
    var msgs += "<div>";
    msgs += "<button>X</button>";
    msgs += "</div>";

</script>

It generates:

   DomElement: script
        DomText "var msgs += ""
        DomElement div
               DomText "; msgs += ""
               DomElement button
                       DomText "x"
               DomText "msgs += ""
               DomText "";"

If the javascript isn't well formed XML e.g. if (x < y) then it breaks and you will need CDATA. Contrary to what I said above, the better option is to use <\/div> with the escape character.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants