Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow parsing of HTML documents with big number of tags #7331

Closed
dpc22 opened this issue Apr 16, 2020 · 8 comments
Closed

Slow parsing of HTML documents with big number of tags #7331

dpc22 opened this issue Apr 16, 2020 · 8 comments

Comments

@dpc22
Copy link
Contributor

dpc22 commented Apr 16, 2020

Investigating a curious load spike on our Roundcube 1.4.2 installation this morning, I discovered:

Apr 16 03:53:55 webmail-1a roundcube: PHP Fatal error:  Maximum execution time of 120 seconds exceeded in /usr/share/php/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 435
Apr 16 03:54:00 webmail-1a roundcube: PHP Fatal error:  Maximum execution time of 120 seconds exceeded in /usr/share/php/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 435
Apr 16 03:54:17 webmail-1a roundcube: PHP Fatal error:  Maximum execution time of 120 seconds exceeded in /usr/share/php/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 435
Apr 16 03:54:17 webmail-1a roundcube: PHP Fatal error:  Maximum execution time of 120 seconds exceeded in /usr/share/php/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 435
Apr 16 03:54:24 webmail-1a roundcube: PHP Fatal error:  Maximum execution time of 120 seconds exceeded in /usr/share/php/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 435
Apr 16 03:57:01 webmail-1a roundcube: PHP Fatal error:  Maximum execution time of 120 seconds exceeded in /usr/share/php/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 435
Apr 16 03:57:03 webmail-1a roundcube: PHP Fatal error:  Maximum execution time of 120 seconds exceeded in /usr/share/php/Masterminds/HTML5/Parser/DOMTreeBuilder.php on line 435

each of which corresponds to a httpd process running at 100% user CPU for 120 seconds.

Checking back through my roundcube logs for the last 28 days, this does not appear to be a common error.

This was one user repeatedly attempting to view a single message in their inbox.

The message in question was generated by:

X-Mailer: Microsoft Office Outlook 12.0

and is a multipart/alternative message which contains a text/plain part and the equivalent text/html.

The only particularly unusual thing about the message is that it seems to be part of a long running conversation with lots of quoted text using ">>>" in both the text/plain and text/html parts, and absolutely no attempt to prune the content. If I separate the text/plain and text/html content I am left with:

-rw-r--r--  1 cyrus cyrus 243K Apr 16 09:15 11288.
-rw-r--r--  1 cyrus cyrus 4.0M Apr 16 09:16 11289.

where almost all of the 4 MBytes of the html version is quoting of the form:

<P DIR=3DLTR><SPAN LANG=3D"en-gb"><FONT FACE=3D"Consolas">&gt;&gt;&gt; =
</FONT></SPAN></P>

I appreciate that this may be a case of "well, don't do that then", but I though that I ought to at least report this as a potential issue. This is with Roundcube 1.4.2 running on RHEL 7. I can't immediately see anything in the ChangeLog for Roundcube 1.4.3 which is relevant.

Presumably it is just trying to build a horribly messy DOMTree. I can see a small spike in memory usage at the same time that the 5 minute load average on my machines jumped from 0 to 9 for a few minutes.

For what it is worth: /usr/share/php/Masterminds/HTML5/Parser/DOMTreeBuilder.php is provided by:

php-masterminds-html5-2.7.0-1.el7.noarch

and line 435 of that file appears to be the appendChild call in:

      } else {
            // Otherwise, it's a standard element.
            $this->current->appendChild($ele);

Redhat don't appear to have a more recent version of that rpm available.

@dpc22 dpc22 changed the title Attempt to view specific message in triggers "Maximum execution time of 120 seconds" timeout Attempt to view specific message triggers "Maximum execution time of 120 seconds" timeout Apr 16, 2020
@alecpl
Copy link
Member

alecpl commented Apr 16, 2020

Looks like Masterminds/HTML5 bug. What PHP version? Could you provide the sample message?

@dpc22
Copy link
Contributor Author

dpc22 commented Apr 16, 2020

We are using the following rpm from Redhat:

php-5.4.16-46.1.el7_7.x86_64

I appreciate that PHP 5.4 is ancient, but this is the version that Redhat provide in RHEL 7. You have to go to third party repositories to get PHP 7. The INSTALL file that ships with Roundcube does state:

PHP Version 5.4 or greater ...

I don't think that I can give you the original message as that is personal/private data. I will see if I can manufacture something of a similar form which exhibits the same effect.

@dpc22
Copy link
Contributor Author

dpc22 commented Apr 16, 2020

The attached message provokes the problem.

test.eml.gz

That is just 46396 repetitions of the following to generate a 4 MByte text/html section.

<P DIR=3DLTR><SPAN LANG=3D"en-gb"><FONT FACE=3D"Consolas">&gt;&gt; =
</FONT></SPAN></P>

@alecpl
Copy link
Member

alecpl commented Apr 16, 2020

Thanks. I confirmed the issue with PHP 7.4. I also tested older Mastermind's lib version with no luck.

If you need a workaround, you can uninstall the lib, in this case Roundcube will fallback to built-in parser. I tested the built-in parser, it works (still it needs 12 seconds to parse such a big message).

@alecpl
Copy link
Member

alecpl commented Apr 19, 2020

I investigated this more in Masterminds/html5-php#181. It looks like the HTML5 parser is just slow for documents with a big number of tags. And it probably will be hard to make it faster. That's why I think we'll disable HTML5 parser (and use the built-in parser) for documents with >10000 tags.

@alecpl alecpl changed the title Attempt to view specific message triggers "Maximum execution time of 120 seconds" timeout Slow parsing of HTML documents with big number of tags Apr 19, 2020
@alecpl alecpl modified the milestones: later, 1.4.4 Apr 22, 2020
alecpl added a commit that referenced this issue Apr 22, 2020
alecpl added a commit that referenced this issue Apr 22, 2020
@alecpl
Copy link
Member

alecpl commented Apr 22, 2020

There's no easy fix for that. Fixed by disabling the HTML5 parser when HTML contains more than 10000 tags.

@alecpl alecpl closed this as completed Apr 22, 2020
@alecpl
Copy link
Member

alecpl commented Apr 22, 2020

Oops, fixed typo in b35b5a1.

@trasherdk
Copy link

Is there a setting to limit number of tags allowed?
I'd love to bounce stuff like excessive nested tags.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants