-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow parsing of HTML documents with big number of tags #7331
Comments
Looks like Masterminds/HTML5 bug. What PHP version? Could you provide the sample message? |
We are using the following rpm from Redhat: php-5.4.16-46.1.el7_7.x86_64 I appreciate that PHP 5.4 is ancient, but this is the version that Redhat provide in RHEL 7. You have to go to third party repositories to get PHP 7. The INSTALL file that ships with Roundcube does state:
I don't think that I can give you the original message as that is personal/private data. I will see if I can manufacture something of a similar form which exhibits the same effect. |
The attached message provokes the problem. That is just 46396 repetitions of the following to generate a 4 MByte text/html section.
|
Thanks. I confirmed the issue with PHP 7.4. I also tested older Mastermind's lib version with no luck. If you need a workaround, you can uninstall the lib, in this case Roundcube will fallback to built-in parser. I tested the built-in parser, it works (still it needs 12 seconds to parse such a big message). |
I investigated this more in Masterminds/html5-php#181. It looks like the HTML5 parser is just slow for documents with a big number of tags. And it probably will be hard to make it faster. That's why I think we'll disable HTML5 parser (and use the built-in parser) for documents with >10000 tags. |
There's no easy fix for that. Fixed by disabling the HTML5 parser when HTML contains more than 10000 tags. |
Oops, fixed typo in b35b5a1. |
Is there a setting to limit number of tags allowed? |
Investigating a curious load spike on our Roundcube 1.4.2 installation this morning, I discovered:
each of which corresponds to a httpd process running at 100% user CPU for 120 seconds.
Checking back through my roundcube logs for the last 28 days, this does not appear to be a common error.
This was one user repeatedly attempting to view a single message in their inbox.
The message in question was generated by:
and is a multipart/alternative message which contains a text/plain part and the equivalent text/html.
The only particularly unusual thing about the message is that it seems to be part of a long running conversation with lots of quoted text using ">>>" in both the text/plain and text/html parts, and absolutely no attempt to prune the content. If I separate the text/plain and text/html content I am left with:
where almost all of the 4 MBytes of the html version is quoting of the form:
I appreciate that this may be a case of "well, don't do that then", but I though that I ought to at least report this as a potential issue. This is with Roundcube 1.4.2 running on RHEL 7. I can't immediately see anything in the ChangeLog for Roundcube 1.4.3 which is relevant.
Presumably it is just trying to build a horribly messy DOMTree. I can see a small spike in memory usage at the same time that the 5 minute load average on my machines jumped from 0 to 9 for a few minutes.
For what it is worth:
/usr/share/php/Masterminds/HTML5/Parser/DOMTreeBuilder.php
is provided by:and line 435 of that file appears to be the appendChild call in:
Redhat don't appear to have a more recent version of that rpm available.
The text was updated successfully, but these errors were encountered: