Bug: massive memory usage for parsing a revision #286
Related to #40. Is there some unclosed tag or wiki markup in that revision?
Thanks for chiming in. Yes, I think that's what is going on, but my understanding is that the max recursion depth (code) should prevent these unclosed tags from consuming so much memory? Perhaps it's because the wikitext has so many in this example (>1000). I guess that hints at my cheap fix too: check the number of open tags against the number of close tags and make sure they aren't too uneven before parsing. But that's not an easy check to add, and it has all sorts of edge cases (e.g., some tags don't have equivalent closing tags).
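The "cheap fix" above could be sketched roughly like this, as a pre-parse heuristic. This is only an illustration, not code from the library: the regexes, the void-tag list, and the `max_skew` threshold are all assumptions, and the edge case noted above (tags with no closing counterpart) is handled by simply ignoring a guessed set of such tags.

```python
import re

# Illustrative patterns for HTML-style open/close tags in wikitext.
OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)(?:\s[^<>]*)?>")
CLOSE_TAG = re.compile(r"</([a-zA-Z][a-zA-Z0-9]*)\s*>")

# Guessed set of tags that legitimately have no closing tag (assumption).
VOID_TAGS = {"br", "hr", "img"}

def looks_balanced(text, max_skew=100):
    """Return False if open/close tag counts are wildly uneven,
    suggesting the kind of pathological wikitext described above."""
    opens = [m.group(1).lower() for m in OPEN_TAG.finditer(text)
             if m.group(1).lower() not in VOID_TAGS]
    closes = [m.group(1).lower() for m in CLOSE_TAG.finditer(text)]
    return abs(len(opens) - len(closes)) <= max_skew
```

A caller would then skip (or specially handle) revisions where `looks_balanced` returns False before handing them to the parser.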
I guess the runaway memory usage is from our cache of which combinations of tag locations are known to be invalid. We're probably storing that in a very inefficient way and doing unnecessary backtracking. Thanks for the bug report, and sorry for being unresponsive lately. I'll think more about this.
That explanation makes sense. Decreasing the recursion depth would probably help, but it's a blunt solution for something that's really about how to manage the possible combinations in these complicated, unclosed-tag situations. Thanks in advance for taking a look!
We've been parsing lots of historical Wikipedia revisions and are running into weird edge cases. One of these is the revision below, which somehow escapes the recursion-depth limit and consumes a massive amount of memory when parsed. Obviously a patch would be great, but I'm also happy to hear any ideas for easily killing the parsing before its memory usage gets out of hand (without knowing the problematic revision IDs in advance).
Details:
The wikitext itself isn't particularly long, but it's a bunch of HTML tags all nested within links, etc.: https://simple.wikipedia.org/w/index.php?title=2000s_(decade)&action=edit&oldid=69784
Code to replicate -- it very quickly consumed memory and was killed on PAWS (3 GB RAM max), but I'm told it gets up to about 30 GB if that's available:
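One way to kill a runaway parse before it exhausts the machine, without knowing problematic revision IDs in advance, is to run each parse in a child process with an address-space cap. This is a sketch under assumptions: it is Unix-only (`resource.RLIMIT_AS`), the 1 GiB default and timeout are illustrative, and the parse function is passed in by the caller.

```python
import multiprocessing
import resource
from queue import Empty

def _worker(fn, arg, max_bytes, q):
    # Cap the child's address space: a runaway parse then raises
    # MemoryError (or the child dies) instead of exhausting the machine.
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    try:
        q.put(("ok", fn(arg)))
    except MemoryError:
        q.put(("oom", None))

def run_capped(fn, arg, max_bytes=1 << 30, timeout=120):
    """Run fn(arg) in a memory-capped child; returns (status, value)."""
    q = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(fn, arg, max_bytes, q))
    proc.start()
    try:
        status, value = q.get(timeout=timeout)
    except Empty:
        # Child died or hung before reporting (e.g. hard OOM kill).
        status, value = "killed", None
    if proc.is_alive():
        proc.terminate()
    proc.join()
    return status, value
```

In this issue's setting, `fn` would be something like `mwparserfromhell.parse` (hypothetical usage): `run_capped(mwparserfromhell.parse, wikitext)` would return `("oom", None)` or `("killed", None)` for pathological revisions instead of taking the whole job down.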