CsQuery.HtmlParser.HtmlData.TokenIDs is getting too big! #189
Comments
There's actually an open issue related to this (more specifically, the dictionary overflowing). Several solutions have been discussed, such as making the token cache per-instance, or using two dictionaries, since the dictionary is initialized with "hardcoded" tokens and it would be expensive to repopulate it each time. Alternatively, a smart cache could GC entries that are infrequently accessed. I just haven't had any time to do it because I don't use CsQuery during my day job any more. In the meantime, your idea is actually not bad; it would at least make the bug manageable by end users. It's not quite that trivial, since the cache would have to be reinitialized with the static data, but it's still pretty simple compared to the permanent fix.
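For anyone curious what the two-dictionary approach could look like, here is a minimal sketch. It assumes a split between a read-only table of hardcoded tokens and a clearable table of tokens discovered at parse time; the type name, field names, and ID values are illustrative, not CsQuery's actual internals:

```csharp
using System.Collections.Generic;

// Sketch of the "two dictionaries" idea: built-in tokens live in a read-only
// dictionary populated once, while tokens discovered during parsing go into a
// second dictionary that can be cleared cheaply without re-adding the hardcoded set.
static class TokenTableSketch
{
    static readonly Dictionary<string, ushort> StaticTokenIDs =
        new Dictionary<string, ushort> { { "div", 2 }, { "span", 3 } /* ... hardcoded tokens ... */ };

    static Dictionary<string, ushort> DynamicTokenIDs = new Dictionary<string, ushort>();
    static ushort nextID = 100; // first ID reserved for dynamic tokens (illustrative value)

    public static ushort TokenID(string token)
    {
        ushort id;
        if (StaticTokenIDs.TryGetValue(token, out id)) return id;
        if (!DynamicTokenIDs.TryGetValue(token, out id))
        {
            id = nextID++;
            DynamicTokenIDs[token] = id;
        }
        return id;
    }

    // Clearing only the dynamic table avoids repopulating the hardcoded tokens.
    public static void ClearDynamicTokens()
    {
        DynamicTokenIDs = new Dictionary<string, ushort>();
        nextID = 100;
    }
}
```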
Jamie, thanks for your quick answer. I realised the code I had wasn't working this morning :) So after a bit of trial and error, this seems to work:

```csharp
public static void ClearCachedTokens()
{
    nextID = 2;
    TokenIDs = new Dictionary<string, ushort>(); // not necessary actually, it's in the constructor
    Tokens = new List<string>();
    InitInternal();
}
```

InitInternal is actually the contents of the static constructor, moved out into its own method so it can be reused. Maybe this helps anyone interested.
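For batch workloads, a usage sketch along these lines might help, assuming ClearCachedTokens() has been added to HtmlData as shown above; LoadDocumentsFromArchive() is just a placeholder for however you read documents:

```csharp
// Hypothetical call pattern: reset the token cache periodically while
// batch-processing many documents so TokenIDs cannot grow unbounded.
int processed = 0;
foreach (string html in LoadDocumentsFromArchive()) // placeholder enumerator
{
    var dom = CsQuery.CQ.CreateDocument(html);
    // ... work with the document ...

    if (++processed % 1000 == 0)
    {
        CsQuery.HtmlParser.HtmlData.ClearCachedTokens(); // hypothetical method from the snippet above
    }
}
```

The interval is arbitrary; the trade-off is re-running InitInternal() versus letting the dictionary keep growing.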
…mplemented in HtmlData.cs. This addresses issues where the `nextID` counter overflows ushort.MaxValue and wraps around to return the wrong index value, resulting in wrong strings output from the `Render` method. Fix jamietre#204, jamietre#205, jamietre#189, jamietre#164
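As an aside, the wraparound the commit message describes is easy to see in isolation. This is not CsQuery code, just an illustration of how a ushort counter behaves once it passes ushort.MaxValue:

```csharp
// A ushort counter that passes ushort.MaxValue silently wraps to 0 in the
// default unchecked context, so new tokens start receiving IDs that collide
// with existing entries, which is what produces the wrong rendered strings.
class UshortWrapDemo
{
    static void Main()
    {
        ushort nextID = ushort.MaxValue;  // 65535
        nextID++;                         // wraps to 0, no exception is thrown
        System.Console.WriteLine(nextID); // prints 0
    }
}
```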
Hey! First of all, GREAT library. From what I noticed so far, it's pretty fast.
I discovered a problem while using CsQuery on multiple HTML files from a WARC archive containing 4 GB of data. The property CsQuery.HtmlParser.HtmlData.TokenIDs is getting too big for its own good. In my case, after a few minutes, it reaches over 10 MB in size, and because of it, parsing gets slower and slower.
Since you don't know when to clear it, I would suggest just adding a cleanup method, like:
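The original code suggestion is missing from this extract; something along these lines was presumably intended (only a sketch: the field names follow HtmlData, but the exact reset logic is an assumption, and the working version appears in the comment earlier in the thread):

```csharp
// Sketch of the suggested cleanup method: reset the counter, drop the cached
// token tables, and re-add the hardcoded tokens.
public static void ClearCachedTokens()
{
    nextID = 2;
    TokenIDs = new Dictionary<string, ushort>();
    Tokens = new List<string>();
    InitInternal();
}
```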