Tokenizer gets stuck on bad-match regex matching #14
Another input I found that causes it to get stuck in the while loop: String test = "7h... nh? a nhìu th?t nhìu ^^~..........................................................................." It looks like the "....." pattern is the problem?
This is not the only bad example. Also, I don't think the problem is the while loop itself; it might be catastrophic regex backtracking, which can take effectively forever.
Yeah it's one of the regexes for sure... On Sun, Oct 21, 2012 at 11:10 PM, haijieg [email protected] wrote:
If you could assemble as many as you can find, that would be helpful. I'm trying to create minimal test cases and it's very subtle. (The problem is the non-determinism of the regex engine: once you start using lookaheads/lookbehinds it's no longer a DFA, and horrible things can happen. I'm afraid this is looking complicated to figure out.)
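The blow-up described above is easiest to see with a toy pattern rather than the actual Twokenize regexes. The sketch below is illustrative only: nested quantifiers over the same characters force a backtracking engine to try exponentially many ways of splitting the run before it can report failure, while a possessive quantifier never gives characters back and fails in one pass.

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // Pathological: (a+)+ can split a run of n 'a's in 2^(n-1) ways,
        // and on a non-matching input the engine tries all of them.
        Pattern pathological = Pattern.compile("(a+)+b");

        // Possessive: a++ never backtracks into the run, so failure
        // is detected after a single pass over the input.
        Pattern safe = Pattern.compile("a++b");

        String input = "aaaaaaaaaaaaaaaaaaaac"; // 20 a's then 'c': no match
        long t0 = System.nanoTime();
        boolean m1 = pathological.matcher(input).matches();
        long slowNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        boolean m2 = safe.matcher(input).matches();
        long fastNs = System.nanoTime() - t0;

        System.out.printf("pathological=%b (%d ns), safe=%b (%d ns)%n",
                          m1, slowNs, m2, fastNs);
    }
}
```

Adding a few more 'a's roughly doubles the pathological pattern's running time each time, which is exactly the "can take years" behavior seen on long `.....` runs.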
Actually, no, don't bother. I think I see the problem (the eastern emoticon system).
OK, it's due to the use of a backreference (or some sort of regex pattern reference) in "basicface" when it's embedded inside "eastEmote". I'll have to consult the author of this code to see what's going on.
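To illustrate the class of bug described here (this is a toy example, not the actual basicface/eastEmote patterns): when a sub-pattern containing a numbered backreference is spliced into a larger pattern by string concatenation, the backreference silently rebinds to whichever capturing group carries that number in the combined pattern.

```java
import java.util.regex.Pattern;

public class GroupShiftDemo {
    public static void main(String[] args) {
        // Standalone: \1 refers to the (x) group, so "xx" matches.
        String sub = "(x)\\1";
        System.out.println(Pattern.compile(sub).matcher("xx").matches());

        // Embedded after another capturing group, group 1 is now (y),
        // so \1 must repeat "y": the combined pattern matches "yxy",
        // and no longer matches "yxx".
        String combined = "(y)" + sub;
        System.out.println(Pattern.compile(combined).matcher("yxy").matches());
        System.out.println(Pattern.compile(combined).matcher("yxx").matches());
    }
}
```

The usual fixes are non-capturing groups `(?:...)` for structure-only grouping, or named groups and named backreferences so the binding survives embedding.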
Fixed the error where ........ would tokenize to ... ... ..., and the error where ~......... would hang. Hopefully fixes GitHub issue #14 (from tobiowo@471d223, but with CRLF->LF). Change analysis: in a sample of 1,355,000 tweets, this changes 30,000 of them; they all look like improvements, mostly around ellipses.
Hi, I just pushed a bugfix to master that makes these examples work. Could you try it on the large dataset? Let me know if building it is annoying (see the comment I wrote on #15), and re:
Of course. Welcome to large-scale text processing: every one-in-a-billion bug will happen :) .... So, especially in a distributed system, you have to log which section of the dataset your runner is working on... When I used to use Hadoop for NLP software that had once-per-10-million or once-per-100-million bugs like this, we had to run the NLP system out-of-process to monitor for deadlocks like this one, so you could notice when the process wasn't working and kill it. Not a fun situation, sorry :(
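A minimal in-process sketch of the watchdog idea, assuming a hypothetical `tokenize` stand-in for the real tokenizer call. One caveat: `java.util.regex.Matcher` does not check the thread interrupt flag while matching, so a genuinely stuck regex may still require a separate process (as described above) or an interrupt-aware input wrapper to actually die.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;

public class TokenizeWatchdog {
    // Hypothetical stand-in for the real tokenizer call.
    static List<String> tokenize(String tweet) {
        return Arrays.asList(tweet.split("\\s+"));
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<List<String>> result = pool.submit(() -> tokenize("hello world"));
        try {
            // Kill-switch: give each tweet at most 500 ms.
            System.out.println(result.get(500, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Only actually stops the work if the task is interruptible.
            result.cancel(true);
            System.out.println("tokenizer hung; skipping tweet");
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Logging the dataset section alongside each timeout makes it possible to reproduce the offending input later.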
I just ran it on 1,489,999 or so tweets just now, and here is the distribution of the number of milliseconds per tweet. The 8 worst take more than 500 ms each, which isn't good (max 921 ms)... this sounds like another regex bug, perhaps... but the rest seem to terminate OK. Maybe if you go to a billion tweets the maxes are worse? I started a larger job, but it won't finish for a while. (Another caveat with this experiment: it includes JSON parsing, done in-process with --input-format json.)
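Tracking the worst-k per-item latencies in a run like this can be done with a bounded min-heap, so memory stays constant over millions of tweets. A sketch, with a trivial stand-in workload (not the actual benchmarking code used above):

```java
import java.util.PriorityQueue;

public class LatencyTracker {
    public static void main(String[] args) {
        // Min-heap holding the k worst latencies seen so far: the root is
        // the smallest of the current worst-k, so it is evicted whenever a
        // slower item arrives.
        int k = 8;
        PriorityQueue<Long> worst = new PriorityQueue<>();
        String[] items = {"a", "bb ccc", "dddd ....."}; // stand-in data
        for (String item : items) {
            long t0 = System.nanoTime();
            item.split("\\s+");  // stand-in for per-tweet tokenization
            long ms = (System.nanoTime() - t0) / 1_000_000;
            worst.add(ms);
            if (worst.size() > k) worst.poll();
        }
        System.out.println("worst-" + k + " latencies (ms): " + worst);
    }
}
```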
Cool. I'm going to test it on the daily.10k tonight. On Tue, Oct 23, 2012 at 1:16 AM, brendano [email protected] wrote:
It passed the test on daily.10k with 14,720,000 tweets. Will try a larger one. Jay On Tue, Oct 23, 2012 at 1:25 AM, Haijie Gu [email protected] wrote:
I found a problem using your Twokenize tokenizer.
The program gets stuck in the Twokenize.simpleTokenize() method, in this part:
while (matches.find()) {
    // The spans of the "bads" should not be split.
    if (matches.start() != matches.end()) { // unnecessary?
        List<String> bad = new ArrayList<String>(1);
        bad.add(splitPunctText.substring(matches.start(), matches.end()));
        bads.add(bad);
        badSpans.add(new Pair<Integer, Integer>(matches.start(), matches.end()));
    }
}
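The loop above never terminates because `matches.find()` itself is stuck inside the regex engine, not because of the loop logic. A well-known workaround (not part of Twokenize) makes such a match abortable: `Matcher` reads its input through `CharSequence.charAt`, so a wrapper that checks the interrupt flag there lets a watchdog thread break a runaway match.

```java
import java.util.regex.Pattern;

// Wrapper that makes regex matching respond to Thread.interrupt(),
// because Matcher calls charAt() constantly while backtracking.
final class InterruptibleCharSequence implements CharSequence {
    private final CharSequence inner;

    InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }

    @Override public char charAt(int index) {
        if (Thread.currentThread().isInterrupted()) {
            throw new RuntimeException("regex match interrupted");
        }
        return inner.charAt(index);
    }
    @Override public int length() { return inner.length(); }
    @Override public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end));
    }
    @Override public String toString() { return inner.toString(); }

    public static void main(String[] args) {
        // Matching works as usual through the wrapper.
        boolean ok = Pattern.compile("a+b")
                .matcher(new InterruptibleCharSequence("aaab"))
                .matches();
        System.out.println("matched: " + ok);
    }
}
```

With this wrapper, a hung `matches.find()` can be terminated by interrupting the matching thread from a timer, instead of the process-level kill discussed earlier in the thread.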
Example string: String test = "@rkdalswl0302 완댜)물론~얼굴은아니지만....................................................................................................";