
Tokenizer gets stuck on bad-match regex matching #14

Open
AWe opened this issue Oct 10, 2012 · 11 comments
AWe commented Oct 10, 2012

I found a problem using your Twokenize tokenizer.

The program gets stuck in the Twokenize.simpleTokenize() method, in this part:
while (matches.find()) {
    // The spans of the "bads" should not be split.
    if (matches.start() != matches.end()) { // unnecessary?
        List<String> bad = new ArrayList<String>(1);
        bad.add(splitPunctText.substring(matches.start(), matches.end()));
        bads.add(bad);
        badSpans.add(new Pair<Integer, Integer>(matches.start(), matches.end()));
    }
}

Example string: String test = "@rkdalswl0302 완댜)물론~얼굴은아니지만....................................................................................................";
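Until the underlying pattern is fixed, a caller can defend against this kind of hang by running the match on a worker thread with a time budget. This is a generic sketch, not part of Twokenize; the names `RegexTimeout` and `matchesWithin` are made up for illustration. It uses the standard trick of wrapping the input in a `CharSequence` whose `charAt()` checks the interrupt flag, since `Matcher` itself never checks for interruption:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

class RegexTimeout {

    // CharSequence wrapper that aborts a running match when its thread is
    // interrupted. Matcher calls charAt() constantly, so throwing from there
    // unwinds even a catastrophically backtracking match.
    static class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;
        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }
        public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException("regex match interrupted");
            }
            return inner.charAt(index);
        }
        public int length() { return inner.length(); }
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }
        public String toString() { return inner.toString(); }
    }

    // Returns true iff the whole input matches within the time budget;
    // returns false on timeout (and interrupts the runaway matcher thread).
    public static boolean matchesWithin(String regex, final String input, long millis) {
        final Pattern p = Pattern.compile(regex);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> result = pool.submit(new Callable<Boolean>() {
                public Boolean call() {
                    return p.matcher(new InterruptibleCharSequence(input)).matches();
                }
            });
            try {
                return result.get(millis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                result.cancel(true); // interrupt -> charAt() aborts the match
                return false;
            } catch (Exception e) {
                return false;
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

This doesn't fix the tokenizer, but it turns an indefinite hang into a bounded per-tweet cost, which matters when a single bad input can stall a whole batch job.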

@haijieg

haijieg commented Oct 19, 2012

Another input I found that causes the while loop to get stuck:

String test = "7h... nh? a nhìu th?t nhìu ^^~...........................................................................";

It looks like the "....." pattern is the problem?

@haijieg

haijieg commented Oct 22, 2012

This is not the only bad example. Also, I don't think the problem is the while loop; it's more likely catastrophic regex backtracking, which can take years.

@brendano
Owner

Yeah it's one of the regexes for sure...


@brendano
Owner

If you could assemble as many as you can find, that would be helpful. I'm trying to create minimal test cases and it's very subtle. (The problem is nondeterminism in the regex engine... once you start using lookaheads/lookbehinds it's no longer a DFA and horrible things can happen. I'm afraid this is looking complicated to figure out.)
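The blowup is easy to observe directly. The sketch below is illustrative only: `(a|aa)+b` is a stand-in pattern, not one of Twokenize's regexes. It counts how many characters the engine reads via a wrapping `CharSequence`; on a non-matching input, the work grows exponentially with input length, which is the "can take years" behavior:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BacktrackWork {

    // CharSequence that counts every charAt() call, i.e. the engine's reads.
    static class CountingCharSequence implements CharSequence {
        final CharSequence inner;
        long reads = 0;
        CountingCharSequence(CharSequence inner) { this.inner = inner; }
        public char charAt(int i) { reads++; return inner.charAt(i); }
        public int length() { return inner.length(); }
        public CharSequence subSequence(int s, int e) { return inner.subSequence(s, e); }
        public String toString() { return inner.toString(); }
    }

    // Number of character reads the engine performs trying (and failing) to match.
    public static long workFor(String regex, String input) {
        CountingCharSequence cs = new CountingCharSequence(input);
        Matcher m = Pattern.compile(regex).matcher(cs);
        m.matches(); // fails on our inputs; we only care about the cost
        return cs.reads;
    }

    // n letters 'a' followed by '!', so "(a|aa)+b" can never match and the
    // engine must try every way of splitting the run into 1s and 2s.
    public static String aRun(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append('a');
        return sb.append('!').toString();
    }
}
```

Doubling the input here multiplies the reads many times over, even though no lookarounds are involved; ambiguous quantified alternation is enough.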

@brendano
Owner

Actually, no, don't bother; I think I see the problem (the eastern emoticon system).

@brendano
Owner

OK, it's due to a use of a backreference (or some sort of regex pattern reference) in "basicface" when it's embedded inside "eastEmote". I'll have to consult the author of that part to see what's going on.
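I don't have the actual basicface/eastEmote definitions handy here, but the failure mode is the classic one: embedding a sub-pattern that already contains a quantifier inside another quantified group produces nested quantifiers like `(x+)+`, which backtrack exponentially when the overall match fails. A simplified, hypothetical stand-in (the dot patterns below are NOT the real emoticon regexes) also shows one standard fix: making the outer loop possessive (`++`) cuts the backtracking and turns the hang into a fast failure:

```java
import java.util.regex.Pattern;

class NestedQuantifier {
    // Hypothetical stand-ins, not the real basicface/eastEmote patterns.
    // SLOW backtracks exponentially on a long run of dots with no final '~':
    // the engine tries every way of splitting the run across loop iterations.
    static final Pattern SLOW = Pattern.compile("(\\.+)+~");

    // FAST matches exactly the same strings, but the possessive outer '++'
    // forbids re-splitting the dot run, so a failed match is linear time.
    static final Pattern FAST = Pattern.compile("(\\.+)++~");

    public static boolean fastMatch(String s) {
        return FAST.matcher(s).matches();
    }

    public static String dots(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append('.');
        return sb.toString();
    }
}
```

Atomic groups (`(?>...)`) achieve the same thing where a possessive quantifier doesn't fit; either way the point is to stop the engine from revisiting how an unambiguous run was consumed.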

brendano added a commit that referenced this issue Oct 23, 2012
Fixed error where ........ would tokenize to ... ... ...
and the error where ~......... would hang

Hopefully fixes github issue #14

(from tobiowo@471d223
but with CRLF->LF)

Change analysis:
- sample of 1,355,000 tweets
- changes 30,000 of them
- they all look like improvements, mostly the ellipsis
@brendano
Owner

Hi, I just pushed a bugfix to master that makes these examples work. Could you try it on the large dataset? Let me know if building it is annoying (see the comment I wrote on #15).

and re:

This is not the only bad example.

Of course. Welcome to large-scale text processing: every one-in-a-billion bug will happen :) .... so, especially in a distributed system, you have to log which section of the dataset your runner is working on...

When I used to use Hadoop for NLP software that had a bug like this once per 10 million or 100 million documents, we had to run the NLP system out-of-process and monitor for deadlocks like this one, so you could notice when a process wasn't making progress and kill it. Not a fun situation, sorry :(
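The out-of-process pattern described above is straightforward with java.lang.Process: run the worker as a child process, wait with a deadline, and kill it if the deadline passes. A minimal sketch (Java 8+; `Watchdog` and `runWithDeadline` are made-up names, and the demo assumes a POSIX `sleep` command stands in for the real tokenizer process):

```java
import java.util.concurrent.TimeUnit;

class Watchdog {
    // Runs the given command; returns true if it exits successfully within
    // the deadline, false if we had to kill it (the "stuck on one shard" case).
    public static boolean runWithDeadline(long seconds, String... command)
            throws Exception {
        Process p = new ProcessBuilder(command).inheritIO().start();
        if (p.waitFor(seconds, TimeUnit.SECONDS)) {
            return p.exitValue() == 0;
        }
        p.destroyForcibly().waitFor(); // kill the hung worker and reap it
        return false;
    }
}
```

Because the worker is a separate OS process, even a regex stuck deep inside a native-speed backtracking loop can always be killed, which an in-process watchdog cannot guarantee.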

@brendano
Owner

I just ran it on 1489999 or so tweets just now, and here is the distribution of the number of milliseconds per tweet. The 8 worst ones are over 500 ms, which isn't good (max 921 ms)... this sounds like another regex bug, perhaps... but the rest seem to terminate OK. Maybe if you go to a billion tweets the maxes are worse? I started a larger job, but it won't finish for a while. (Another caveat with this experiment: it includes JSON parsing, done in-process with --input-format json.)

[Three screenshots of the per-tweet latency distribution were attached.]
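Measuring that distribution needs nothing fancy: a per-item timer that records the worst offenders is enough to surface the next regex bug. A hedged sketch, not taken from this repo (`LatencyProbe`/`slowerThan` are invented names, and the `Consumer<String>` stands in for whatever per-tweet work you run):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class LatencyProbe {
    // Applies work to each item; returns the items whose processing took
    // longer than thresholdMillis, for later investigation.
    public static List<String> slowerThan(List<String> items,
                                          long thresholdMillis,
                                          Consumer<String> work) {
        List<String> slow = new ArrayList<>();
        for (String item : items) {
            long start = System.nanoTime();
            work.accept(item);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > thresholdMillis) {
                slow.add(item);
            }
        }
        return slow;
    }
}
```

Logging the slow inputs themselves (not just the times) is the key design choice: the pathological strings are exactly the minimal test cases needed to debug the regex.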

@haijieg

haijieg commented Oct 23, 2012

Cool. I'm going to test it on the daily.10k tonight.


@haijieg

haijieg commented Oct 23, 2012

It passed the test of daily.10k with 14,720,000 tweets. Will try on a larger dataset tomorrow.

Jay


@brendano
Owner

OK, from a bigger sample of 143 million tweets ("daily 100k" here), I'm seeing a similar distribution in the lower <2 ms range.

There are 4146 tweets that took more than 10 ms, and 96 that took longer than 1000 ms (max 143 seconds); those last ones should be investigated at some point.
