Searcher treats ^ as literal #10

Open
OrangeDog opened this issue Apr 18, 2016 · 7 comments

@OrangeDog

Because a searcher is constructed by prefixing the pattern with .*, a ^ at the start of any pattern is treated as a literal character instead of a start-of-line anchor.
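
To make the two readings of ^ concrete, here is a small sketch using java.util.regex rather than multiregexp, so it only illustrates the expected versus the reported interpretation and not the library's actual code path. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustration with java.util.regex only; this is not the multiregexp API.
final class CaretDemo {
    // '^' read as an anchor: "foo" must sit at the very start of the text.
    static boolean anchoredFind(String text) {
        return Pattern.compile("^foo").matcher(text).find();
    }

    // '^' read as a literal (escaped here), after a ".*" prefix:
    // the text must actually contain the characters "^foo".
    static boolean literalCaretFind(String text) {
        return Pattern.compile(".*\\^foo").matcher(text).find();
    }
}
```

With the anchor reading, "foo bar" is found and "a foo" is not. With the literal reading, neither is found, and only a text that really contains "^foo" matches.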

@neilireson

I've added a fix to my fork (https://github.com/neilireson/multiregexp): if the pattern starts with one of the specified exceptions (i.e. ".*" or "^"), the prefix is not added.
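
The rule described above could be sketched roughly as follows. This is not the code from the fork, just a minimal guess at its shape; the class name, the method name and the EXCEPTIONS array are all hypothetical.

```java
// Hypothetical helper, not code from the fork: it only mirrors the rule described above.
final class SearcherPrefix {
    // Prefixes assumed to mean "do not prepend .*", taken from the comment.
    static final String[] EXCEPTIONS = {".*", "^"};

    static String toSearcherPattern(String pattern) {
        for (String exception : EXCEPTIONS) {
            if (pattern.startsWith(exception)) {
                return pattern; // leave the pattern untouched
            }
        }
        return ".*" + pattern; // default: allow a match anywhere in the text
    }
}
```

Note that skipping the prefix only helps if the underlying automaton then treats the leading ^ as an anchor, which this sketch does not address.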

However, the fork also contains a raft of other changes. These are mainly optimisations, as I'm trying to get multiregexp to work with 20,000+ patterns. The base functionality is (or should be) the same, as all the previous methods should default to the previous behaviour. The only exception is that I'm using a multithreaded make to build the MultiPatternAutomaton.

@fulmicoton
Owner

@neilireson Interesting.

When I had to deal with many patterns, I just grouped them in packs of 50 or so. 20,000+ sounds like a gigantic DFA after the powerset operation! Is it working alright? Also, if some of your patterns are really just plain strings rather than patterns, it might be interesting to treat them separately with an implementation of Aho-Corasick.
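
Both ideas can be sketched without touching the library: splitting the pattern list into packs, and picking out the patterns that are really plain strings. The names below are hypothetical, and the set of characters counted as regex syntax is an assumption that may not match the syntax multiregexp actually accepts.

```java
import java.util.ArrayList;
import java.util.List;

final class PatternBatches {
    // Characters treated as regex syntax here. This is an assumption;
    // the library's own syntax may reserve more characters.
    private static final String META = "\\.[]{}()*+?^$|";

    // Splits the patterns into consecutive groups of at most batchSize.
    static List<List<String>> batches(List<String> patterns, int batchSize) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < patterns.size(); i += batchSize) {
            int end = Math.min(i + batchSize, patterns.size());
            result.add(new ArrayList<>(patterns.subList(i, end)));
        }
        return result;
    }

    // True if the pattern contains no character from META, i.e. it is just a string.
    static boolean isPlainString(String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            if (META.indexOf(pattern.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

Each pack would then get its own automaton, and the plain strings could go to a separate string matcher.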

@fulmicoton
Owner

@OrangeDog Oh yes, this is a valid point. If you guys have working code for this, I welcome a pull request.

@fulmicoton added the bug label Apr 19, 2016

@neilireson

Firstly, thanks very much for providing this code, it's very cool.

OK, I've added all my current commits to the pull request. To be honest, I've been using SVN for years, so I'm new to the Git world. Let me know if I need to do anything else.

@neilireson

neilireson commented Apr 19, 2016

I could use Aho-Corasick. Do you think it would be faster?

Multiregexp offers some advantages. One use case is person names, where I use the patterns " Smith ", " John Smith ", " David Smith ", ... I then have a disambiguation process where everyone would match "Smith", but "David Smith" only matches the Davids. I also have generic patterns such as " .* Smith", which enable me to check for names outside my dictionary (e.g. "Fred Smith").
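
As a rough illustration of which of these patterns fire on a given text, here is a sketch that checks them one at a time with java.util.regex. That per-pattern loop is exactly what multiregexp is meant to avoid, so this only shows the expected match sets, not how the library computes them. The class name and the choice to pad the text with spaces are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class NameMatches {
    // The example patterns from the comment above, in order.
    static final String[] PATTERNS = {" Smith ", " John Smith ", " David Smith ", " .* Smith"};

    // Returns the indices of all patterns found somewhere in the text.
    static List<Integer> matching(String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < PATTERNS.length; i++) {
            if (Pattern.compile(PATTERNS[i]).matcher(text).find()) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

For " David Smith " this returns indices 0, 2 and 3, while " Fred Smith " only hits the bare surname (0) and the generic pattern (3), which is the signal that the name is outside the dictionary.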

If Aho-Corasick would be faster, I could use a combination of the two approaches.

I also use it for words with multiple suffixes, e.g. "word[a-z]* ", but I could probably just enumerate all the possibilities.

@OrangeDog
Author

OrangeDog commented Apr 19, 2016

TBH I'm just using java.util.regex.Pattern now, as there are fewer surprises.

Pattern combined = Pattern.compile(patterns.stream()
    .map(s -> "(?:" + s + ")")
    .collect(Collectors.joining("|")));
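
For anyone who wants to try that approach, a self-contained version might look like the sketch below. The class and method names are made up; the body is the same alternation as above.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class CombinedPattern {
    // Wraps each pattern in a non-capturing group so that '|' cannot
    // split an individual pattern, then joins them into one alternation.
    static Pattern combine(List<String> patterns) {
        return Pattern.compile(patterns.stream()
            .map(s -> "(?:" + s + ")")
            .collect(Collectors.joining("|")));
    }
}
```

With the patterns "^foo" and "bar", find() succeeds on "foo x" and "a bar" but not on "a foo", so ^ keeps its anchor meaning here. The trade-off is that this only says whether some pattern matched, not which one.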

@fulmicoton
Owner

@OrangeDog thanks for reporting the issue anyway :) This is very helpful.
