Searcher treats ^ as literal #10
I've added a fix to my fork (https://github.com/neilireson/multiregexp): if the pattern starts with one of the specified exceptions (i.e. ".*" or "^"), the prefix is not added. However, the fork also contains a raft of other changes. These are mainly optimisations, as I'm trying to get multiregexp to work with 20,000+ patterns. The base functionality is (or should be) the same, as all the existing methods should default to their previous behaviour. The only exception is that I'm using a multithreaded make to build the MultiPatternAutomaton.
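A minimal sketch of the prefix rule described above, for illustration only. The class and method names (`SearchPrefixSketch`, `toSearchPattern`) are hypothetical and are not taken from multiregexp or the fork; the exception list simply mirrors the two prefixes mentioned in the comment.

```java
// Hypothetical helper: decides whether the ".*" search prefix is added.
// Not the fork's actual code; only the rule stated in the comment above.
final class SearchPrefixSketch {
    // Prefixes that already state where a match may begin (assumed list).
    private static final String[] EXCEPTIONS = {".*", "^"};

    static String toSearchPattern(String pattern) {
        for (String exception : EXCEPTIONS) {
            if (pattern.startsWith(exception)) {
                return pattern; // leave the pattern as written, no prefix
            }
        }
        return ".*" + pattern; // default: allow a match anywhere in the input
    }

    public static void main(String[] args) {
        System.out.println(toSearchPattern("abc"));
        System.out.println(toSearchPattern("^abc"));
        System.out.println(toSearchPattern(".*abc"));
    }
}
```

This only covers the prefixing step. Whether the underlying automaton library then interprets a leading "^" as an anchor is a separate question that the sketch does not address.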
@neilireson Interesting. When I had to deal with many patterns, I just grouped them in packs of 50 or so. 20,000+ sounds like a gigantic DFA after the powerset operation! Is it working all right? Also, if you have patterns that are really just strings and not patterns, it might be interesting to treat them separately with an implementation of Aho-Corasick.
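The grouping idea could look roughly like the sketch below, which splits a pattern list into packs of a fixed size so that each pack can be compiled into its own, smaller automaton. `PatternBatchSketch` and `batches` are hypothetical names; how each pack is then compiled is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits patterns into packs of at most batchSize.
final class PatternBatchSketch {
    static List<List<String>> batches(List<String> patterns, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < patterns.size(); start += batchSize) {
            int end = Math.min(start + batchSize, patterns.size());
            // Copy the slice so each pack is independent of the source list.
            result.add(new ArrayList<>(patterns.subList(start, end)));
        }
        return result;
    }
}
```

With 120 patterns and a pack size of 50, this yields three packs of 50, 50 and 20 patterns. The trade-off is that the input has to be scanned once per pack.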
@OrangeDog Oh yes, this is a valid point. If you guys have working code for this, I welcome a pull request.
Firstly, thanks very much for providing this code, it's very cool. OK, I've added all my current commits to the pull request. To be honest, I've been using SVN for years, so I'm new to the Git world. Let me know if I need to do anything else.
I could use Aho-Corasick, do you think it would be faster? Multiregexp offers some advantages. One use case is person names, where I use the patterns " Smith ", " John Smith ", " David Smith ", ... I then have a disambiguation process where everyone would match "Smith", but "David Smith" only matches the Davids. I also have generic patterns such as " .* Smith", which enable me to check for names outside my dictionary (e.g. "Fred Smith"). If Aho-Corasick would be faster, I could use a combination of the two approaches. I also use it for words with multiple suffixes, e.g. "word[a-z]* ", but I could probably just enumerate all the possibilities.
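To make the name use case concrete, here is a small sketch that reports which patterns match a given text. It uses plain `java.util.regex`, one pattern at a time, rather than multiregexp, so it only illustrates the matching behaviour, not the performance. `NameMatchSketch` and `matchingIndexes` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: returns the indexes of the patterns found in the text.
final class NameMatchSketch {
    static List<Integer> matchingIndexes(List<String> patterns, String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < patterns.size(); i++) {
            // find() searches anywhere in the text, like a prefixed searcher.
            if (Pattern.compile(patterns.get(i)).matcher(text).find()) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

Given the patterns " Smith ", " John Smith ", " David Smith " and " .* Smith", the text " David Smith " hits the first, third and fourth patterns, while " Fred Smith " hits only the first and the generic fourth one. That difference is what a disambiguation step could work from.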
TBH I'm just using:
Pattern.compile(patterns.stream()
    .map(s -> "(?:" + s + ")")
    .collect(Collectors.joining("|")))
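For reference, the same alternation approach as a self-contained sketch, wrapped in a hypothetical `AlternationSketch.combine` method. Because it relies on `java.util.regex`, a leading "^" inside one of the alternatives keeps its meaning as a start anchor when `find()` is used.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical wrapper around the one-liner quoted above.
final class AlternationSketch {
    static Pattern combine(List<String> patterns) {
        // Wrap each pattern in a non-capturing group and join with "|".
        return Pattern.compile(patterns.stream()
                .map(s -> "(?:" + s + ")")
                .collect(Collectors.joining("|")));
    }
}
```

With the patterns "^abc" and "xyz", `find()` succeeds on "abcdef" and on "..xyz", but not on "zabc", since "abc" is not at the start there. Unlike multiregexp, this does not report which of the original patterns matched.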
@OrangeDog Thanks for reporting the issue anyway :) This is very helpful.
Because a searcher is constructed by prefixing with .*, any patterns starting with ^ have that treated as a literal instead of a start-of-line anchor.