Experimental Improved Search Algorithm #524
{"the group for the preservation of the holy sites", "the group", 0.416},
{precompute("the group for the preservation of the holy sites"), precompute("the group"), 0.416},
{"group preservation holy sites", "group", 0.460},
{"the group for the preservation of the holy sites", "the group", 0.880},
This is a good example of results changing because of the preference for matching all of the search name over matching all of the indexed name. If this is undesirable for a user, they can increase UNMATCHED_INDEX_TOKEN_WEIGHT. It's currently set very low.
Yea I agree. Adding that knob will help some use-cases to lower these types of scores.
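To make the trade-off in this thread concrete, here is a minimal sketch of how a weight on unmatched indexed-name tokens could pull a score down. It is not the code from this PR: `scoreWithUnmatchedPenalty` and its formula are hypothetical, and tokens are compared by exact equality rather than by Jaro-Winkler similarity. Only the idea of a tunable weight, as with `UNMATCHED_INDEX_TOKEN_WEIGHT`, comes from the discussion above.

```go
package main

import (
	"fmt"
	"strings"
)

// scoreWithUnmatchedPenalty is a hypothetical illustration, not the PR's algorithm.
// Every search token counts fully in the denominator, while each indexed token
// left unmatched only counts with the given weight.
func scoreWithUnmatchedPenalty(search, indexed string, unmatchedIndexTokenWeight float64) float64 {
	searchTokens := strings.Fields(strings.ToLower(search))
	indexTokens := strings.Fields(strings.ToLower(indexed))
	if len(searchTokens) == 0 || len(indexTokens) == 0 {
		return 0
	}

	// Greedily pair each search token with one unused, identical indexed token.
	used := make([]bool, len(indexTokens))
	matched := 0
	for _, st := range searchTokens {
		for i, it := range indexTokens {
			if !used[i] && st == it {
				used[i] = true
				matched++
				break
			}
		}
	}

	unmatchedIndex := len(indexTokens) - matched
	denom := float64(len(searchTokens)) + unmatchedIndexTokenWeight*float64(unmatchedIndex)
	return float64(matched) / denom
}

func main() {
	search := "the group"
	indexed := "the group for the preservation of the holy sites"
	// The weights below are arbitrary values chosen for illustration.
	for _, w := range []float64{0, 0.15, 1.0} {
		fmt.Printf("weight=%.2f score=%.3f\n", w, scoreWithUnmatchedPenalty(search, indexed, w))
	}
}
```

With a weight of zero the short search name matches the long indexed name perfectly, and raising the weight lowers the score, which is the knob a user would turn if scores like this one are too high for their use case.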
//TODO should use a phonetic comparison here, like Soundex
score = score * differentLetterPenaltyWeight
Agreed. I had been thinking we should detect the same. Soundex is fairly focused on English words though, so it may need to be adapted for international -> English translations and names.
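For reference, a minimal sketch of standard American Soundex in Go follows. The `soundex` function is an illustration and is not part of this PR. How it would feed into the `differentLetterPenaltyWeight` check is left open, and as noted above the letter groupings assume English pronunciation.

```go
package main

import (
	"fmt"
	"strings"
)

// soundexCodes maps consonants to their American Soundex digit.
// Vowels, H, W and Y are absent and therefore look up as 0.
var soundexCodes = map[rune]byte{
	'B': '1', 'F': '1', 'P': '1', 'V': '1',
	'C': '2', 'G': '2', 'J': '2', 'K': '2', 'Q': '2', 'S': '2', 'X': '2', 'Z': '2',
	'D': '3', 'T': '3',
	'L': '4',
	'M': '5', 'N': '5',
	'R': '6',
}

// soundex returns the four-character American Soundex code for name.
// Characters outside A-Z are ignored; an input with no letters yields "0000".
func soundex(name string) string {
	out := make([]byte, 0, 4)
	var prev byte // digit of the previous letter; 0 after a vowel or Y
	for _, r := range strings.ToUpper(name) {
		if r < 'A' || r > 'Z' {
			continue
		}
		code := soundexCodes[r]
		if len(out) == 0 {
			// The first letter is kept as-is, but its digit still suppresses
			// an immediately following letter from the same group.
			out = append(out, byte(r))
			prev = code
			continue
		}
		if r == 'H' || r == 'W' {
			continue // H and W do not separate letters with the same digit
		}
		if code != 0 && code != prev {
			out = append(out, code)
			if len(out) == 4 {
				break
			}
		}
		prev = code
	}
	for len(out) < 4 {
		out = append(out, '0') // pad short codes with zeros
	}
	return string(out)
}

func main() {
	for _, n := range []string{"Robert", "Rupert", "Ashcraft", "Tymczak"} {
		fmt.Println(n, soundex(n))
	}
}
```

Because the first letter is kept verbatim, two names that differ only in their first letter still get different codes, so a first-letter penalty built on Soundex would likely need to compare the digit portion separately.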
Great info. Thanks Adam, I'll have a read
This is excellent @tomdaffurn! Thank you for the contribution. I had been thinking about how to implement a couple of these improvements, but your solution is excellent. From the results I've seen this could be merged and replace the existing algorithm. We've made similar releases in the past.
Thanks for the review and tick Adam! You've got a great tool here, and it's fun to work on. There were some linting errors in my code, so I've fixed those and added to README.md
Codecov Report
Coverage Diff

| | master | #524 | +/- |
|---|---|---|---|
| Coverage | 8.18% | 9.81% | +1.63% |
| Files | 44 | 38 | -6 |
| Lines | 3531 | 2811 | -720 |
| Hits | 289 | 276 | -13 |
| Misses | 3219 | 2511 | -708 |
| Partials | 23 | 24 | +1 |
This is a re-write of the `jaroWinkler` function with the goal of improving the scoring performance. The new algorithm changes several things:

The resulting search behaviour has significantly better true positive rate AND false positive rate. Examples of this are shown in `cmd/server/new_algorithm_test.go`.

I've done testing with 2000 real customer names, and with 50 sanctioned names. The aggregated results are below. I can share the 50 sanctioned names data, but the 2000 customer names are too sensitive to share.
I haven't fixed all of the tests and written enough new tests, but I'm happy to do so if you like this change.