-
-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Other string matching algorithms? #13
Comments
My first goal with this library was to reimplement the functionality of FuzzyWuzzy since I needed it for rhasspy, so I only implemented the algorithms relevant for this. Now that this is done I am open to suggestions on other cool string matching algorithms that could be added in future versions. (I always love seeing Pull Requests adding them aswell ^^) |
Related to #8 and this issue, I am working on https://github.com/logannc/fuzzywuzzy-rs, if you're looking to collaborate. (please forgive the mess, I ported it years ago without much understanding and now am trying to clean it up) Relevant for this issue, something I'm looking at is whether certain functions are living up to their purpose and, if not, whether an algorithm change is appropriate. Specifically, For example, take strings that are the (human readable) mirror of each other:
My expectation of the optimal local alignment (because that seems to implicitly be the unmentioned kind of alignment we want) for strings like this whose opposite ends are strongly aligned should be just that end piece. That is, for my example strings, the optimal local alignment would just be It's unclear to me what algorithm rapidfuzz and fuzzywuzzy are actually using to align strings, so I'd love to be enlightened on that point. Let me know if you'd like to take this conversation elsewhere (a different issue, email, discord, IRC, etc). |
fuzz.partial_ratio uses difflib.get_matching_blocks to find the optimal alignment between two strings and then calculates a ratio on this. However it is only allowed to shift the shorter one to the right side, so when you have two strings of equal length as you do it would either find the alignment
or
depending on the way you pass it the strings. Afterwards it will take a part of the long string starting where they align, that has the length of the shorter string and match them. When there are not enough characters left in the longer string it will simply stop when there are no more characters. So in your example it would match either:
or
So in this scenario it does perfom really bad and it even matters which way around you pass it the strings. Generally it is ment to be used when one string is a lot smaller than the other string (e.g. in fuzz.WRatio it is used when one string is 1.5 times longer than the other string). So while it might not be perfect I guess it is good enough for this use case. I personally only implemented it this way, since people already use fuzzywuzzy and so I want to give them a similar behaviour (I would however be absolutely fine with having another algorithm that actually optimally aligns them ;) ) Btw for strings that are e.g. mirrored like yours theres fuzz.token_sort_ratio |
This is a wonderful clarification, thank you! Yea, I'm not particularly satisfied with that behaviour. I may implement a new version using Smith-Waterman and have a legacy version of |
Yes I think Smith-Waterman would be much better suited for this purpose. |
Jaro-Winkler similarity and Jaro similarity got added in one of the recent releases. The other requested distance will be a added and are tracked in their own issues:
|
Have you considered other algorithms such as Smith-Waterman, Jaro-Winkler, or even the new/improved Levenshtein-Damerau?
The text was updated successfully, but these errors were encountered: