-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Repeating bigrams handled incorrectly #8
Comments
13xforever
added a commit
to 13xforever/FuzzyStrings
that referenced
this issue
Mar 2, 2021
I've been using it for almost a year now, and in my benchmarks it performs much better than original implementation (at least for shorter strings, I'm limited to <2048 characters). And as pointed out in tylerjensen#8, current implementation may produce incorrect score for repeating bigrams. The downside is that this is O(n*m) implementation and will perform much worse on long strings. Fixes tylerjensen#8
13xforever
added a commit
to 13xforever/FuzzyStrings
that referenced
this issue
Mar 2, 2021
Removes input string padding as it introduces bias for the first and last character matches. As a consequence, comparing with one-letter string will return score of 0 as it has zero bigrams now. Fixes tylerjensen#8
13xforever
added a commit
to 13xforever/FuzzyStrings
that referenced
this issue
Mar 2, 2021
Removes input string padding as it introduces bias for the first and last character matches. As a consequence, comparing with one-letter string will return score of 0 as it has zero bigrams now. Fixes tylerjensen#8
13xforever
added a commit
to 13xforever/FuzzyStrings
that referenced
this issue
Mar 2, 2021
Removes input string padding as it introduces bias for the first and last character matches. As a consequence, comparing with one-letter string will return score of 0 as it has zero bigrams now. Fixes tylerjensen#8
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This can be demonstrated by the below code snippet:
The test fails. The reason is Intersect returns the distinct overlapping bigrams. But you then divide by the number of bigrams in the two strings including duplicates, which is incorrect.
https://github.com/tylerjensen/FuzzyStrings/blob/master/src/DuoVia.FuzzyStrings/DuoVia.FuzzyStrings/DiceCoefficientExtensions.cs#L36
The text was updated successfully, but these errors were encountered: