Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repeating bigrams handled incorrectly #8

Open
stagha opened this issue Oct 15, 2020 · 0 comments · May be fixed by #9
Open

Repeating bigrams handled incorrectly #8

stagha opened this issue Oct 15, 2020 · 0 comments · May be fixed by #9

Comments

@stagha
Copy link

stagha commented Oct 15, 2020

This can be demonstrated by the below code snippet:

var testString = "aaaaaaaaa";

var diceCoefficient = testString.DiceCoefficient(testString);

Assert.Equal(1, diceCoefficient);

The test fails. The reason is Intersect returns the distinct overlapping bigrams. But you then divide by the number of bigrams in the two strings including duplicates, which is incorrect.

https://github.com/tylerjensen/FuzzyStrings/blob/master/src/DuoVia.FuzzyStrings/DuoVia.FuzzyStrings/DiceCoefficientExtensions.cs#L36

13xforever added a commit to 13xforever/FuzzyStrings that referenced this issue Mar 2, 2021
I've been using it for almost a year now, and in my benchmarks it performs much better than original implementation (at least for shorter strings, I'm limited to <2048 characters). And as pointed out in tylerjensen#8, current implementation may produce incorrect score for repeating bigrams.

The downside is that this is O(n*m) implementation and will perform much worse on long strings.

Fixes tylerjensen#8
13xforever added a commit to 13xforever/FuzzyStrings that referenced this issue Mar 2, 2021
Removes input string padding as it introduces bias for the first and last character matches.
As a consequence, comparing with one-letter string will return score of 0 as it has zero bigrams now.

Fixes tylerjensen#8
13xforever added a commit to 13xforever/FuzzyStrings that referenced this issue Mar 2, 2021
Removes input string padding as it introduces bias for the first and last character matches.
As a consequence, comparing with one-letter string will return score of 0 as it has zero bigrams now.

Fixes tylerjensen#8
@13xforever 13xforever linked a pull request Mar 2, 2021 that will close this issue
13xforever added a commit to 13xforever/FuzzyStrings that referenced this issue Mar 2, 2021
Removes input string padding as it introduces bias for the first and last character matches.
As a consequence, comparing with one-letter string will return score of 0 as it has zero bigrams now.

Fixes tylerjensen#8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant