Unexpected result of partial_ratio #173

Closed
hongduosun opened this issue Jan 1, 2022 · 3 comments
Labels
question Further information is requested

Comments

@hongduosun

Hi, when calculating the partial_ratio between abcd and abd, it gives:

> partial_ratio('abcd','abd')
80.0

According to my understanding, the optimal alignment would be:

> ratio('abcd','abd')
85.71428571428571

and 80.0 is probably from:

> ratio('ab','abd')
80.0

Did I misunderstand this? I would like to check the docs but it fails to load the contents under these functions.
And is it possible to calculate the partial_ratio with equal weights for indels and substitutions?

Thanks!

@maxbachmann
Member

> I would like to check the docs but it fails to load the contents under these functions.

Thanks for mentioning it. It appears the CI job currently generates the docs in a broken way for some reason. I manually fixed the docs and opened #174 to track the issue.

@maxbachmann maxbachmann added the question Further information is requested label Jan 2, 2022
@maxbachmann
Member

maxbachmann commented Jan 2, 2022

fuzz.partial_ratio is calculated using a sliding window of length min(len(s1), len(s2)) over the longer sequence, calculating the fuzz.ratio of each alignment. Windows at either end of the longer sequence are truncated, so they can be shorter than the window length. For the two sequences "abcd" and "abd" it calculates the fuzz.ratio for the following alignments:

| Alignment | Similarity |
| --- | --- |
| abd <-> a | 50 |
| abd <-> ab | 80 |
| abd <-> abc | 66.67 |
| abd <-> bcd | 66.67 |
| abd <-> cd | 40 |
| abd <-> d | 50 |
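
The alignments listed above can be reproduced with a short pure-Python sketch. It does not call RapidFuzz: `lcs_length`, `indel_ratio`, `sliding_windows`, and `partial_ratio_sketch` are hypothetical helper names, and the ratio is assumed to be 100 * 2 * LCS / (len(a) + len(b)), which agrees with the numbers quoted in this thread. The real implementation is optimized and may differ in detail.

```python
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Assumed formula: 100 * 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


def sliding_windows(longer: str, width: int) -> List[str]:
    """Windows of at most `width` characters, truncated at both edges."""
    n = len(longer)
    windows = []
    for end in range(1, n + width):
        start = max(0, end - width)
        windows.append(longer[start:min(end, n)])
    return windows


def partial_ratio_sketch(s1: str, s2: str) -> float:
    """Best indel_ratio of the shorter string against each window of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        # Illustration only: empty inputs are outside the scope of this sketch.
        raise ValueError("sketch assumes non-empty inputs")
    return max(indel_ratio(shorter, w) for w in sliding_windows(longer, len(shorter)))


if __name__ == "__main__":
    for window in sliding_windows("abcd", 3):
        print(f"abd <-> {window}: {indel_ratio('abd', window):.2f}")
    print("partial:", partial_ratio_sketch("abcd", "abd"))
    print("full:   ", indel_ratio("abcd", "abd"))
```

The best window is "ab", so the sketch returns 80.0, while the ratio over the two full strings is about 85.71.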

For this reason, fuzz.partial_ratio is not guaranteed to be at least as high as the fuzz.ratio between the two full sequences. fuzz.partial_ratio always uses fuzz.ratio, which is based on the Indel distance (insertions and deletions only, which is equivalent to a Levenshtein distance with a weight of 2 for substitutions), and there is currently no option to select a different metric.
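
To make the difference between the two weightings concrete, here is a minimal sketch of a weighted Levenshtein distance in plain Python. The names `weighted_levenshtein`, `indel_distance`, and `indel_similarity` are hypothetical and are not RapidFuzz functions; the normalization by len(a) + len(b) is an assumption chosen because it reproduces the similarities quoted in this thread.

```python
def weighted_levenshtein(a: str, b: str, insert_cost: int = 1,
                         delete_cost: int = 1, substitute_cost: int = 1) -> int:
    """Edit distance with configurable costs (standard dynamic programming)."""
    prev = [j * insert_cost for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        curr = [i * delete_cost]
        for j, ch_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ch_a == ch_b else substitute_cost)
            curr.append(min(prev[j] + delete_cost,      # delete ch_a
                            curr[j - 1] + insert_cost,  # insert ch_b
                            substitute))                # match or substitute
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Substitution weight 2: a substitution costs as much as delete + insert."""
    return weighted_levenshtein(a, b, substitute_cost=2)


def indel_similarity(a: str, b: str) -> float:
    """Assumed normalization: 100 * (1 - distance / (len(a) + len(b)))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * (total - indel_distance(a, b)) / total


if __name__ == "__main__":
    print(weighted_levenshtein("abd", "abc"))  # uniform weights
    print(indel_distance("abd", "abc"))        # substitution weight 2
    print(indel_similarity("abd", "abc"))
```

For "abd" and "abc" the uniform-weight distance is 1 (one substitution), while the Indel distance is 2 (delete "d", insert "c"), which gives the 66.67 similarity for that alignment.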

From your question, I think you're searching for the Smith-Waterman algorithm: https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm. This metric is not supported in RapidFuzz yet, but I plan to add it in v2.0.0 (#175).

@hongduosun
Author

Thanks for the explanation, Smith Waterman is definitely what I want. 👍
