Issue with partial_ratio_alignment #323

Closed
laphang opened this issue Apr 27, 2023 · 3 comments
Labels
question Further information is requested


laphang commented Apr 27, 2023

In my example below, partial_ratio_alignment seems to cut the match short in the 2nd string. I was expecting it to include the additional "et."

Code:
from rapidfuzz import fuzz

query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

# Align the shorter query against the longer string; scores below 90 are rejected
match = fuzz.partial_ratio_alignment(query_string, contains_string, score_cutoff=90)

print(match)
print(query_string[match.src_start:match.src_end], contains_string[match.dest_start:match.dest_end])

Output:
ScoreAlignment(score=94.91525423728814, src_start=0, src_end=59, dest_start=0, dest_end=59)
("Business's say they got nothing out of last night's budget.",  "Business's say they've got nothing out of last night's budg")

maxbachmann added the question label Apr 27, 2023

maxbachmann (Member) commented

partial_ratio uses a sliding-window approach to find the optimal alignment of the shorter string within the longer string. As a result, it will not find an alignment in which the subsequence of the longer string is longer than the shorter string. The subsequence is either exactly as long as the shorter string or, if it starts at the start or ends at the end of the longer string, it can be shorter. In your example the window is capped at 59 characters, so the extra "'ve" pushes the trailing "et." out of the match.
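
To make the window constraint concrete, here is a minimal sketch of a sliding-window search. It is not rapidfuzz's actual implementation: the function name `sliding_window_alignment` is made up for illustration, and `difflib.SequenceMatcher` stands in for the real scoring function. It only shows why the matched span in the longer string can never be wider than the shorter string.

```python
from difflib import SequenceMatcher


def sliding_window_alignment(short, long):
    """Illustrative only: score every window of `long` that is at most len(short) wide.

    Assumes len(short) <= len(long). Returns (score, start, end) for the best window.
    """
    m = len(short)
    # Full-width windows, plus narrower windows that touch the start or end of `long`
    windows = [(i, i + m) for i in range(len(long) - m + 1)]
    windows += [(0, j) for j in range(1, m)]
    windows += [(len(long) - j, len(long)) for j in range(1, m)]

    best = (0.0, 0, 0)
    for start, end in windows:
        # SequenceMatcher is a stand-in scorer, not what rapidfuzz uses internally
        score = 100 * SequenceMatcher(None, short, long[start:end], autojunk=False).ratio()
        if score > best[0]:
            best = (score, start, end)
    return best


query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

score, start, end = sliding_window_alignment(query_string, contains_string)
print(score, repr(contains_string[start:end]))
```

Because no candidate window is wider than the query, the 62-character span ending in "budget." is never even considered.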

The metric you are searching for is Smith-Waterman, which is not implemented in rapidfuzz yet: #175
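
For readers unfamiliar with it, Smith-Waterman is a local alignment algorithm that allows gaps, so the matched span in the longer string may be wider than the query. The following is a rough pure-Python sketch, not code from rapidfuzz or parasail. The function name `smith_waterman` and the scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, a_start, a_end, b_start, b_end) of the best local alignment.

    Illustrative scoring: linear gap penalty, character equality as the substitution score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; cell scores never drop below zero
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, best_pos[0], j, best_pos[1]


query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

score, a_start, a_end, b_start, b_end = smith_waterman(query_string, contains_string)
print(query_string[a_start:a_end])
print(contains_string[b_start:b_end])
```

With these scores the three inserted characters "'ve" cost less than dropping the trailing "et.", so the aligned span in the longer string runs through "budget.".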


laphang commented Apr 27, 2023

Thanks for the fast response, and also for pointing out the parasail package in the issue you linked. That seems interesting.


laphang commented Apr 28, 2023

FWIW, I had pretty good results with parasail. Here's an example:

import parasail

query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

# Striped Smith-Waterman with gap open 10, gap extend 1 and the BLOSUM50 matrix
result = parasail.ssw(query_string, contains_string, 10, 1, parasail.blosum50)

# The end indices are inclusive, hence the +1
print(query_string[result.read_begin1:result.read_end1 + 1])
print(contains_string[result.ref_begin1:result.ref_end1 + 1])

output:
Business's say they got nothing out of last night's budget.
Business's say they've got nothing out of last night's budget.

I later used rapidfuzz again for distance / score calculations.

laphang closed this as completed Apr 28, 2023