Issue with partial_ratio_alignment #323

laphang · 2023-04-27T09:38:57Z

In my example below, partial_ratio_alignment seems to cut the match short in the second string. I was expecting it to include the additional "et."

Code:
from rapidfuzz import fuzz

query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

match = fuzz.partial_ratio_alignment(query_string, contains_string, score_cutoff=90)

print(match)
print(query_string[match.src_start:match.src_end], contains_string[match.dest_start:match.dest_end])

Output:
ScoreAlignment(score=94.91525423728814, src_start=0, src_end=59, dest_start=0, dest_end=59)
("Business's say they got nothing out of last night's budget.",  "Business's say they've got nothing out of last night's budg")


maxbachmann · 2023-04-27T12:16:14Z

partial_ratio uses a sliding window approach to find the optimal alignment of the shorter string within the longer string. So it will not find an alignment where the subsequence in the longer string is longer than the shorter string. The subsequence is either as long as the shorter string or, if it starts or ends at the start or end of the longer string, shorter.
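
To make the window limit concrete, here is a rough sketch of the sliding window idea. It is not rapidfuzz's actual implementation: the helper name `sliding_window_alignment` is made up for illustration, and `difflib.SequenceMatcher` stands in for rapidfuzz's ratio, so the scores will differ from what `partial_ratio_alignment` reports.

```python
from difflib import SequenceMatcher


def sliding_window_alignment(short, long_):
    """Return (score, dest_start, dest_end) for the best-scoring window.

    Hypothetical helper for illustration only. Every window is at most
    len(short) characters wide, so the matched span in the longer string
    can never be longer than the shorter string.
    """
    n = len(short)
    best = (-1.0, 0, 0)
    # Start before index 0 and run to the end of the longer string so the
    # truncated (shorter) windows at both edges are considered as well.
    for start in range(1 - n, len(long_)):
        lo, hi = max(start, 0), min(start + n, len(long_))
        matcher = SequenceMatcher(None, short, long_[lo:hi], autojunk=False)
        score = 100 * matcher.ratio()
        if score > best[0]:
            best = (score, lo, hi)
    return best


query = "Business's say they got nothing out of last night's budget."
text = ("Business's say they've got nothing out of last night's budget. "
        "It's really hard out there!")

score, dest_start, dest_end = sliding_window_alignment(query, text)
print(round(score, 2), repr(text[dest_start:dest_end]))
```

Whichever window wins, `dest_end - dest_start` is capped at `len(query)`, so a window cannot grow to absorb the extra "'ve" and still reach the end of "budget."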

The metric you are looking for is Smith-Waterman, which is not implemented in rapidfuzz yet: #175
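
For reference, a minimal pure-Python sketch of Smith-Waterman local alignment follows. The scoring values (match=2, mismatch=-1, gap=-1) and the linear gap penalty are assumptions chosen for illustration. Libraries such as parasail use substitution matrices and affine gap penalties and are far faster.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment with a linear gap penalty (illustrative scoring).

    Returns (score, a_start, a_end, b_start, b_end) with end-exclusive spans.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best-scoring cell until the score drops to zero.
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, best_pos[0], j, best_pos[1]


query = "Business's say they got nothing out of last night's budget."
text = ("Business's say they've got nothing out of last night's budget. "
        "It's really hard out there!")

score, a_start, a_end, b_start, b_end = smith_waterman(query, text)
print(query[a_start:a_end])
print(text[b_start:b_end])
```

Because gaps are allowed inside the alignment, the span in the longer string can be longer than the query. Here it covers the extra "'ve" and still runs through "budget."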

laphang · 2023-04-27T23:21:53Z

Thanks for the fast response, and also for pointing out the parasail package in the issue you linked. It looks interesting.

laphang · 2023-04-28T07:07:35Z

FWIW, I had pretty good results with parasail. Here's an example:

import parasail

query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

# Arguments: gap open penalty 10, gap extend penalty 1, blosum50 substitution matrix
result = parasail.ssw(query_string, contains_string, 10, 1, parasail.blosum50)

print(query_string[result.read_begin1:result.read_end1+1])
print(contains_string[result.ref_begin1:result.ref_end1+1])

Output:
Business's say they got nothing out of last night's budget.
Business's say they've got nothing out of last night's budget.

I later used rapidfuzz again for distance / score calculations.

maxbachmann added the question label Apr 27, 2023

laphang closed this as completed Apr 28, 2023
