
Algorithm used for partial_ratio_impl #113

Closed
niansong1996 opened this issue Feb 25, 2024 · 4 comments

@niansong1996

Hi,

I'm wondering what algorithm is used for partial_ratio_impl (link here)? Are there any paper references I can follow?

I am trying to implement a CUDA-accelerated version of edit distance computation, and it would definitely benefit a lot from the implementations here.

Thanks!

@maxbachmann
Member

Partial Ratio

The basic algorithm searches for an optimal alignment of the shorter sequence in the longer sequence. The compared subsequence has to be either as long as the shorter string, or placed at the start/end of the longer sequence. So e.g. for two sequences "ab" and "abcd" it would compare the following alignments:

ab <-> a
ab <-> ab
ab <-> bc
ab <-> cd
ab <-> d

So a basic sliding window algorithm. The similarity between the two sequences is calculated in terms of the normalized Indel similarity. This basic implementation is based on https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/. However, their initial implementation did not guarantee finding the optimal alignment.
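The scheme above can be sketched in plain Python. This is a naive reference (the quadratic baseline discussed below), not RapidFuzz's optimized implementation; the Indel similarity is computed from the LCS length via the textbook dynamic program:

```python
def lcs_length(s1, s2):
    # classic O(len(s1) * len(s2)) dynamic program for the LCS length
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        cur = [0]
        for j, c2 in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def indel_similarity(s1, s2):
    # normalized Indel similarity: 1 - indel_distance / (len1 + len2),
    # which simplifies to 2 * LCS / (len1 + len2)
    total = len(s1) + len(s2)
    return 2 * lcs_length(s1, s2) / total if total else 1.0

def partial_ratio_naive(shorter, longer):
    # enumerate every alignment listed above: truncated windows at both
    # ends plus all full-length windows, and keep the best similarity
    if len(shorter) > len(longer):
        shorter, longer = longer, shorter
    candidates = (
        [longer[:i] for i in range(1, len(shorter))]    # "ab" <-> "a"
        + [longer[i:i + len(shorter)]                   # "ab" <-> "ab", "bc", "cd"
           for i in range(len(longer) - len(shorter) + 1)]
        + [longer[i:]                                   # "ab" <-> "d"
           for i in range(len(longer) - len(shorter) + 1, len(longer))]
    )
    return max(indel_similarity(shorter, window) for window in candidates)
```

For "ab" against "abcd" this enumerates exactly the five alignments listed above and returns 1.0 for the exact window "ab".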

This is pretty slow, since the Indel similarity is calculated in O(len(shorter)²) and it has to perform O(len(longer)) comparisons, so O(len(shorter)² * len(longer)) overall. My implementation improves this in multiple regards:

  1. the Indel similarity is calculated using the bit-parallel implementation described in the paper "Bit-Parallel LCS-length Computation Revisited" by Heikki Hyyrö. This processes 64 characters in parallel, so as long as the shorter string is <= 64 characters, each similarity computation runs in linear instead of quadratic time.
  2. I cache some common parts of the Indel similarity calculation. I think for shorter strings this saved about 50% runtime. For strings with lengths > 64 characters, this helps less.
  3. when moving the sliding window, at most one character is added and one removed. This means the score can only improve by a bounded amount per move. Based on previous scores and the score_cutoff passed in by the user, I am able to skip positions that can't achieve a score better than the best score found so far or the score_cutoff. There is not really any paper / documentation for this so far apart from the code you linked.

This is pretty fast as long as the shorter string is at most 64 characters, which is usually the case for a metric like this.
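For point 1, here is a sketch of the bit-parallel LCS-length computation as I read it from Hyyrö's paper; Python's arbitrary-precision ints stand in for the 64-bit registers, and the Indel distance then follows directly from the LCS length:

```python
def bitparallel_lcs_length(pattern, text):
    """LCS length via bit-parallelism (a sketch after Hyyrö's paper).
    For len(pattern) <= 64 every text character costs a handful of
    word operations; Python ints emulate the 64-bit words here."""
    m = len(pattern)
    if m == 0:
        return 0
    mask = (1 << m) - 1
    # match masks: bit i of pm[c] is set iff pattern[i] == c
    pm = {}
    for i, ch in enumerate(pattern):
        pm[ch] = pm.get(ch, 0) | (1 << i)
    v = mask  # all ones; each bit cleared later marks one matched character
    for ch in text:
        u = v & pm.get(ch, 0)
        v = ((v + u) & mask) | (v - u)
    return m - bin(v).count("1")

def indel_distance(pattern, text):
    # the Indel distance relates to the LCS length as len1 + len2 - 2 * LCS
    return len(pattern) + len(text) - 2 * bitparallel_lcs_length(pattern, text)
```

For example, `indel_distance("ab", "abcd")` is 2 (delete "c" and "d"). Strings longer than 64 characters need multiple words per step, which is where the speedup starts to flatten.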

CUDA

I think this branching is probably somewhat of a pain on the GPU though. The basic bitparallel algorithms should lend themselves fairly well to execution on a GPU, especially the SIMD versions that compare multiple strings in parallel. I never tested whether it's worth the overhead of transferring them to the GPU though. E.g. when calculating the Indel distance for multiple 8 character strings, my SIMD implementation already runs in under 1 CPU cycle per string comparison.
I'd be interested in your findings though, and whether you manage to get some of the algorithms faster on GPUs :)

@niansong1996
Author

Thanks a bunch for the quick and detailed reply! To provide a bit more background, I'm trying to solve a problem where I need to match a shorter string (e.g., length 100-500) in a very, very long string (e.g., length 1,000,000,000+), and it's a setting where accuracy (i.e., optimality) is very important.

Partial Ratio

My initial implementation is also based on a sliding window, and indeed it's $O(len(shorter)^2 \cdot len(longer))$, so it's quite slow. Then I found an algorithm called Smith-Waterman, which seems to reduce the complexity to $O(len(shorter) \cdot len(longer))$.
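For reference, the textbook Smith-Waterman recurrence can be written as a short score-only sketch; the match/mismatch/gap values below are illustrative choices, not anything prescribed by RapidFuzz or the slides:

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Textbook score-only Smith-Waterman: O(len(a) * len(b)) time,
    O(len(b)) memory. Scoring parameters are illustrative choices."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            score = max(
                0,                                                # start a new local alignment
                prev[j - 1] + (match if ca == cb else mismatch),  # diagonal step
                prev[j] + gap,                                    # gap in b
                cur[-1] + gap,                                    # gap in a
            )
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best
```

The `max(0, ...)` floor is what makes the alignment local: a bad prefix is simply dropped instead of dragging the score down.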

Here are the slides which made it easier for me to understand it (especially this slide):
[screenshot of the slide]

Have you come across this algorithm or considered using this instead of the sliding window for partial ratio? Maybe I'm missing something and it doesn't really solve the same problem.

CUDA

For CUDA, it's actually quite interesting because the computation dependency is along the diagonal when we do the dynamic programming, so we can actually vectorize and parallelize it very efficiently. I have an implementation that works quite well, and here is a doodle to illustrate it:
[doodle of the diagonal computation]

As you can see, the computation of each diagonal group (in red circles) only depends on the previous two groups, so the memory usage is also small: $O(\min(len(short), len(long)))$.
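That dependency structure can be sketched with NumPy standing in for the GPU vector unit. This is my reading of the anti-diagonal idea applied to a score-only Smith-Waterman with illustrative unit scores; each diagonal becomes one vector operation that reads only the previous two diagonals:

```python
import numpy as np

def smith_waterman_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    """Score-only Smith-Waterman evaluated one anti-diagonal at a time.
    Every cell on diagonal d depends only on diagonals d-1 and d-2, so a
    whole diagonal is one vector operation -- the same structure a CUDA
    kernel would exploit. Scoring parameters are illustrative."""
    if len(a) > len(b):
        a, b = b, a  # keep the per-diagonal arrays O(min(len(a), len(b)))
    m, n = len(a), len(b)
    A = np.array([ord(c) for c in a])
    B = np.array([ord(c) for c in b])
    prev2 = np.zeros(m + 1, dtype=np.int64)  # diagonal d-2, indexed by row i
    prev1 = np.zeros(m + 1, dtype=np.int64)  # diagonal d-1
    best = 0
    for d in range(2, m + n + 1):
        lo, hi = max(1, d - n), min(m, d - 1)  # rows present on diagonal d
        i = np.arange(lo, hi + 1)
        j = d - i
        sub = np.where(A[i - 1] == B[j - 1], match, mismatch)
        cur = np.zeros(m + 1, dtype=np.int64)
        cur[i] = np.maximum.reduce([
            np.zeros(len(i), dtype=np.int64),  # local alignment floor
            prev2[i - 1] + sub,                # diagonal (match/mismatch)
            prev1[i - 1] + gap,                # gap in b
            prev1[i] + gap,                    # gap in a
        ])
        best = max(best, int(cur[i].max()))
        prev2, prev1 = prev1, cur
    return best
```

Only two length-(m+1) buffers are ever live, which matches the memory bound above; on a GPU, the per-diagonal vector math maps to one kernel launch or warp-level step per diagonal.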

Happy to discuss more if you are interested :)

@maxbachmann
Member

Have you come across this algorithm

I still plan to add an implementation of Smith-Waterman: rapidfuzz/RapidFuzz#175. However, there are already a lot of very optimized bitparallel implementations of this in the space of bioinformatics.
On the GPU there is e.g. https://cudasw.sourceforge.net/homepage.htm#latest

or considered using this instead of the sliding window for partial ratio? Maybe I'm missing something and it doesn't really solve the same problem.

It doesn't lead to quite the same results. Smith-Waterman searches for a local alignment, and that alignment can have a different length than the shorter sequence. The sliding window algorithm ensures that the compared strings have the same length.
This is not necessarily better, though, just different. The reason I use the sliding window is that I want to keep the same behaviour as the original implementation in fuzzywuzzy.

@niansong1996
Author

However there are already a lot of very optimized bitparallel implementations of this in the space of bioinformatics.
On the GPU there is e.g. https://cudasw.sourceforge.net/homepage.htm#latest

Yes, I found that too. CUDASW is quite old though, and it is not straightforward to apply to strings. But I do find the paper useful as a reference for my implementation.

Smith Waterman searches for a local alignment, but this alignment can have different lengths.

I see. I think this works well for my use case, but keeping the same behavior as the original fuzzywuzzy makes sense to me.

Thanks again for the helpful discussion and pointers to resources! Closing this issue on my end.
