Algorithm used for partial_ratio_impl
#113
Comments
Partial Ratio
The basic algorithm searches for an optimal alignment of the shorter sequence in the longer sequence. The compared subsequence has to be either as long as the shorter string, or placed at the start/end of the longer sequence. So e.g. for two sequences "ab" and "abcd" it would compare the following alignments:
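A minimal sketch of the window enumeration described above, assuming the shorter string is not longer than the other one. The helper name `candidate_windows` is hypothetical and is not part of RapidFuzz; it only lists the substrings of the longer sequence that the shorter one would be compared against.

```python
def candidate_windows(shorter: str, longer: str) -> list[str]:
    """List the substrings of `longer` that `shorter` is aligned against.

    Hypothetical helper for illustration; assumes len(shorter) <= len(longer).
    """
    m, n = len(shorter), len(longer)
    windows = []
    # Windows shorter than m that are anchored at the start of `longer`.
    for end in range(1, m):
        windows.append(longer[:end])
    # Full-length windows (same length as `shorter`), slid one step at a time.
    for start in range(n - m + 1):
        windows.append(longer[start:start + m])
    # Windows shorter than m that are anchored at the end of `longer`.
    for start in range(n - m + 1, n):
        windows.append(longer[start:])
    return windows


for window in candidate_windows("ab", "abcd"):
    print(f'"ab" <-> "{window}"')
```

For "ab" and "abcd" this yields the windows "a", "ab", "bc", "cd" and "d".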
So it is a basic sliding window algorithm. The similarity between the two sequences is calculated in terms of the normalized Indel similarity. This basic implementation is based on https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/. However, their initial implementation did not guarantee finding the optimal alignment. The basic algorithm is pretty slow, since each Indel similarity is calculated in O(len(shorter)²) and it has to perform O(len(longer)) comparisons, so overall it would be O(len(shorter)² * len(longer)). My implementation improves this in multiple regards:
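For reference, the unoptimized baseline (before any improvements) can be sketched as follows. This is an illustration under stated assumptions, not RapidFuzz's code: the function names are hypothetical, the score is returned in the range 0 to 1, and the handling of empty inputs is a choice made only for this sketch. The Indel similarity is derived from the longest common subsequence (LCS), since the Indel distance equals len(a) + len(b) - 2 * LCS(a, b).

```python
def lcs_length(a: str, b: str) -> int:
    """Classic O(len(a) * len(b)) dynamic program for the LCS length."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_normalized_similarity(a: str, b: str) -> float:
    """1 - indel_distance / (len(a) + len(b)), which simplifies to 2 * LCS / total."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption of this sketch: two empty strings are identical
    return 2 * lcs_length(a, b) / total


def naive_partial_ratio(s1: str, s2: str) -> float:
    """Best normalized Indel similarity over all sliding-window alignments."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    m, n = len(shorter), len(longer)
    if m == 0:
        return 1.0 if n == 0 else 0.0  # assumption of this sketch
    best = 0.0
    # start < 0 gives the clipped windows at the start, start > n - m those at the end.
    for start in range(1 - m, n):
        window = longer[max(0, start):min(n, start + m)]
        best = max(best, indel_normalized_similarity(shorter, window))
    return best


print(naive_partial_ratio("ab", "abcd"))
```

Each window costs O(len(shorter)²) and there are O(len(longer)) windows, which matches the overall bound given above.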
This is pretty fast as long as the shorter string is no longer than 64 characters, which is usually the case for a metric like this.
CUDA
I think this branching is probably somewhat of a pain on the GPU though. The basic bitparallel algorithms should lend themselves fairly well to execution on a GPU, especially the SIMD versions that compare multiple strings in parallel. I never tested whether it's worth the overhead of transferring them to the GPU though. E.g. when calculating the Indel distance for multiple 8 character strings, my SIMD implementation already runs in under 1 CPU cycle per string comparison.
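To make the "bitparallel" idea concrete, here is a sketch of a Hyyrö-style bit-parallel LCS computation, from which the Indel distance follows directly. It is an assumption that this resembles the approach meant above; the thread does not show the actual code, and the function names are hypothetical. Python integers are unbounded, so the explicit mask stands in for the fixed word width. In a language like C++ the state would fit in a single 64-bit word when the pattern has at most 64 characters.

```python
def bitparallel_lcs_length(pattern: str, text: str) -> int:
    """LCS length using one bit per pattern character (illustrative sketch)."""
    m = len(pattern)
    if m == 0:
        return 0
    mask = (1 << m) - 1
    # match_bits[c] has bit i set when pattern[i] == c.
    match_bits: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        match_bits[ch] = match_bits.get(ch, 0) | (1 << i)

    state = mask  # all ones: no pattern position has been matched yet
    for ch in text:
        u = state & match_bits.get(ch, 0)
        # One add, one subtract and one OR update a whole DP row at once.
        state = ((state + u) | (state - u)) & mask
    # Every zero bit left in the state corresponds to one unit of LCS length.
    return m - bin(state).count("1")


def indel_distance(a: str, b: str) -> int:
    """Indel distance = insertions + deletions = len(a) + len(b) - 2 * LCS."""
    return len(a) + len(b) - 2 * bitparallel_lcs_length(a, b)


print(indel_distance("kitten", "sitting"))
```

The inner loop has no data-dependent branches, which is part of why this style of algorithm maps well onto SIMD lanes.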
Thanks a bunch for the quick and detailed reply! To provide a bit more background, I'm trying to solve a problem where I need to match a shorter string (e.g., length 100-500) in a very, very long string (e.g., length 1,000,000,000+), and it's a setting where accuracy (i.e., optimality) is very important.
Partial Ratio
My initial implementation is also based on a sliding window, and indeed it's of
Here are the slides which made it easier for me to understand it (especially this slide):
Have you come across this algorithm or considered using this instead of the sliding window for partial ratio? Maybe I'm missing something and it doesn't really solve the same problem.
CUDA
For CUDA, it's actually quite interesting because the computation dependency is along the diagonal when we do the dynamic programming, so we can actually vectorize and parallelize it very efficiently. I have an implementation that works quite well, and here is a doodle to illustrate it:
As you can see, the computation of each diagonal group (in red circles) only depends on the previous two groups, so the memory usage is also smaller, as
Happy to discuss more if you are interested :)
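The diagonal dependency described above can be sketched on the CPU as follows. Since the thread does not show the exact recurrence being used, plain Levenshtein distance serves as a stand-in here, and the function name is hypothetical. Cells on the same anti-diagonal (i + j = d) do not depend on each other, so the inner loop is the part a GPU could run in parallel, and only the two previous diagonals need to be kept in memory.

```python
def levenshtein_antidiagonal(a: str, b: str) -> int:
    """Levenshtein distance computed one anti-diagonal at a time (sketch)."""
    m, n = len(a), len(b)
    # Each buffer is indexed by the row i of the cell (i, d - i) on that diagonal.
    prev2 = [0] * (m + 1)  # diagonal d - 2
    prev1 = [0] * (m + 1)  # diagonal d - 1
    for d in range(m + n + 1):
        cur = [0] * (m + 1)
        # All cells in this loop are independent of each other.
        for i in range(max(0, d - n), min(m, d) + 1):
            j = d - i
            if i == 0:
                cur[i] = j  # first row: j insertions
            elif j == 0:
                cur[i] = i  # first column: i deletions
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[i] = min(
                    prev1[i - 1] + 1,     # cell (i - 1, j) on diagonal d - 1
                    prev1[i] + 1,         # cell (i, j - 1) on diagonal d - 1
                    prev2[i - 1] + cost,  # cell (i - 1, j - 1) on diagonal d - 2
                )
        prev2, prev1 = prev1, cur
    return prev1[m]  # the last diagonal holds only the cell (m, n)


print(levenshtein_antidiagonal("kitten", "sitting"))
```

This sketch allocates a fresh buffer per diagonal for clarity; rotating three preallocated buffers would avoid that.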
I still plan to add an implementation of Smith-Waterman (rapidfuzz/RapidFuzz#175). However, there are already a lot of very optimized bitparallel implementations of this in the space of bioinformatics.
It doesn't lead to quite the same results. Smith-Waterman searches for a local alignment, but this alignment can have different lengths. The sliding window algorithm ensures that the compared strings have the same length.
Yes, I found that too. CUDASW is quite old though, and it is not straightforward to apply to strings. But I do find the paper useful as a reference for my implementation.
I see, I think this works well for my use case, but keep the same behavior as the original
Thanks again for the helpful discussion and pointers to resources! Closing this issue on my end.
Hi,
I'm wondering what algorithm is used for partial_ratio_impl (link here)? Are there any paper references that I can follow?
I am trying to implement a CUDA-accelerated version of edit distance computation, and would definitely benefit a lot from the implementations here.
Thanks!