Add support for Smith Waterman algorithm #175

Open
maxbachmann opened this issue Jan 2, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@maxbachmann
Member

The Smith-Waterman algorithm is a commonly used local alignment metric for comparing strings. It would be useful to add it to RapidFuzz.

@maxbachmann maxbachmann added the enhancement New feature or request label Jan 2, 2022
@maxbachmann maxbachmann added this to the v2.0 milestone Jan 2, 2022
@maxbachmann maxbachmann modified the milestones: v2.0, v2.1 Jan 24, 2022
@maxbachmann maxbachmann modified the milestones: v2.1, v2.2.0 Jun 29, 2022
@maxbachmann maxbachmann removed this from the v2.5.0 milestone Aug 13, 2022
@hongduosun

Hi, will this algorithm be added soon?

@maxbachmann
Member Author

I am not sure yet. I think I could add a simple implementation in the near future. When matching long sequences, you would probably want to use a more optimized implementation such as https://github.com/jeffdaily/parasail (which has Python bindings).
I do not think I will have the time to write an implementation that is even close to this level of optimization.

@hongduosun

Got it, thanks for the great work!

@maxbachmann
Member Author

After looking at the parasail Python bindings, I found them way too hard to use. For example:

Be careful using the attributes of the Result object - especially on Result instances constructed on the fly. For example, calling parasail.sw_trace("asdf", "asdf", 11, 1, parasail.blosum62).cigar.seq returns a numpy.ndarray that wraps a pointer to memory that is invalid because the Cigar is deallocated before the seq statement. You can avoid this problem by assigning Result instances to variables as in the example above.

This is not really acceptable in my opinion, so I decided to add at least a simple implementation for people who just want something that works properly.
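
For readers wondering what a "simple implementation" might look like, here is a minimal pure-Python sketch of Smith-Waterman scoring with a linear gap penalty. It is not RapidFuzz code. The function name `smith_waterman_score` and the default weights (`match=2`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration. It returns only the best local alignment score and keeps no traceback.

```python
def smith_waterman_score(s1, s2, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of s1 and s2 (hypothetical helper).

    Uses a linear gap penalty and keeps only two rows of the
    scoring matrix, so memory use is O(len(s2)).
    """
    prev = [0] * (len(s2) + 1)  # row for the previous character of s1
    best = 0
    for ch1 in s1:
        cur = [0]  # first column is always 0 for local alignment
        for j, ch2 in enumerate(s2, start=1):
            diag = prev[j - 1] + (match if ch1 == ch2 else mismatch)
            up = prev[j] + gap        # gap in s2
            left = cur[j - 1] + gap   # gap in s1
            # Clamping at 0 lets an alignment restart anywhere,
            # which is what makes the alignment local.
            score = max(0, diag, up, left)
            cur.append(score)
            if score > best:
                best = score
        prev = cur
    return best


print(smith_waterman_score("asdf", "asdf"))      # 8: four matches at +2 each
print(smith_waterman_score("xxasdfyy", "asdf"))  # 8: surrounding text is ignored
```

Because the function returns a plain integer, no result object has to be kept alive, which avoids the lifetime pitfall quoted above. The runtime is O(len(s1) * len(s2)), so for long sequences an optimized library is still the better choice.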


2 participants