Faster suffix array computation #59
Comments
The suffix crate is interesting in this respect: https://crates.io/crates/suffix but it's UTF-8 only. I found this description interesting:
How does that relate to fm-index, SA-IS, and storing UTF-8?
I haven't quite figured out what decoding overhead they mean. Perhaps they mean that a located position must be counted in UTF-8 characters rather than bytes; for instance, the location of the character "え" (0xE38188) in "あいうえお" must be 3 rather than 9.
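To make the 3-versus-9 distinction concrete, here is a minimal sketch using only the Rust standard library. The helper name `byte_to_char_index` is hypothetical and is not part of the suffix crate or fm-index; it only illustrates the kind of conversion a character-based position would require on top of a byte-based one.

```rust
/// Hypothetical helper: convert a byte offset in `text` into a character
/// (Unicode scalar value) index. Returns `None` when `byte_offset` does not
/// fall on a character boundary or lies past the end of the text.
fn byte_to_char_index(text: &str, byte_offset: usize) -> Option<usize> {
    if !text.is_char_boundary(byte_offset) {
        return None;
    }
    // Decoding the prefix is linear in `byte_offset`; this may be the kind of
    // "decoding overhead" a character-based position would incur.
    Some(text[..byte_offset].chars().count())
}
```

For "あいうえお", `str::find("え")` reports the byte offset 9, and `byte_to_char_index(text, 9)` turns that into the character index 3.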
It looks like this is not true. The suffix crate doesn't even provide pattern positions based on UTF-8 characters:
use suffix::SuffixTable;
fn main() {
    let st = SuffixTable::new("あいうえお");
    // Holds only if positions are counted in characters; the byte offset of "え" is 9.
    assert_eq!(st.positions("え"), &[3]);
}
I also found that libsais and libcubwt are regarded in some of the literature as the fastest modern libraries, and both are maintained. They are both C libraries, and I found a Rust binding for the former, libsais-rs, but this binding seems to be a work in progress. They also provide the Burrows-Wheeler transform required by the FM-index.
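For context on why a suffix array library and the Burrows-Wheeler transform go together, the sketch below derives the BWT from an already-built suffix array. It is not how libsais, libcubwt, or this crate implement it; the function name `bwt_from_suffix_array` is hypothetical, and the code assumes the text ends with a unique sentinel byte that is smaller than every other byte.

```rust
/// Hypothetical helper: derive the Burrows-Wheeler transform of `text` from
/// its suffix array `sa`.
///
/// Assumption: `text` ends with a unique sentinel byte smaller than all other
/// bytes (for example b'$' in the ASCII example used here), and `sa` is a
/// valid suffix array of `text`.
fn bwt_from_suffix_array(text: &[u8], sa: &[usize]) -> Vec<u8> {
    sa.iter()
        .map(|&start| {
            if start == 0 {
                // The suffix starting at 0 is preceded (cyclically) by the last byte.
                text[text.len() - 1]
            } else {
                text[start - 1]
            }
        })
        .collect()
}
```

With the text `banana$` and its suffix array `[6, 5, 3, 1, 0, 4, 2]`, this yields `annb$aa`. Once the suffix array exists, the transform is a single linear pass, so the construction cost is dominated by the suffix array itself.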
I would like to avoid depending on non-Rust code, though; it would complicate installation.
One of the most time-consuming parts of FM-Index construction is building a suffix array.
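As a point of reference, a naive construction can be sketched in a few lines. This is deliberately not SA-IS and not this crate's code; the name `naive_suffix_array` is hypothetical, and the approach is only suitable as a correctness baseline when comparing faster implementations on small inputs.

```rust
/// Hypothetical baseline: build a suffix array by sorting suffix start
/// positions with plain lexicographic comparison of the byte suffixes.
///
/// Each comparison may scan O(n) bytes, so the worst case is O(n^2 log n).
/// Linear-time algorithms such as SA-IS exist precisely to avoid this cost.
fn naive_suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    // A shorter suffix that is a prefix of a longer one sorts first,
    // which matches the usual suffix array ordering.
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}
```

For `banana$` this produces `[6, 5, 3, 1, 0, 4, 2]`. Comparing a faster implementation's output against this baseline on random short inputs is one simple way to check it before benchmarking.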
This crate currently uses the SA-IS algorithm implemented by @ajalab. There might be room for performance improvement: