Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster suffix array computation #59

Open
ajalab opened this issue Feb 14, 2025 · 5 comments
Open

Faster suffix array computation #59

ajalab opened this issue Feb 14, 2025 · 5 comments

Comments

@ajalab
Copy link
Owner

ajalab commented Feb 14, 2025

One of the most time-consuming parts of FM-Index construction is building a suffix array.

This crate currently uses the SA-IS algorithm implemented by @ajalab. There might be room for performance improvement:

  • It's said that SA-IS can construct suffix array in linear time. However, since the algorithm scans the given text multiple times, we can't ignore the hidden constant factor in O(n) time. Improvements on the multi-text support has been proposed in Implement FM-Index with multi-text support #52 (comment).
  • We may consider implementing other suffix array construction algorithms like DivSufSort.
  • We may even consider adopting third-party libraries/crates for suffix array construction developed by experts.
@faassen
Copy link
Collaborator

faassen commented Feb 14, 2025

The suffix crate is interesting in this respect:

https://crates.io/crates/suffix

but it's UTF-8 only.

I found this description interesting:

Moreover, most (all?) don't support Unicode and instead operate on bytes, which means they aren't paying the overhead of decoding UTF-8.

How does that relate to fm-index, sa-is and storing UTF-8?

@ajalab
Copy link
Owner Author

ajalab commented Feb 15, 2025

which means they aren't paying the overhead of decoding UTF-8.

I haven't figured out the decoding overhead they mean well. Perhaps they mean that the located position must be based on UTF-8 characters; for instance, the location of a character "え" (0xE38188) in "あいうえお" must be 3 rather than 9.

@ajalab
Copy link
Owner Author

ajalab commented Feb 15, 2025

Perhaps they mean that the located position must be based on UTF-8 characters; for instance, the location of a character "え" (0xE38188) in "あいうえお" must be 3 rather than 9.

It looks this is not true. The suffix crate even doesn't provide pattern positions based on UTF-8 characters.

use suffix::SuffixTable;

fn main() {
  let st = SuffixTable::new("あいうえお");

  assert_eq!(st.positions("え"), &[3]);
}
thread 'main' panicked at src/main.rs:6:3:
assertion `left == right` failed
  left: [9]
 right: [3]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@ajalab
Copy link
Owner Author

ajalab commented Feb 15, 2025

I also found libsais and libcubwt are considered modern fastest libraries in some literatures and maintained. They are both C libraries, and I found a Rust binding libsais-rs for the former. But this binding seems to be a work in progress. They also provide Burrows-Wheeler transformation required by FM-Index.

@faassen
Copy link
Collaborator

faassen commented Feb 15, 2025

I would like to avoid depending on non-rust code though - it will complicate installation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants