Faster suffix array computation #59

ajalab · 2025-02-14T11:19:38Z

One of the most time-consuming parts of FM-Index construction is building a suffix array.

This crate currently uses the SA-IS algorithm implemented by @ajalab. There might be room for performance improvement:

SA-IS is known to construct a suffix array in linear time. However, since the algorithm scans the given text multiple times, the constant factor hidden in the O(n) bound is not negligible. Improvements to the multi-text support have been proposed in Implement FM-Index with multi-text support #52 (comment).
We may consider implementing other suffix array construction algorithms like DivSufSort.
We may even consider adopting third-party libraries/crates for suffix array construction developed by experts.
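Whichever option is chosen, a slow but obviously correct baseline helps when checking a faster implementation against it. The following is a minimal sketch using only the standard library; the function names `naive_suffix_array` and `is_valid_suffix_array` are hypothetical and not part of this crate, and the sketch assumes a plain byte text with no sentinel handling.

```rust
/// Naive suffix array construction: sort all suffix start positions
/// by comparing the suffixes as byte slices.
/// Worst case is O(n^2 log n), so this is only a correctness baseline.
fn naive_suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// Checks that `sa` is a valid suffix array of `text`:
/// it has one entry per suffix, every entry is in range, and the
/// suffixes appear in strictly increasing lexicographic order.
/// Strict ordering also implies that no position is repeated.
fn is_valid_suffix_array(text: &[u8], sa: &[usize]) -> bool {
    sa.len() == text.len()
        && sa.iter().all(|&i| i < text.len())
        && sa.windows(2).all(|w| text[w[0]..] < text[w[1]..])
}
```

The output of a candidate algorithm can be compared directly against `naive_suffix_array` on small inputs, or passed to `is_valid_suffix_array` on larger ones where the naive sort would be too slow.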

faassen · 2025-02-14T16:56:55Z

The suffix crate is interesting in this respect:

https://crates.io/crates/suffix

but it's UTF-8 only.

I found this description interesting:

Moreover, most (all?) don't support Unicode and instead operate on bytes, which means they aren't paying the overhead of decoding UTF-8.

How does that relate to fm-index, sa-is and storing UTF-8?

ajalab · 2025-02-15T03:43:15Z

which means they aren't paying the overhead of decoding UTF-8.

I haven't quite figured out what decoding overhead they mean. Perhaps they mean that the located position must be counted in UTF-8 characters; for instance, the location of the character "え" (0xE38188) in "あいうえお" must be 3 rather than 9.

ajalab · 2025-02-15T03:49:14Z

Perhaps they mean that the located position must be based on UTF-8 characters; for instance, the location of a character "え" (0xE38188) in "あいうえお" must be 3 rather than 9.

It looks like this is not true. The suffix crate doesn't even provide pattern positions based on UTF-8 characters.

use suffix::SuffixTable;

fn main() {
  let st = SuffixTable::new("あいうえお");

  assert_eq!(st.positions("え"), &[3]);
}

thread 'main' panicked at src/main.rs:6:3:
assertion `left == right` failed
  left: [9]
 right: [3]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
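If character-based positions are wanted, a byte offset such as the 9 reported above can be converted afterwards. This is a minimal sketch using only the Rust standard library; `byte_offset_to_char_index` is a hypothetical helper, not something the suffix crate or fm-index provides.

```rust
/// Converts a byte offset into `text` to a character index.
/// Returns `None` if the offset is out of range or falls in the
/// middle of a multi-byte UTF-8 sequence.
fn byte_offset_to_char_index(text: &str, byte_offset: usize) -> Option<usize> {
    if !text.is_char_boundary(byte_offset) {
        return None;
    }
    // Count the characters that precede the offset.
    Some(text[..byte_offset].chars().count())
}
```

For "あいうえお", each character occupies three bytes, so byte offset 9 maps to character index 3. The conversion walks the prefix, so it costs time proportional to the offset for each lookup.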

ajalab · 2025-02-15T03:57:19Z

I also found that libsais and libcubwt are regarded in some of the literature as the fastest modern libraries, and both are maintained. They are both C libraries, and I found a Rust binding, libsais-rs, for the former, but this binding seems to be a work in progress. Both libraries also provide the Burrows-Wheeler transform required by FM-Index.
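For reference, the relation between a suffix array and the Burrows-Wheeler transform can be sketched as follows. This is only an illustration of the definition using the standard library, not how libsais or libcubwt compute it; the function name `bwt_via_suffix_array` is hypothetical, and the sketch assumes the input contains no 0 byte so that 0 can serve as the sentinel.

```rust
/// Computes the BWT of `text` followed by a 0 sentinel, using a
/// naively sorted suffix array.
/// Assumption: `text` contains no 0 byte.
fn bwt_via_suffix_array(text: &[u8]) -> Vec<u8> {
    assert!(!text.contains(&0), "0 is reserved as the sentinel");

    // Append the sentinel, which sorts before every other byte.
    let mut t = text.to_vec();
    t.push(0);

    // Naive suffix array: sort suffix start positions.
    let mut sa: Vec<usize> = (0..t.len()).collect();
    sa.sort_by(|&a, &b| t[a..].cmp(&t[b..]));

    // BWT[i] is the byte preceding the i-th smallest suffix,
    // wrapping around to the last byte for the suffix at position 0.
    sa.iter()
        .map(|&i| if i == 0 { t[t.len() - 1] } else { t[i - 1] })
        .collect()
}
```

For the text "banana" this yields the bytes of "annb", the sentinel, and "aa", which matches the usual textbook result "annb$aa" with "$" as the sentinel.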

faassen · 2025-02-15T12:12:43Z

I would like to avoid depending on non-Rust code though, since it will complicate installation.
