-
Notifications
You must be signed in to change notification settings - Fork 93
Suffix Array thoughts
A Suffix Array is a sorted array of all suffixes of a string. This data structure is useful for certain string algorithms including lossless data compression and string sorting
Example:
Given a string "banana$", the suffixes are as follows
s[0]="banana$"
s[1]="anana$"
s[2]="nana$"
s[3]="ana$"
s[4]="na$"
s[5]="a$"
s[6]="$"
The suffix array of string "banana$" is the array of indices of sorted suffixes
s[6]="$"
s[5]="a$"
s[3]="ana$"
s[1]="anana$"
s[0]="banana$"
s[4]="na$"
s[2]="nana$"
so sa=[6,5,3,1,0,4,2]
Arkouda has pull requests (See PR #865 and #627) which implement suffix arrays with linear time construction
In the current implementation, suffix array is a standalone object with the client having disjoint SArrays
and Strings
classes. I've started thinking that suffix array construction should be an operation performed on a Strings
object.
My reasoning is a suffix array is an intermediary representation which doesn't make sense without an associated string. There isn't a use case for having a suffix array without a string, so it doesn't need to be it's own class. Instead of converting Strings
objects into SArrays
, we could have the suffix array construction methods reside in the strings class.
There is the added benefit of reducing duplicated code since much of the suffix array class code (suffix_array.py
and SegmentedSuffixArray.chpl
) is very similar to strings class code (strings.py
and SegmentedArray.chpl
) with a few added functions
- Have all algorithms which require suffix arrays to create them on the fly as necessary
- This shouldn't be an algorithmic bottle neck since the construction is linear
We build suffix arrays as needed while we are implementing the algorithms which require them. We can then do a bake-off to see how the performance compares with our current non-suffix array methods. If these perform better, we can look into whether having more permanent storage for suffix arrays as an additional component in SegStrings
on the server is worthwhile
- Automatically create the suffix array as an additional component of
SegStrings
on the server- Increased memory required to avoid repeated construction of suffix arrays if multiple algorithms requiring them are called on the same
SegString
- Can be done using lazy initialization, only constructing the suffix arrays the first time a function is called.
- This extra memory is already happening when building
SArrays
objects fromStrings
- Increased memory required to avoid repeated construction of suffix arrays if multiple algorithms requiring them are called on the same