-
Notifications
You must be signed in to change notification settings - Fork 82
Search Module
This gives an overview of the search module and is a good start to get familiar with the basics of it.
The search module covers to areas: string indices (such as FM indices, suffix arrays or k-mer indices) and search algorithms (those that need a string index and those who don't). Currently, only FM indices and search algorithms for FM indices are implemented.
The string indices in SeqAn3 are based on the SDSL (Succinct Data Structure Library) and so far only cover FM indices (i.e. the Burrows-Wheeler-Transform and a sampled suffix array). In the SDSL FM indices are referred to as compressed suffix arrays (CSAs). SeqAn3 provides wrapper classes for indices to have a consistent interface with other SeqAn3 classes and for Cereal Support (Serialization). It also takes care of indexing string collections (formerly known as string sets), e.g. a vector of DNA sequences.
SeqAn3 index wrappers only construct and hold the data structures and allow serialization. The actual search is performed by iterators. These iterators (which actually do not behave like actual iterators. Thus, they will be renamed to cursors soon) can search a sequence in the indexed text. The text is searched character by character, either to the right or to the left (if the index is bidirectional).
Currently only search schemes are implemented. To keep the maintenance low and since other search algorithms based on indices can be formulated as search schemes (e.g. 01*0 seeds, pigeonhole, etc.), this is the only (index-based) search algorithm. A trivial backtracking approach is implemented for unit testing only. If for a given error number a precomputed optimal search scheme does not exist, a (most likely not-optimal) search scheme is constructed using the pigeonhole principle, 01*0 seeds or similar.
Search schemes can be stored in either search_scheme_type
or search_scheme_dyn_type
. The first is based on std::array
, the second on std::vector
. Since the size of search schemes (number of searches and number of blocks) are known for optimal search schemes at compile time, they are of type search_scheme_type
. Search schemes computed at compile time for larger errors (for which no optimal search scheme has been computed and implemented yet) are usually of type search_scheme_dyn_type
since the number of blocks and searches are not known at compile time (mainly because the number of errors is not a compile-time parameter).
Search schemes have a lower error bound. This is not part of the public interface, i.e. the user cannot specify a minimum number of errors. This feature is only used for internal implementations of strata searches.
Unfortunately, commits have been squashed. The algorithmic differences of the search schemes from SeqAn2 and SeqAn3 are not visible in the commits anymore. This might be helpful if unexpected behaviour occurs. Some if clauses were superfluous and the behaviour for insertions/deletion at the beginning and end of a sequence changed. All changes refer to search/algorithm/detail/search_scheme_algorithm.hpp
.
inline bool search_ss_deletion(...)
{
...
// Insert deletions into the current block as long as possible
- if (!(search.pi[block_id] == 1 && !go_right) && max_error_left_in_block > 0 &&
- error_left.total > 0 && error_left.deletion > 0 &&
+ if (!(search.pi[block_id] == 1 && !go_right) && // Do not allow deletions at the beginning of the leftmost block
+ !(search.pi[block_id] == search.blocks() && go_right) && // Do not allow deletions at the end of the rightmost block
+ max_error_left_in_block > 0 && error_left.total > 0 && error_left.deletion > 0 &&
...
}
inline bool search_ss_children(...)
{
...
- // TODO: incorporate error_left.deletion into formula and simplify a bit
- if (error_left.deletion == 0 && min_error_left_in_block > 0 && chars_left + delta < min_error_left_in_block + 1u)
+ // TODO: incorporate error_left.deletion into formula
+ if (error_left.deletion == 0 && chars_left + delta < min_error_left_in_block + 1u)
...
// Deletion
- if (error_left.deletion > 0)
+ // TODO: check whether the conditions for deletions at the beginning/end of the query are really necessary
+ if (error_left.deletion > 0 &&
+ !(go_right && (rb == 1 || rb == query.size() + 1)) && // No deletion at the beginning of the leftmost block.
+ !(!go_right && (lb == 0 || lb == query.size()))) // No deletion at the end of the rightmost block.
...
}