Module: IO Alignments
- TODO: Right now the REF_ID field for the SAM format is a string, but it should be a number, as in BAM files. That way the SAM and BAM formats handle REF_ID consistently.
- Alignment Class. Update: There is no alignment class anymore, since we just have a pair of aligned sequences and a class wrapper would introduce unnecessary overhead.
  - replace the `stream_alignment` function from PR #233 when the concept of a "seqan pretty debug stream" is implemented (#121)
  - close PR #233 if the code is really not needed anymore
  - Add the `get_cigar_vector` function (introduced in PR #414)
- replace
- SAM tag dictionary (#415)
  - PR #415 introduced a constexpr string literal in the file `sam_tag_dictionary.hpp` that triggers a warning on clang. If we support clang and require C++20 on GCC, we can replace this with a constexpr string. This bullet point is also in meta issue #404.
- `cigar_op` and cigar alphabet class (#414)
- SAM format
- BAM format
- BLAST format (for time and simplicity reasons, only the tabular format is considered for now, and the BLAST format will only be available as an output format)
- Read binary data
- Write binary data
- Endianness
- Formatted Numbers
- BGZF input stream (#317)
- BGZF output stream (#317)
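
Regarding the REF_ID TODO above, here is a minimal sketch of how a reference name could be translated into a number. It assumes the BAM convention that the number is the 0-based index of the reference in the header, with -1 standing for "no reference" (`*` in SAM). The class `ref_id_dictionary` and its member names are hypothetical and only serve as illustration; they are not existing SeqAn3 API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: maps reference names (in header order) to numeric ids.
class ref_id_dictionary
{
public:
    explicit ref_id_dictionary(std::vector<std::string> const & names)
    {
        // The index must fit into a 32 bit signed integer, as in BAM.
        assert(names.size() <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()));

        for (std::size_t i = 0; i < names.size(); ++i)
            name_to_id.emplace(names[i], static_cast<std::int32_t>(i));
    }

    // Returns the 0-based index of `name`, or -1 for "*" (no reference).
    // Throws std::out_of_range if the name is not part of the header.
    std::int32_t id_of(std::string const & name) const
    {
        if (name == "*")
            return -1;

        auto const it = name_to_id.find(name);
        if (it == name_to_id.end())
            throw std::out_of_range{"unknown reference name: " + name};

        return it->second;
    }

private:
    std::unordered_map<std::string, std::int32_t> name_to_id;
};
```

With such a lookup, the SAM reader could store the same numeric REF_ID as the BAM reader and keep the names only in the header.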
The key field of the alignment record will be `seqan3::field::alignment`, which holds a `seqan3::alignment` object with two sequences. The alignment IO thereby only considers pairwise alignments, as multiple alignments are a different use case. The two sequences in the alignment object will preferably be of the `gap_decorator` type, which only stores the gap information and a reference to the `seqan3::field::SEQ`/`seqan3::field::REF_SEQ` objects.
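
To illustrate the idea of storing only the gap information plus a reference to the sequence, here is a strongly simplified sketch. The class `gapped_view` and its members are hypothetical and are not the actual `gap_decorator` interface; a plain `std::string` stands in for the SEQ/REF_SEQ object.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Simplified stand-in for the decorator idea: the sequence is only
// referenced, and the gaps live in a separate table.
class gapped_view
{
public:
    explicit gapped_view(std::string const & sequence) : seq{&sequence} {}

    // Inserts `count` gap characters in front of ungapped position `seq_pos`.
    void insert_gap_before(std::size_t seq_pos, std::size_t count)
    {
        assert(seq_pos <= seq->size());
        gaps[seq_pos] += count;
        total_gaps += count;
    }

    // Length of the gapped sequence.
    std::size_t size() const { return seq->size() + total_gaps; }

    // Character at gapped position `pos`; '-' denotes a gap.
    char operator[](std::size_t pos) const
    {
        assert(pos < size());
        std::size_t shift = 0; // gap characters to the left of the current run

        for (auto const & gap : gaps)
        {
            std::size_t const run_begin = gap.first + shift; // gapped start of this run
            if (pos < run_begin)
                break;
            if (pos < run_begin + gap.second)
                return '-';
            shift += gap.second;
        }
        return (*seq)[pos - shift];
    }

    // Materialises the gapped sequence, e.g. for printing.
    std::string to_string() const
    {
        std::string result;
        for (std::size_t i = 0; i < size(); ++i)
            result.push_back((*this)[i]);
        return result;
    }

private:
    std::string const * seq;                 // not owned
    std::map<std::size_t, std::size_t> gaps; // ungapped position -> gap length
    std::size_t total_gaps{0};
};
```

The underlying sequence is never copied or modified, which is the property the alignment field relies on. The linear scan in `operator[]` is only for brevity and is not meant as a performance recommendation.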
For SAM/BAM files, the alignment is usually incomplete since the reference sequence is unknown. This will be handled by introducing a dummy/proxy sequence that consists of `N`s. If the reference characters are queried, an exception is thrown to warn the user that no reference information is available. The alignment file (at least for the BAM/SAM formats) should also be constructible with a second file, the reference sequence file, in which case the reference is loaded and the alignments are properly instantiated. There is also the idea of "lazy loading", where the provided reference sequence file is not opened until an alignment is queried for reference-related information, at which point the reference sequence file is accessed (via the index?) and the corresponding information is extracted.
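
A rough sketch of both behaviours (throwing when no reference is available, and loading lazily on the first query) could look as follows. The class `proxy_reference` is hypothetical, and the `loader` callback is only a stand-in for opening the reference sequence file; how the file and its index would actually be accessed is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical proxy for a reference sequence that may be unknown.
class proxy_reference
{
public:
    using loader_type = std::function<std::string()>;

    // Without a loader the proxy only knows the reference length.
    explicit proxy_reference(std::size_t length) : length_{length} {}

    // With a loader the sequence is fetched on the first character query.
    proxy_reference(std::size_t length, loader_type loader) :
        length_{length}, loader_{std::move(loader)}
    {}

    std::size_t size() const { return length_; }
    bool loaded() const { return loaded_; }

    // Returns the reference character at `pos`.
    // Throws std::logic_error if no reference information is available.
    char at(std::size_t pos)
    {
        if (!loaded_)
        {
            if (!loader_)
                throw std::logic_error{"no reference information available"};

            sequence_ = loader_(); // the first query triggers the load
            assert(sequence_.size() == length_);
            loaded_ = true;
        }
        return sequence_.at(pos);
    }

private:
    std::size_t length_;
    loader_type loader_{};
    std::string sequence_{};
    bool loaded_{false};
};
```

Constructing the proxy is cheap in both cases; the cost of reading the reference is only paid by users who actually ask for reference characters.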
Mapping of the required SAM format column names (as in the SAM specification) to fields:
# | SAM name | FIELD name |
---|---|---|
1 | QNAME | ID |
2 | FLAG | FLAG |
3 | RNAME | REF_ID |
4 | POS | REF_OFFSET |
5 | MAPQ | MAPQ |
6 | CIGAR | implicitly stored in ALIGNMENT |
7 | RNEXT | MATE (tuple pos 0) |
8 | PNEXT | MATE (tuple pos 1) |
9 | TLEN | MATE (tuple pos 2) |
10 | SEQ | SEQ |
11 | QUAL | QUAL |
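
The mapping in the table can be sketched as a plain record type plus a parser for the 11 mandatory columns. The struct `sam_record` and the function `parse_sam_line` are hypothetical names for illustration. Values are stored verbatim (no coordinate conversion), REF_ID is kept as a string as it currently is, and the CIGAR string is kept as text here instead of being turned into the ALIGNMENT field.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record holding the mandatory SAM columns under their field names.
struct sam_record
{
    std::string id;                                           // 1  QNAME -> ID
    std::uint16_t flag{};                                     // 2  FLAG  -> FLAG
    std::string ref_id;                                       // 3  RNAME -> REF_ID
    std::int32_t ref_offset{};                                // 4  POS   -> REF_OFFSET (verbatim)
    std::uint8_t mapq{};                                      // 5  MAPQ  -> MAPQ
    std::string cigar;                                        // 6  CIGAR (would end up in ALIGNMENT)
    std::tuple<std::string, std::int32_t, std::int32_t> mate; // 7-9 RNEXT, PNEXT, TLEN -> MATE
    std::string seq;                                          // 10 SEQ   -> SEQ
    std::string qual;                                         // 11 QUAL  -> QUAL
};

// Splits one tab-separated SAM line and fills the record.
// Optional tag columns after column 11 are ignored in this sketch.
inline sam_record parse_sam_line(std::string const & line)
{
    std::vector<std::string> columns;
    std::istringstream stream{line};
    for (std::string column; std::getline(stream, column, '\t');)
        columns.push_back(column);

    if (columns.size() < 11)
        throw std::invalid_argument{"a SAM record needs at least 11 tab-separated columns"};

    sam_record record;
    record.id = columns[0];
    record.flag = static_cast<std::uint16_t>(std::stoul(columns[1]));
    record.ref_id = columns[2];
    record.ref_offset = std::stoi(columns[3]);
    record.mapq = static_cast<std::uint8_t>(std::stoul(columns[4]));
    record.cigar = columns[5];
    record.mate = std::make_tuple(columns[6], std::stoi(columns[7]), std::stoi(columns[8]));
    record.seq = columns[9];
    record.qual = columns[10];
    return record;
}
```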
The REF_SEQ field will be required to store the alignment information correctly. The (sequence/query) OFFSET will be required to store the soft clipping information at the read start (end clipping will be visible by the length of the alignment that holds the infix of SEQ).
Note: Because the alignment is represented as an infix of only the aligned parts of the two sequences, clipping and local alignments of BLAST can easily be visualized via the alignment and the (REF_)OFFSET values with view_to_position/position_to_view/view_to_clipped_position etc. The only downside is that hard clipping information in the SAM file will be lost and cannot be recovered. As hard clipping is usually a quality cut-off and the respective sequence part is not present in the SAM file, we expect this information to be irrelevant in 99% of the cases. If requested, the hard clipping information could be stored in a separate field later on.
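
As an illustration of how the clipping information relates to OFFSET, the following sketch derives the soft clipping at both read ends from a CIGAR string and silently drops hard clipping, as described above. The names `clipping_info` and `soft_clipping_from_cigar` are hypothetical, and the sketch assumes the usual CIGAR operations of the SAM specification (M, I, D, N, S, H, P, =, X).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical summary of what the clipping operations in a CIGAR imply.
struct clipping_info
{
    std::size_t offset{};               // soft-clipped bases at the read start (OFFSET)
    std::size_t end_clipping{};         // soft-clipped bases at the read end
    std::size_t aligned_query_length{}; // bases of SEQ inside the aligned infix
};

inline clipping_info soft_clipping_from_cigar(std::string const & cigar)
{
    clipping_info result;
    std::size_t count = 0;
    bool has_count = false;    // did we read digits for the current operation?
    bool seen_aligned = false; // have we passed the first non-clipping operation?

    for (char const c : cigar)
    {
        if (std::isdigit(static_cast<unsigned char>(c)))
        {
            count = count * 10 + static_cast<std::size_t>(c - '0');
            has_count = true;
            continue;
        }

        if (!has_count)
            throw std::invalid_argument{"CIGAR operation without a count"};

        switch (c)
        {
            case 'S': // soft clipping: start or end, depending on position
                (seen_aligned ? result.end_clipping : result.offset) += count;
                break;
            case 'H': // hard clipping: dropped, the bases are not in SEQ anyway
                break;
            case 'M': case 'I': case '=': case 'X': // consume query bases
                result.aligned_query_length += count;
                seen_aligned = true;
                break;
            case 'D': case 'N': case 'P': // consume no query bases
                seen_aligned = true;
                break;
            default:
                throw std::invalid_argument{"unknown CIGAR operation"};
        }
        count = 0;
        has_count = false;
    }

    if (has_count)
        throw std::invalid_argument{"CIGAR ends with a count but no operation"};

    return result;
}
```

For a CIGAR such as `2H3S5M1I2D4M6S`, this yields an OFFSET of 3 and an end clipping of 6, so the aligned infix covers 10 of the 19 bases in SEQ; the `2H` leaves no trace.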
Mapping of the BLAST tabular format column names (as in the NCBI Handbook) to fields:
# | BLAST name | FIELD name |
---|---|---|
1 | Query id | ID |
2 | Subject id | REF_ID |
3 | % identity | implicitly stored in ALIGNMENT |
4 | alignment length | implicitly stored in ALIGNMENT |
5 | mismatches | implicitly stored in ALIGNMENT |
6 | gap openings | implicitly stored in ALIGNMENT |
7 | q. start | OFFSET |
8 | q. end | implicitly stored in OFFSET and ALIGNMENT |
9 | s. start | REF_OFFSET |
10 | s. end | implicitly stored in REF_OFFSET and ALIGNMENT |
11 | e-value | EVALUE |
12 | bit score | BIT_SCORE |
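
To make "implicitly stored in ALIGNMENT" more concrete, here is a sketch that derives the BLAST columns 3 to 6 from two gapped alignment rows. The names `blast_statistics` and `compute_blast_statistics` are hypothetical, the rows are plain strings with `-` as the gap character, and percent identity is assumed to be the number of identical columns divided by the alignment length. The query span can be added to the query start to obtain the q. end column (start + span - 1 for inclusive coordinates).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical container for values that can be derived from the alignment.
struct blast_statistics
{
    std::size_t alignment_length{}; // number of alignment columns
    std::size_t matches{};          // columns with identical characters
    std::size_t mismatches{};       // columns with two different characters
    std::size_t gap_openings{};     // number of gap runs in both rows
    std::size_t query_span{};       // non-gap characters in the query row
    double percent_identity{};      // 100 * matches / alignment_length
};

inline blast_statistics compute_blast_statistics(std::string const & query_row,
                                                 std::string const & subject_row)
{
    if (query_row.size() != subject_row.size())
        throw std::invalid_argument{"both alignment rows must have the same length"};

    blast_statistics stats;
    stats.alignment_length = query_row.size();
    bool in_query_gap = false;
    bool in_subject_gap = false;

    for (std::size_t i = 0; i < query_row.size(); ++i)
    {
        bool const query_gap = query_row[i] == '-';
        bool const subject_gap = subject_row[i] == '-';
        assert(!(query_gap && subject_gap)); // not expected in a pairwise alignment

        // A gap opening is a gap character that does not continue a gap run.
        if (query_gap && !in_query_gap)
            ++stats.gap_openings;
        if (subject_gap && !in_subject_gap)
            ++stats.gap_openings;
        in_query_gap = query_gap;
        in_subject_gap = subject_gap;

        if (!query_gap)
            ++stats.query_span;

        if (!query_gap && !subject_gap)
        {
            if (query_row[i] == subject_row[i])
                ++stats.matches;
            else
                ++stats.mismatches;
        }
    }

    if (stats.alignment_length > 0)
        stats.percent_identity = 100.0 * stats.matches / stats.alignment_length;

    return stats;
}
```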
In bold are the fields that occur in both BLAST and SAM.
The BLAST output format will always print the 12 default columns, so no extra fields will be present for the user to specify.
ID, SEQ, REF_ID, REF_SEQ, ALIGNMENT, OFFSET, REF_OFFSET