Add a function to read mate alignments #329
Draft
Summary
This PR adds a function cljam.io.sam/read-mate-alignments for BAM files, which reads the R1/R2 counterpart alignments of the given alignments.
Added functions
This function can be used to collect alignments around some regions while keeping pairs together, like samtools view --fetch-pairs.
When used this way, it is still necessary to recheck which alignment pairs with which. To make things easier, I added another function, cljam.io.sam/make-pairs, that returns a sequence with paired alignments grouped together.
To achieve this functionality, I implemented a function cljam.io.bam-index/get-spans-for-regions that queries the BAI index for the chunks corresponding to multiple regions at once.
Implementation details
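To illustrate why querying multiple regions at once helps, here is a minimal sketch of coalescing the spans returned for several regions so that no part of the file is read twice. It is not the implementation of cljam.io.bam-index/get-spans-for-regions: the name merge-spans is hypothetical, and spans are assumed to be plain [beg end] pairs of file offsets.

```clojure
;; Hypothetical helper, not part of cljam: merges overlapping or touching
;; [beg end] spans into a sorted vector of non-overlapping spans.
(defn merge-spans
  [spans]
  (reduce (fn [acc [beg end]]
            (let [[pbeg pend] (peek acc)]
              (if (and pend (<= beg pend))
                ;; The span starts inside (or right at the end of) the
                ;; previous one, so extend the previous span.
                (conj (pop acc) [pbeg (max pend end)])
                (conj acc [beg end]))))
          []
          (sort spans)))

;; Spans collected for three regions, two of which overlap.
(merge-spans [[10 20] [15 30] [40 50] [0 5]])
;; => [[0 5] [10 30] [40 50]]
```

Sorting first means each span only needs to be compared with the most recently emitted one, so the merge is a single pass.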
cljam.io.sam/read-mate-alignments is not yet supported for SAM and CRAM.
When searching for mates, a large number of alignment blocks that do not meet the criteria must be discarded. Therefore, the current implementation decodes only the minimal set of fields necessary for condition evaluation and immediately rejects blocks that do not satisfy the criteria.
As a result, a significant part of the execution time is spent decompressing BGZF blocks, so reducing the number of chunks accessed would make a big difference.
However, the current implementation is limited to what can be achieved by combining existing functionality, leaving room for further optimization in the future.
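The minimal decoding described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: mate-candidate? and fixed-fields are hypothetical names, the map keys are made up for the example, and the buffer is assumed to hold one BAM record body (the bytes right after block_size) in little-endian order. The field offsets follow the BAM record layout: refID at byte 0, pos at byte 4, flag at byte 14.

```clojure
(import '(java.nio ByteBuffer ByteOrder))

;; Hypothetical predicate: reads only refID, pos and flag from the fixed-size
;; part of a record and rejects it without decoding name, CIGAR, sequence, etc.
(defn mate-candidate?
  [^ByteBuffer bb {:keys [ref-id pos first?]}]
  (and (= (.getInt bb 0) ref-id)                         ; refID (0-based)
       (= (.getInt bb 4) pos)                            ; pos (0-based)
       (let [flag (bit-and (.getShort bb 14) 0xFFFF)]    ; flag as unsigned
         ;; Bit 6 (0x40) marks the first segment of a pair.
         (= first? (bit-test flag 6)))))

;; Hypothetical helper that builds just the 32-byte fixed part of a record,
;; only to exercise the predicate above.
(defn fixed-fields
  [ref-id pos flag]
  (doto (.order (ByteBuffer/allocate 32) ByteOrder/LITTLE_ENDIAN)
    (.putInt 0 (int ref-id))
    (.putInt 4 (int pos))
    (.putShort 14 (short flag))))

;; Flag 0x83 = paired, properly aligned, last segment (R2).
(mate-candidate? (fixed-fields 2 999 0x83) {:ref-id 2 :pos 999 :first? false})
;; => true
```

A real mate search would also have to compare read names, which requires reading the variable-length part of the record; that step is omitted here.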
Tests
Since we don't have a good BAM file with paired reads of appropriate size, I added auto-generated test cases for cljam.io.sam/read-mate-alignments. Each test writes a temporary BAM file containing paired alignments on the fly, then reads it back and checks the results.
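As a rough idea of what auto-generated paired test data can look like, the sketch below builds R1/R2 alignment maps whose mate fields point at each other. It is not the test code in this PR: make-pair and random-pairs are hypothetical names, the reference name and position ranges are arbitrary, and writing the maps to a temporary BAM file is left out.

```clojure
;; Hypothetical generator for one pair. Flag 99 = paired, proper pair,
;; mate reverse, first segment; flag 147 = paired, proper pair, reverse,
;; last segment.
(defn make-pair
  [qname rname pos1 pos2]
  (let [base {:qname qname :rname rname :rnext "=" :mapq 60}]
    [(assoc base :flag 99 :pos pos1 :pnext pos2)
     (assoc base :flag 147 :pos pos2 :pnext pos1)]))

;; Generates n pairs from a seeded RNG so that a failing case can be replayed.
(defn random-pairs
  [seed n max-pos]
  (let [rnd (java.util.Random. seed)]
    (vec (for [i (range n)]
           (let [p1 (inc (.nextInt rnd max-pos))
                 p2 (+ p1 (.nextInt rnd 500))]
             (make-pair (str "read" i) "chr1" p1 p2))))))

;; Every generated R1 should point at its R2 and vice versa.
(every? (fn [[r1 r2]]
          (and (= (:qname r1) (:qname r2))
               (= (:pnext r1) (:pos r2))
               (= (:pnext r2) (:pos r1))))
        (random-pairs 42 100 100000))
;; => true
```

Seeding the generator keeps the tests reproducible while still covering many mate positions.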