Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

easy-search SAM alignment extends off the end of the reference #952

Open
schorlton-bugseq opened this issue Feb 17, 2025 · 0 comments
Open

Comments

@schorlton-bugseq
Copy link

Thanks for your hard work on MMSeqs!

I ran the following using v17.b804f installed via bioconda:

mmseqs easy-search --search-type 2 --format-mode 1 --greedy-best-hits 1 query.fna hiv.fna result.sam tmp

Query:

>test
CAAAGACTGCTGACACAGACTGCTGACATGAAGATTGCTGACAGGGACTTTCCGCGGGACTTTCCAGGGGGGTGTGGTTTGGGCGGAGTTGGGGAGTGGCTAACCCTCAGATGCTGCATATAAGCAGCGGCTTCTCGCTTGTACTGGGTCTCTCTGATT

Reference: NC_001802.1

I got the following alignment:

@HD	VN:1.4	SO:queryname
@SQ	SN:NC_001802.1	LN:9181
test	16	NC_001802.1	9094	254	111M	*	0	0	AGAGAGACCCAGTACAAGCGAGAAGCCGCTGCTTATATGCAGCATCTGAGGGTTAGCCACTCCCCAACTCCGCCCAAACCACACCCCCCTGGAAAGTCCCGCGGAAAGTCC	*	AS:i:118	NM:i:13

However, I think there is a bug as the start coordinate is 9094 and alignment length 111 (from CIGAR) = end of alignment 9195, which is longer than the reference (9181bp). What I think is going on is the start coordinate is flipped with the end coordinate, as when I TBLASTX these sequences, the alignment spans ~8994 to ~9098:

Image

OR

Image

and when I output in SAM via blastn, I get the following:

@HD     VN:1.2  SO:coordinate   GO:reference
@SQ     SN:Query_1      LN:9181
@PG     ID:0    VN:2.16.0+      CL:blastn.REAL -query hiv.fna -db query.fna -outfmt 17 -task blastn    PN:blastn
BL_ORD_ID:0     0       Query_1 8956    255     18H37M2D104M    *       0       0       *    *AS:i:168        EV:f:1.36602e-40        NM:i:2  PI:f:85.11      BS:f:152.769

Not sure if related to #845. I tried reviewing the source but couldn't spot the issue and it looks like MMSeqs tries to do the correct thing (as per SAM spec) here,

int start = std::min(res.qStartPos, res.qEndPos);
int end = std::max(res.qStartPos, res.qEndPos);
, so I don't know - any ideas? Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant