First of all, it was fun to look at something other than "just" k-mers ;) nice idea!
I was looking at the strobemers papers/code, and in the original strobemers repo I see that the window is constrained near the end of the sequence. This of course makes sense, as the window can't extend beyond the end of the sequence. However, in my case this causes issues when a query is present in a reference flanked by a context that produces a lower hash, and/or when w_min goes out of bounds, leaving the end of the query unsampled.
Would then a simple solution be to index from both ends (reversing the hash list) and store only the canonical hashes? (In strobealign, the reverse iteration is necessary anyway for the reverse complement.) I did notice that in the multi-context seed preprint you mentioned:
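To make the end-of-sequence effect concrete, here is a minimal sketch of how the second-strobe window shrinks and eventually becomes empty near the end. The clamping rule, the function names, and the parameter values are assumptions chosen for illustration; the strobemers repo uses its own, somewhat different rule for shifting the window.

```python
def window_bounds(p, n_kmers, w_min, w_max):
    """Half-open range [start, end) of candidate positions for the second strobe.

    Assumed, simplified rule: the window is simply cut off at the last k-mer.
    """
    start = p + w_min
    end = min(p + w_max + 1, n_kmers)
    return start, end


def window_sizes(seq_len, k, w_min, w_max):
    """Number of candidate second strobes for every first-strobe position."""
    n_kmers = seq_len - k + 1
    sizes = []
    for p in range(n_kmers):
        start, end = window_bounds(p, n_kmers, w_min, w_max)
        sizes.append(max(0, end - start))  # 0 means nothing can be sampled here
    return sizes


# Hypothetical values: a 20 bp query, k=4, offsets 5..8
sizes = window_sizes(seq_len=20, k=4, w_min=5, w_max=8)
unsampled = sizes.count(0)  # first-strobe positions with an empty window
```

With these made-up values, the last few k-mer positions have an empty window, which is the "end of the query unsampled" situation described above.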
"The B and p make strobealign pick strobes earlier in the window more often, which is deemed suitable for the shortest reads"
So perhaps this is already addressed to some extent, although I could not entirely work out from the code how it actually considers the distance between strobe1 and strobe2 in its selection.
Anyway, here is an example.
Example
In my case the query sequences are short, so perhaps this is barely a problem for longer sequences. A minimal example:
For this test I just used the Python example code from the repo and adjusted it to also get the reverse strobes and the canonical hash.
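A rough sketch of what such an adjustment could look like is given below. It is not the code used for the images: the crc32 hash, the `prime` value, the window rule, the way the pair hash is combined, and the test sequence are all assumptions, and "canonical" is read here as "same value whether the pair was found in the forward or the reverse pass".

```python
import zlib


def kmer_hashes(seq, k):
    """Deterministic hash per k-mer (crc32 stands in for the real hash function)."""
    return [zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)]


def randstrobes(hashes, w_min, w_max, prime=997):
    """Yield (p1, p2); p2 minimizes (h[p1] + h[p2]) % prime inside the window."""
    n = len(hashes)
    for p1 in range(n):
        window = range(p1 + w_min, min(p1 + w_max + 1, n))
        if len(window) == 0:
            continue  # window fell off the end: no strobemer starts here
        p2 = min(window, key=lambda i: (hashes[p1] + hashes[i]) % prime)
        yield p1, p2


def forward_and_reverse(seq, k, w_min, w_max):
    """Return k-mer hashes plus forward and reverse pairs as (left, right)."""
    h = kmer_hashes(seq, k)
    n = len(h)
    fwd = set(randstrobes(h, w_min, w_max))
    # Reverse pass: same routine on the reversed hash list, then map the
    # indices back to forward coordinates so every pair is (left, right).
    rev = {(n - 1 - p2, n - 1 - p1) for p1, p2 in randstrobes(h[::-1], w_min, w_max)}
    return h, fwd, rev


def pair_hash(h, left, right):
    """Direction-independent pair hash: depends only on left/right position."""
    return h[left] + 3 * h[right]


seq = "ACGTTGCAAAAACTAAGGCTTACGATCGGA"  # made-up 30 bp sequence
h, fwd, rev = forward_and_reverse(seq, k=4, w_min=3, w_max=6)
canonical = {pair_hash(h, left, right) for left, right in fwd | rev}
```

In this sketch the forward pass anchors strobemers at the left end and cannot start one in the last few positions, while the reverse pass anchors them at the right end, so together they cover both ends of the sequence. Pairs found by both passes collapse to a single entry in `canonical`.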
In the images:
rectangle = forward strobemers
triangle = reverse strobemers
color = canonical hash (forward and reverse can of course produce the same pairs, which can later be filtered out), or gray when not shared between seq1 and seq2
This of course gets much more pronounced when we create a bigger window (basically targeting the A's). In the example below, most forward strobemers pair with CTAA.
In this example, the forward strobes cannot detect the similarity anymore because they immediately pick the lower k-mers, whereas the reverse ones can, as at some point they skip over the AAAs since the "start" is fixed in the strobemers.
Is this something that is still relevant or is this already addressed in strobealign?
Thanks for your thorough description and examples! Canonical strobemers are something we are really interested in, both within and outside strobealign. It is on our TODO list to try it within strobealign.
If I interpret your examples correctly, you are saying that by considering canonical strobemers, we may have more 'canonical' strobemer matches in a read than the matches in each individual direction for non-canonical randstrobes. This is important and may help read mapping.
We should, of course, keep in mind that the total number of canonical strobemers is smaller than when creating strobemers in both directions, so there may also be a negative effect. I guess we have to evaluate this.
"Would then a simple solution be to index from both ends (reversing the hash list) and store only the canonical hashes? (in strobealign, the reverse iteration is necessary anyway for the reverse complement)."
Yes, we are hoping to try this soon.
"The B and p make strobealign pick strobes earlier in the window more often, which is deemed suitable for the shortest reads"
I consider this a bit of a hack that worked in practice that we may be able to avoid down the road with multi-context seeds.
(Also, sorry for the slow reply - I realized my old university email was discontinued on Nov 18th, so I haven't gotten any notifications from GitHub since then. Trying to catch up now.)