Using strobemers bidirectionally to minimize effect of low hash bias + read border cases #17

Open
rickbeeloo opened this issue Nov 25, 2024 · 1 comment

Comments

@rickbeeloo

Hey @ksahlin!

Background

First of all, it was fun to look at something other than "just" k-mers ;) nice idea!

I was looking at the strobemers papers/code, and in the original strobemers repo I see that the window is constrained near the end of the sequence. This of course makes sense, as we can't extend the window beyond the end of the sequence. However, in my case this causes issues when a query is present in a reference flanked by a context that produces a lower hash, and/or when w_min goes out of bounds, leaving the end of the query unsampled.

Would a simple solution then be to index from both ends (reversing the hash list) and store only the canonical hashes? (In strobealign the reverse iteration is necessary anyway for the reverse complement.) I did notice that in the multi-context seed preprint you mentioned:

The B and p make strobealign pick strobes earlier in the window more often, which is deemed suitable for the shortest reads

So perhaps this is already addressed to some extent, although I could not entirely work out from the code how it actually considers the distance between strobe1 and strobe2 in its selection.

Anyway, here is an example.

Example

In my case the query sequences are short, so perhaps this is barely a problem for longer sequences. A minimal example:

seq1 = "ATGCATCGACT"
seq2 = "ATGCATCGACTAAAAAA"

k_size = 10
w_min = 2
w_max = 4

For this test I just used the Python example code from the repo and adjusted it to also produce the reverse strobes and the canonical hash.
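For readers who want to try the idea, here is a minimal sketch of bidirectional selection with canonical keys. It is not the repo's code and not the exact script behind the images. It assumes that k_size = 10 means two strobes of 5 nt each, that the pair is linked by a plain sum of hashes, and that positions whose w_min offset falls outside the sequence are skipped. All function names (`canonical_kmer_hashes`, `randstrobe_keys`, `bidirectional_keys`) are hypothetical.

```python
import hashlib

MASK = (1 << 64) - 1
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def kmer_hash(kmer):
    # Deterministic 64-bit hash (Python's built-in hash() is salted per process).
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def canonical_kmer_hashes(seq, k):
    """One strand-independent hash per k-mer position."""
    return [
        min(kmer_hash(seq[i:i + k]), kmer_hash(revcomp(seq[i:i + k])))
        for i in range(len(seq) - k + 1)
    ]

def randstrobe_keys(hashes, w_min, w_max):
    """Order-2 selection over a hash list; returns order-independent pair keys."""
    n = len(hashes)
    keys = set()
    for i in range(n):
        lo, hi = i + w_min, min(i + w_max, n - 1)
        if lo > hi:  # w_min is out of bounds: the tail stays unsampled (assumption)
            break
        # Assumed link function: smallest (h1 + h2) mod 2^64 within the window.
        j = min(range(lo, hi + 1), key=lambda p: (hashes[i] + hashes[p]) & MASK)
        keys.add(tuple(sorted((hashes[i], hashes[j]))))
    return keys

def bidirectional_keys(seq, k, w_min, w_max):
    """Union of keys selected left-to-right and on the reversed hash list."""
    hashes = canonical_kmer_hashes(seq, k)
    return randstrobe_keys(hashes, w_min, w_max) | randstrobe_keys(hashes[::-1], w_min, w_max)

seq1 = "ATGCATCGACT"
seq2 = "ATGCATCGACTAAAAAA"
k, w_min, w_max = 5, 2, 4  # k = 5 is an assumption (k_size 10 split over two strobes)

fwd_shared = (randstrobe_keys(canonical_kmer_hashes(seq1, k), w_min, w_max)
              & randstrobe_keys(canonical_kmer_hashes(seq2, k), w_min, w_max))
both_shared = bidirectional_keys(seq1, k, w_min, w_max) & bidirectional_keys(seq2, k, w_min, w_max)
print(len(fwd_shared), len(both_shared))
```

Because the keys are built from canonical k-mer hashes and sorted within each pair, a sequence and its reverse complement yield the same bidirectional key set, and the forward-only matches are always a subset of the bidirectional ones. The actual counts depend on the hash and link function chosen.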

In the images:

  • rectangle = forward strobemer
  • triangle = reverse strobemer
  • color = canonical hash (forward and reverse can of course produce the same pairs, which can later be filtered out), or gray when not shared between seq1 and seq2

[image: forward and reverse strobemers for seq1 and seq2 with w_min = 2, w_max = 4]

This of course gets much more pronounced when we create a bigger window (basically targeting the A's); below, most forward strobemers pair with CTAA.

seq1 = "ATGCATCGACT"
seq2 = "ATGCATCGACTAAAAAA"

k_size = 10
w_min = 5
w_max = 10

[image "Frame 11": forward and reverse strobemers for seq1 and seq2 with w_min = 5, w_max = 10]

In this example, the forward strobes cannot detect the similarity anymore because they immediately pick the lower-hash k-mers, whereas the reverse strobes can: at some point they skip over the AAAs, since the "start" is fixed in the strobemers.
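The effect can be reproduced with a toy example that uses hand-picked integers in place of real k-mer hashes. This is only a sketch under simplifying assumptions: the second strobe is simply the smallest hash in the window (a plain-sum link reduces to this), the values are made up, and `select_pairs` is a hypothetical helper, not a function from the repo or from strobealign.

```python
def select_pairs(hashes, w_min, w_max):
    """Pair each position with the smallest hash in its downstream window."""
    n = len(hashes)
    pairs = set()
    for i in range(n):
        lo, hi = i + w_min, min(i + w_max, n - 1)
        if lo > hi:  # window start is out of bounds: stop sampling
            break
        j = min(range(lo, hi + 1), key=lambda p: hashes[p])
        pairs.add((hashes[i], hashes[j]))
    return pairs

def select_pairs_reverse(hashes, w_min, w_max):
    """Same selection on the reversed list, stored as (left, right) pairs."""
    return {(b, a) for a, b in select_pairs(hashes[::-1], w_min, w_max)}

# Made-up hashes: the reference is the query followed by a low-hash flank.
query = [50, 60, 70, 80, 90, 55]
reference = query + [1, 2, 3]
w_min, w_max = 2, 4

fwd_q = select_pairs(query, w_min, w_max)
fwd_r = select_pairs(reference, w_min, w_max)
rev_q = select_pairs_reverse(query, w_min, w_max)
rev_r = select_pairs_reverse(reference, w_min, w_max)

print(len(fwd_q & fwd_r), "of", len(fwd_q), "forward pairs shared")   # 2 of 4
print(len(rev_q & rev_r), "of", len(rev_q), "reverse pairs shared")   # 4 of 4
```

In the forward direction, the query positions with hashes 70 and 80 pair with 55, but in the reference they pair with the flanking hash 1, so those two seeds are lost. In the reverse direction the windows point away from the flank, so every reverse pair of the query is also selected in the reference.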

Is this something that is still relevant or is this already addressed in strobealign?

@ksahlin
Owner

ksahlin commented Dec 10, 2024

Hi @rickbeeloo (@marcelm and @drtconway),

Thanks for your thorough description and examples! Canonical strobemers are something we are really interested in, both within and outside strobealign. It is on our TODO list to try it within strobealign.

If I interpret your examples correctly, you are saying that by considering canonical strobemers, we may have more 'canonical' strobemer matches in a read than the matches in each individual direction for non-canonical randstrobes. This is important and may help read mapping.

We should, of course, keep in mind that the total number of canonical strobemers is smaller than when creating strobemers in both directions, so there may also be a negative effect. I guess we have to evaluate this.

Would then a simple solution be to index from both ends (reversing the hash list) and store only the canonical hashes? (in strobealign, the reverse iteration is necessary anyway for the reverse complement).

Yes, we are hoping to try this soon.

The B and p make strobealign pick strobes earlier in the window more often, which is deemed suitable for the shortest reads

I consider this a bit of a hack that worked in practice that we may be able to avoid down the road with multi-context seeds.

(Also, sorry for the slow reply - I realized my old university email was discontinued on Nov 18th, so I haven't gotten any notifications from GitHub since then. Trying to catch up now.)
