AggregateSvPileup should account for inaccurate split-read breakpoint positions #13

Open
pamelarussell opened this issue May 24, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@pamelarussell
Contributor

Currently AggregateSvPileup merges breakpoint events whose left and right breakpoints are both within a distance threshold of each other, regardless of the type of read evidence supporting them: split-read (the breakpoint occurs inside a sequenced read) or read-pair (the breakpoint occurs in the unsequenced insert between mates).

However, these two types of evidence localize the breakpoint with different precision and should use different distance thresholds. While split-read evidence is likely to point to a very precise position, the position inferred from a read-pair event can be off by as much as the inner distance (insert size minus the read lengths). Something similar to the following procedure should be used instead:

  1. "Seed" clusters by clustering only breakpoints that have split-read evidence
  2. "Seed" additional clusters with breakpoints that have read-pair evidence
  3. Use read-pair events to aggregate clusters when the distance is within the inner distance (computed empirically by sampling)
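
As a rough illustration of step 3, the inner distance could be estimated from a sample of read pairs. This is only a sketch: the function name, the `(insert_size, read1_length, read2_length)` tuple layout, and the use of a high percentile as the threshold are all assumptions for illustration, not anything AggregateSvPileup does today.

```python
import math
from typing import Iterable, Tuple


def estimate_max_inner_distance(
    pairs: Iterable[Tuple[int, int, int]],
    percentile: float = 0.99,
) -> int:
    """Estimate an upper bound on the inner distance from sampled read pairs.

    Each pair is (insert_size, read1_length, read2_length). The inner distance
    is the insert size minus both read lengths, floored at zero because
    overlapping mates leave no unsequenced gap.
    """
    inner = sorted(max(0, ins - r1 - r2) for ins, r1, r2 in pairs)
    if not inner:
        raise ValueError("no read pairs were sampled")
    # Nearest-rank percentile: the smallest sampled value that has at least
    # `percentile` of the sample at or below it.
    rank = max(1, math.ceil(percentile * len(inner)))
    return inner[rank - 1]
```

Taking a high percentile rather than the maximum is one way to keep a few chimeric or mis-mapped pairs from inflating the threshold; the right cutoff would need to be checked against real libraries.
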
@pamelarussell pamelarussell added the enhancement New feature or request label May 24, 2022
@tfenne
Member

tfenne commented Sep 28, 2023

Agreed - I think a multi-pass strategy would work, though I think I would suggest something different:

  • Have parameters max-split-read-distance and read-pair-inner-distance (or compute the latter)
  • Aggregate events with split-read evidence within max-split-read-distance; this parameter should probably be set based on aligner parameters (e.g. how far from the breakpoint can a single sequencing error be and still cause the read to get clipped at that point?)
  • Take all read-pair evidence and see if it can be said to support a single event defined by aggregating split reads, and if so assign it; in this case I think it should determine compatibility by whether the sum of the distances on both sides is < the max inner distance, rather than evaluating each side independently.
  • Take remaining read pairs that could support multiple events, and try to tie-break based on position, or split the count?
  • Take the remaining read pairs and cluster those independently
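
The passes above might look roughly like the following. This is a minimal sketch under several assumptions: the `Event` type, the `aggregate` and `cluster` helpers, and the greedy single-linkage clustering are hypothetical and not taken from the fgsv code; all events are assumed to share the same contigs and orientation; any event with at least one split read is treated as split-read evidence; ambiguous read pairs are resolved by nearest seed cluster rather than by splitting the count; and leftover read pairs are clustered with the same summed-distance test, which the comment does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    left: int             # left breakpoint position
    right: int            # right breakpoint position
    split_reads: int = 0  # number of supporting split reads
    read_pairs: int = 0   # number of supporting read pairs


def cluster(events: List[Event], fits: Callable[[Event, Event], bool]) -> List[List[Event]]:
    """Greedy single-linkage: join the first cluster holding a member that fits."""
    clusters: List[List[Event]] = []
    for ev in sorted(events, key=lambda e: (e.left, e.right)):
        home = next((c for c in clusters if any(fits(ev, m) for m in c)), None)
        if home is None:
            clusters.append([ev])
        else:
            home.append(ev)
    return clusters


def summed_offset(a: Event, b: Event) -> int:
    """Total displacement across both sides of the breakpoint."""
    return abs(a.left - b.left) + abs(a.right - b.right)


def aggregate(
    events: List[Event],
    max_split_read_distance: int,
    read_pair_inner_distance: int,
) -> List[List[Event]]:
    split = [e for e in events if e.split_reads > 0]
    pairs = [e for e in events if e.split_reads == 0]

    # Pass 1: seed clusters from split-read events, checking each side
    # independently against the tight threshold.
    def close_split(a: Event, b: Event) -> bool:
        return (abs(a.left - b.left) <= max_split_read_distance
                and abs(a.right - b.right) <= max_split_read_distance)

    seeds = cluster(split, close_split)
    assigned = [list(c) for c in seeds]

    # Pass 2: a read-pair event is compatible with a seed cluster when the
    # SUM of its offsets on both sides is below the inner distance.
    leftovers: List[Event] = []
    for ev in pairs:
        scored = [(min(summed_offset(ev, m) for m in c), i) for i, c in enumerate(seeds)]
        compatible = [s for s in scored if s[0] < read_pair_inner_distance]
        if compatible:
            # Pass 3 (assumed tie-break): the nearest seed cluster wins.
            _, best = min(compatible)
            assigned[best].append(ev)
        else:
            leftovers.append(ev)

    # Pass 4: cluster whatever read-pair evidence is left on its own.
    def close_pairs(a: Event, b: Event) -> bool:
        return summed_offset(a, b) < read_pair_inner_distance

    return assigned + cluster(leftovers, close_pairs)
```

With the summed test, a read pair offset by 150 bp on each side of a split-read event is rejected at an inner distance of 200, whereas a per-side test would accept it. Read pairs are always compared against the split-read seeds only, so one assigned read pair cannot pull in another that is further away.
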
