Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More complete rGFA support (experimental) #4113

Open
wants to merge 48 commits into
base: master
Choose a base branch
from
Open

More complete rGFA support (experimental) #4113

wants to merge 48 commits into from

Conversation

glennhickey
Copy link
Contributor

Changelog Entry

To be copied to the draft changelog by merger:

  • vg paths -R option added to compute an rGFA cover (based on a reference sample) from a (mutable) graph, and add it as path fragments.

Description

This is a fixup and refactor of #3891. It uses the same greedy (on steps) method to select snarl traversals to add to the cover. The cover is now stored in a (really simple) in-memory index that maps nodes to fragments that cover them, and these fragments can be walked back to the rank-0 reference.

rGFA covers can be computed from full paths, loaded from path fragments, and saved to path fragments. They are always defined relative to a rank-0 / reference sample which stays as a normal reference path in the graph.

Off-reference rGFA path fragments are REFERENCE-sense paths with a special sample name (_rGFA_) which allows them to be easily selected. Would like to find something less hacky, but this should work for experiments now.

Corresponding Cactus update is: ComparativeGenomicsToolkit/cactus#1186

@JosephLalli
Copy link

Just wanted to say I've been experimenting with an end user hack to do something similar*, and I've seen the largest impact in accuracy from variants called around insertions. Reads which straddle a reference and insertion contig will align to one or the other.

As a result, depth around insertions flattens out, or even dips a little. Here's an example in HG002 around chr10:133111600, where vg haplotypes identified two overlapping 900bp insertions (one maternal, one paternal).

Example insertion chr10-133mb

Surjecting reads to both insertions and GRCh38 led to reduced depth in the region, causing deepvariant to miss some GIAB annotated variants. However, the reads which surjected to the reference sequence now had their non-reference sequence softclipped. This led to fewer false positive calls around insertions due to misaligned reads.

PS: This is a really good test region. Both the insertions and the surrounding reference sequence are complex enough to allow for uniquely mapping reads, and the accuracy of variant calling is greatly improved when reads from either insertion are not mismapped to the linear reference sequence. It is also in an intron of a CNS expressed gene that has been linked to appetite and overeating, suggesting potential clinical relevance.

I'm very excited to eventually see this code on the main branch!


  1. KMC
  2. Create personalized GFA with haplotypes
  3. Deconstruct GFA against reference of interest
  4. Isolate insertions > 100bp, extract paths from deconstruction vcf
  5. Format paths as GFA haplotypes, append to end of GFA file
    • contig names were just chr|start|stop|unique_id
  6. Convert to GBZ, mark all insertion paths as references
  7. Extract per-sample fastas
  8. Bring all fastas in cohort together, harmonize the unique id of each insertion. Create translation file of old contig name -> harmonized contig name.

Then:
Map reads to GBZ
Use samtools reheader, using sed to remove haplotype information and awk to apply the old contig name -> harmonized contig name translation

@JosephLalli
Copy link

Hi @glennhickey,

I wonder if you are still planning to implement this feature?

-Joe

@glennhickey
Copy link
Contributor Author

Yes! But I've been completely bogged down with other things. The only good news is that I think I'm on the hook to present about it in a few weeks so I won't have much choice but to get moving. I still have your examples to reproduce problems that I will check out. Sorry for the long wait.

@JosephLalli
Copy link

Not a problem @glennhickey, I'm so grateful for the work you've put in. I just wanted to double-check, as I'm a few months out from putting together a manuscript comparing the effect of vg-deepvariant with bwamem-haplotypecaller and bwamem-deepvariant, and I think this would be a good addition to the paper.

I'm in the middle of a too-many-projects-too-little-time period myself, so I am very, very familiar with the feeling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants