Add normalizer that snaps together redundant path traversals through sites #4396

Merged
merged 5 commits into from
Sep 17, 2024

glennhickey
Contributor

Changelog Entry

To be copied to the draft changelog by merger:

  • vg paths -n option added to normalize graphs using path information to "snap together" redundant paths through snarls. After running, no two path traversals through a snarl will ever produce the same sequence string without the traversals themselves being identical.

Description

As mentioned here, AT fields in deconstructed VCFs can be wrong: they do not always reflect the actual path in the graph, but rather an equivalent path (one that produces the same DNA sequence) from some other haplotype.

This PR adds an option to explicitly check a graph for these cases and remove them. The logic is:

  • for every snarl (visiting top-down)
    • identify all path traversals of the snarl (same logic as deconstruct) and determine which, if any, produce the same string.
    • of these equivalent paths, choose the selected path (from CLI options) if possible; otherwise use the path with the lowest name as the unique representation.
    • modify all other (redundant) paths through the snarl so they have identical traversals to the selected path
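
The steps above can be illustrated with a small Python sketch for a single snarl. This is not the vg implementation: the names `snap_traversals`, `is_normalized`, `NODE_SEQ` and the toy haplotype names are hypothetical, node orientation is ignored, and the sketch only rewrites an in-memory dictionary rather than editing a graph or visiting nested snarls top-down. "Lowest path name" is assumed here to mean plain string ordering.

```python
from collections import defaultdict

# Hypothetical toy snarl: node id -> sequence (orientation ignored).
NODE_SEQ = {1: "A", 2: "CG", 3: "C", 4: "G", 5: "T"}


def traversal_sequence(traversal, node_seq):
    """Concatenate node sequences along one traversal."""
    return "".join(node_seq[node] for node in traversal)


def snap_traversals(traversals, node_seq, preferred=None):
    """Return a copy of `traversals` (path name -> tuple of node ids)
    in which all paths spelling the same string share one traversal."""
    # Group path names by the string their traversal spells.
    groups = defaultdict(list)
    for name, trav in traversals.items():
        groups[traversal_sequence(trav, node_seq)].append(name)

    snapped = dict(traversals)
    for names in groups.values():
        # Use the preferred path if it is in this group,
        # otherwise fall back to the lowest path name (assumption:
        # plain string comparison).
        rep = preferred if preferred in names else min(names)
        for name in names:
            snapped[name] = traversals[rep]
    return snapped


def is_normalized(traversals, node_seq):
    """True if no two distinct traversals spell the same string."""
    seen = {}
    for trav in traversals.values():
        seq = traversal_sequence(trav, node_seq)
        if seen.setdefault(seq, trav) != trav:
            return False
    return True


traversals = {
    "hapA": (1, 2, 5),     # spells ACGT
    "hapB": (1, 3, 4, 5),  # also spells ACGT via different nodes
    "hapC": (1, 3, 5),     # spells ACT
}
snapped = snap_traversals(traversals, NODE_SEQ)
```

In this toy case `hapB` is rewritten to follow the same nodes as `hapA`, while `hapC` is left alone because no other path spells its string. Passing `preferred="hapB"` would instead snap `hapA` onto `hapB`'s traversal.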

I was a bit surprised how little this ended up changing the graph in the end (which I guess means cactus/abpoa/gfaffix are doing a pretty good job already). On hprc-mc-v1.1-grch38 the normalized graph has only

4,855    fewer nodes
15,816   fewer edges
78,922   fewer bases

So not much impact. But, while I don't have a log to count, the majority of paths snapped do not result in nodes/edges lost, so I think/hope the path representation is cleaned up more than these numbers indicate.

In any case, it's fast enough to run, and the fact that it guarantees correct AT fields in the VCF seems like good enough reason to run it by default in minigraph-cactus...


2 participants