WIP [refine] unique internal node names #1451

Draft
wants to merge 3 commits into
base: james/export-multitree

@jameshadfield (Member) commented Apr 24, 2024

This builds on #1450 to rename the internal nodes of each tree so that their names are unique across trees. A pipeline which produces multiple trees and exports them together needs unique internal node names.

Ideally this code would be upstream in TreeTime, but it's ok within refine IMO.

Not intending to merge this PR as is, but I'm actively using it in pipelines which use multi-trees, so I'm putting it here for 👀 and discussion.

Checking for duplicated node names and missing node names is in line
with the schema. Previously some calls to `export v2` would be ok with
missing node names (e.g. see the updated tests in `minify-output.t`) but
any usage with metadata would result in an uncaught error.

Multiple trees ("subtrees") have been available in Auspice since late
2021¹ and part of the associated schema since early 2022². Despite this
there was no way to produce such datasets within Augur itself, and
despite the schema changes the associated `augur validate` command was
never updated to allow them.

This commit adds multi-tree inputs to `augur export v2` as well as
allowing them to validate with our associated validation commands.

¹ <nextstrain/auspice#1442>
² <#851>

Needed for pipelines which will produce multiple trees via `augur
refine` and then supply these trees to `augur export v2`
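
As a rough illustration of the duplicated/missing node name check mentioned in the commit messages above, here is a minimal sketch. It is not Augur's implementation: the trees are assumed to be nested dicts with `name` and `children` keys, and `iter_nodes` and `check_node_names` are hypothetical helper names.

```python
from collections import Counter


def iter_nodes(node):
    """Yield every node of a nested-dict tree in pre-order (assumed structure)."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def check_node_names(trees):
    """Return (number of nodes without a name, sorted list of duplicated names)."""
    names = []
    missing = 0
    for tree in trees:
        for node in iter_nodes(tree):
            name = node.get("name")
            if not name:
                missing += 1
            else:
                names.append(name)
    duplicated = sorted(n for n, count in Counter(names).items() if count > 1)
    return missing, duplicated


# Two hypothetical trees which share an internal node name; one tip has no name.
tree_a = {"name": "NODE_0000000", "children": [{"name": "A"}, {"name": "B"}]}
tree_b = {"name": "NODE_0000000", "children": [{"name": "C"}, {}]}

missing, duplicated = check_node_names([tree_a, tree_b])
```

Here `missing` counts the unnamed tip in `tree_b`, and `duplicated` contains the root name shared by both trees, which is the situation the renaming in this PR is meant to avoid.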
Comment on lines +353 to +356
id = hashlib.sha256("".join(terminals).encode('utf-8')).hexdigest()[0:7]
def rename(name):
    if name not in internals: return name
    return f"NODE_{id}_{name.split('_')[1]}"
Member

The hashing of the terminals of the tree feels arbitrary and not driven by the actual properties of the hash, esp. given the truncation of it. For example, two trees with the same terminals but different structures will collide. Why not simply produce unique ids for each node instead? (I still think we should do that.)
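
The collision described in this comment can be reproduced with a small sketch that mirrors the snippet under review. The assumptions are labeled: the terminal names are taken to be collected in the same order for both trees (sorted here), the internal names follow a `NODE_0000000` pattern, and `tree_prefix` and the three-argument `rename` are hypothetical stand-ins for the code in the diff.

```python
import hashlib


def tree_prefix(terminals):
    """Truncated SHA-256 of the concatenated terminal names, as in the reviewed lines."""
    return hashlib.sha256("".join(terminals).encode("utf-8")).hexdigest()[0:7]


def rename(name, internals, prefix):
    """Leave terminal names alone; add the per-tree prefix to internal names."""
    if name not in internals:
        return name
    return f"NODE_{prefix}_{name.split('_')[1]}"


# Two hypothetical trees with the same tips but different topologies,
# e.g. ((A,B),C) and (A,(B,C)). Assumption: terminals are sorted before hashing.
terminals_1 = sorted(["A", "B", "C"])
terminals_2 = sorted(["C", "B", "A"])
internals = {"NODE_0000000", "NODE_0000001"}

renamed_1 = [rename(n, internals, tree_prefix(terminals_1)) for n in sorted(internals)]
renamed_2 = [rename(n, internals, tree_prefix(terminals_2)) for n in sorted(internals)]

# The prefix depends only on the tip names, so both trees get identical internal names.
collisions = set(renamed_1) & set(renamed_2)
```

Under these assumptions `collisions` contains every renamed internal node, because the topology never enters the hash. Whether this matters in practice depends on whether two exported trees can ever share their full set of terminals.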

Member Author

> For example, two trees with the same terminals but different structures will collide.

This functionality is motivated by multi-trees where terminal node names cannot be shared across trees.

@jameshadfield jameshadfield force-pushed the james/export-multitree branch 2 times, most recently from ddba55a to c40b821 Compare May 6, 2024 01:01