
Consider how to load in BEAST trees with compact descriptions of a history #42

Open
matsen opened this issue Nov 29, 2022 · 9 comments

@matsen
Contributor

matsen commented Nov 29, 2022

Thanks to @jsigao, we now understand that BEAST can output compact history descriptions that look like:

[&rate=0.760250317223809,mutations=2.0,history_all={{13163,0.016625667248442385,A,G},{23756,0.020319042566174454,T,G}}]

We'd like to be able to load these and build a history DAG from them.

Here is a zipped-up file with the original .trees file from BEAST, along with a file that has those annotations converted to NHX like so:

import dendropy

dp_trees = dendropy.TreeList.get(path="clade_13.GTR.history.trees", schema="nexus")
nhx_string = dp_trees.as_string(schema="newick", suppress_annotations=False, annotations_as_nhx=True, suppress_rooting=True)

with open("clade_13.GTR.history.nhx", "w") as fp:
    fp.write(nhx_string)

I then went in by hand and deleted the first comment on each line, since ete3 didn't like it.
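(That hand-editing step could also be scripted; here's a hypothetical sketch, assuming the unwanted comment is the first `[...]` block on each newick line:)

```python
import re

# Hypothetical sketch: instead of hand-editing, strip the first bracketed
# comment on each newick line. Assumes the unwanted comment is the first
# "[...]" block on the line.
def strip_first_comment(line):
    return re.sub(r"\[[^\]]*\]", "", line, count=1)

line = "[&R] ((A[&&NHX:rate=1.0]:1,B:1):1);"
print(strip_first_comment(line))  # -> " ((A[&&NHX:rate=1.0]:1,B:1):1);"
```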

However, I wasn't able to get these immediately loaded into ete3 using something like:

import ete3

with open("clade_13.GTR.history.nhx", "r") as fp:
    ete_trees = [ete3.TreeNode(nhx_string, quoted_node_names=True) for nhx_string in fp]

So I'm afraid this isn't the tidiest issue, but hopefully it's a start.


Before diving into any particular approach, let's keep the big picture in mind. We want to be able to load these trees into the hDAG. We could

  1. have methods that would parse a dendropy tree into a MAD
  2. have an external script that would convert a dendropy tree into a MAD protobuf
  3. do what we're doing here: load an NHX parseable by ete3 and then turn those into a DAG, perhaps after writing to a protobuf?
@willdumm
Collaborator

I just had a look and noticed that the newicks exported by dendropy have commas in the extended newick field (for example, in history_all):

[&&NHX:rate=1.0:mutations=7.0:history_all={{11029,0.03866966440113948,T,A}}]

I've run into this before... Unfortunately, the ete newick parser treats all commas in a newick string as node separators, and has no concept of context for commas appearing in the extended newick field.

The workaround is to convert these fields to a string format that's compatible with ete before the dendropy export. For example, if the node attribute history_all started off as a set, ete would export this node annotation as:

[&&NHX:rate=1.0:mutations=7.0:history_all=11029|0.03866966440113948|T|A]

However, if you load the newick containing this annotation back into ete, the node attribute history_all would now contain the string "11029|0.03866966440113948|T|A".
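As an illustration of that workaround, here's a hypothetical pre-export rewrite. The `|` and `+` delimiters and the function name are made up for this sketch; it assumes the raw annotation looks like `history_all={{pos,time,from,to},...}` and that history_all is the last field in the comment:

```python
import re

# Hypothetical pre-export rewrite: replace the brace/comma-delimited
# history_all value with a pipe-joined string that won't confuse ete3's
# comma handling. Assumes history_all is the last field in the comment,
# since the regex matches greedily to the final closing brace.
def pipe_join_history(comment):
    m = re.search(r"history_all=\{(.*)\}", comment)
    if m is None:
        return comment
    # events look like {pos,time,from,to},{pos,time,from,to},...
    events = m.group(1).strip("{}").split("},{")
    joined = "+".join(e.replace(",", "|") for e in events if e)
    return comment[: m.start()] + "history_all=" + joined + comment[m.end():]

print(pipe_join_history("&rate=1.0,history_all={{12,1.1,A,G},{1,0.99,A,C}}"))
# -> &rate=1.0,history_all=12|1.1|A|G+1|0.99|A|C
```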

I think the easiest thing would be to add dendropy tree loading to historydag.

@matsen
Contributor Author

matsen commented Nov 30, 2022

K, thanks for looking into this. We can put it on ice for the time being.

@willdumm
Collaborator

willdumm commented Dec 3, 2022

#47 provides a generalized dag.history_dag_from_trees function that can load dendropy trees from a list treelist with something like:

dag = hdag.history_dag_from_trees(
    [tree.seed_node for tree in treelist],
    [],
    label_functions={'sequence': lambda node: node.annotations.get_value('history_all')},
    child_node_func=dendropy.Node.child_nodes,
    leaf_node_func=dendropy.Node.leaf_iter,
)

However, it looks to me like these are mutations relative to the parent node, not relative to a fixed reference? If so, we're going to need the reference sequence, and probably also the complete leaf sequences, if they contain any ambiguities.

I'll look into this more next week.

@matsen
Contributor Author

matsen commented Dec 3, 2022 via email

@willdumm
Collaborator

willdumm commented Dec 8, 2022

Here's a working script to load these trees:

import historydag as hdag
from functools import lru_cache
import dendropy

# dendropy doesn't parse nested lists correctly in metadata, so we load the
# trees with raw comment strings by setting `extract_comment_metadata=False`
dp_trees = dendropy.TreeList.get(
    path="clade_13.GTR.history.trees", schema="nexus", extract_comment_metadata=False
)

# This should be replaced with the true reference genome.
reference_genome = "A" * 30000
reference_cg = hdag.compact_genome.CompactGenome({}, reference_genome)


def comment_parser(node_comments):
    if len(node_comments) == 0:
        yield from ()
        return
    elif len(node_comments) == 1:
        comment_string = node_comments[0]
    else:
        raise ValueError("node_comments has more than one element" + str(node_comments))
    if "history_all=" not in comment_string:
        yield from ()
        return
    else:
        mutations_string = comment_string.split("history_all=")[-1]
        # strip the enclosing '{{' and '}}'
        stripped_mutations_list = mutations_string[2:-2]
        if stripped_mutations_list:
            mutations_list = stripped_mutations_list.split("},{")
            for mut in mutations_list:
                try:
                    idx_str, _, from_base, to_base = mut.split(",")
                except ValueError:
                    raise ValueError("comment_parser failed on: " + str(node_comments))
                assert to_base in 'AGCT'
                yield from_base + idx_str + to_base
        else:
            yield from ()
            return


_comment_parser_test_data = [
    (["&rate=1.0,mutations=0.0"], []),
    (["&rate=1.0,mutations=0.0,history_all={}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{}}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G}}"], ["A12G"]),
    (
        ["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G},{1,0.99,A,C}}"],
        ["A12G", "A1C"],
    ),
]

for in_data, expected in _comment_parser_test_data:
    parsed_data = list(comment_parser(in_data))
    if parsed_data != expected:
        raise RuntimeError(
            f"comment_parser: unexpected output {parsed_data} from input: {in_data}. Expected {expected}"
        )


@lru_cache(maxsize=(2 * len(dp_trees[0].nodes())))
def compute_cg(node):
    if node.parent_node is None:
        # base case: node is a root node
        parent_cg_mut = reference_cg
    else:
        parent_cg_mut = compute_cg(node.parent_node)
    return parent_cg_mut.apply_muts(comment_parser(node.comments))


dag = hdag.history_dag_from_trees(
    [tree.seed_node for tree in dp_trees],
    [],
    label_functions={
        "compact_genome": compute_cg,
    },
    attr_func=lambda n: {
        "name": (n.taxon.label if n.is_leaf() else "internal"),
    },
    child_node_func=dendropy.Node.child_nodes,
    leaf_node_func=dendropy.Node.leaf_iter,
)

dag = hdag.mutation_annotated_dag.CGHistoryDag.from_history_dag(dag)
dag.summary()

The output from dag.summary:

<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Nodes:  315
Edges:  315
Histories:      3
Unique leaves in DAG:   159
Leaf node count range: 53 to 53
Smallest history has 105 nodes, largest history has 105 nodes
Average number of parents of internal nodes:    1.0
Parsimony score range 421 to 586

I'll open a dendropy issue to get correct parsing of these newicks figured out.

Note that just counting the mutations along all the branches of the BEAST trees gives wildly different (larger) parsimony scores, because there are lots of mutation reversions on single branches.
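A toy illustration of that reversion effect, with a made-up branch history (a reverted pair at one site counts as two events but zero net change):

```python
# Toy illustration (made-up branch history): BEAST records every event on
# a branch, but a reverted pair at one site contributes no net change.
branch_events = [("A", 100, "G"), ("G", 100, "A"), ("C", 200, "T")]

raw_count = len(branch_events)  # what naive per-branch counting sees

# Net change per site: remember the first "from" state and the last "to"
# state at each site, then compare.
start, end = {}, {}
for from_base, site, to_base in branch_events:
    start.setdefault(site, from_base)
    end[site] = to_base
net_count = sum(1 for site in start if start[site] != end[site])

print(raw_count, net_count)  # 3 events, but only 1 net change
```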

@willdumm
Collaborator

willdumm commented Dec 8, 2022

Issue opened jeetsukumaran/DendroPy#145

@willdumm
Collaborator

willdumm commented Dec 29, 2022

Here's a modified script that fixes the reference sequence to be the sequence associated with the leaf England/CAMC-F00E8F/2021|2021-01-07. However, right now this sequence is assumed to be all A's. Since this is certainly incorrect, reverted mutations on other branches are causing all of the other leaves to have different compact genomes on each of the trees. Having the true sequence for this reference leaf might fix that...

import historydag as hdag
from functools import lru_cache
import dendropy
import xml.etree.ElementTree as ET

# get alignment from xml:
_etree = ET.parse('clade_13.GTR.xml')
_alignment = _etree.getroot().find('alignment')
fasta = {a[0].attrib['idref'].strip(): a[0].tail.strip() for a in _alignment}

# dendropy doesn't parse nested lists correctly in metadata, so we load the
# trees with raw comment strings by setting `extract_comment_metadata=False`
dp_trees = dendropy.TreeList.get(
    path="clade_13.GTR.history.trees", schema="nexus", extract_comment_metadata=False
)

# Reconstruct the reference from the alignment sequence of a chosen reference leaf.
reference_leaf = 'England/CAMC-F00E8F/2021|2021-01-07'
reference_leaf_sequence = fasta[reference_leaf]
reference_leaf_cg = hdag.compact_genome.CompactGenome({}, reference_leaf_sequence)



# ### Comment parser and testing
def comment_parser(node_comments):
    if len(node_comments) == 0:
        yield from ()
        return
    elif len(node_comments) == 1:
        comment_string = node_comments[0]
    else:
        raise ValueError("node_comments has more than one element" + str(node_comments))
    if "history_all=" not in comment_string:
        yield from ()
        return
    else:
        mutations_string = comment_string.split("history_all=")[-1]
        # strip the enclosing '{{' and '}}'
        stripped_mutations_list = mutations_string[2:-2]
        if stripped_mutations_list:
            mutations_list = stripped_mutations_list.split("},{")
            for mut in mutations_list:
                try:
                    idx_str, _, from_base, to_base = mut.split(",")
                except ValueError:
                    raise ValueError("comment_parser failed on: " + str(node_comments))
                assert to_base in 'AGCT'
                yield from_base + idx_str + to_base
        else:
            yield from ()
            return


_comment_parser_test_data = [
    (["&rate=1.0,mutations=0.0"], []),
    (["&rate=1.0,mutations=0.0,history_all={}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{}}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G}}"], ["A12G"]),
    (
        ["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G},{1,0.99,A,C}}"],
        ["A12G", "A1C"],
    ),
]

for in_data, expected in _comment_parser_test_data:
    parsed_data = list(comment_parser(in_data))
    if parsed_data != expected:
        raise RuntimeError(
            f"comment_parser: unexpected output {parsed_data} from input: {in_data}. Expected {expected}"
        )
# ### End comment parser testing

def compute_cgs(tree):
    ref_node = tree.find_node_with_taxon_label(reference_leaf)
    ref_cg = reference_leaf_cg
    # Walk from the reference leaf up to the root, un-applying each
    # branch's mutations to recover the root sequence.
    while True:
        ref_cg = ref_cg.apply_muts(list(comment_parser(ref_node.comments)), reverse=True)
        ref_node = ref_node.parent_node
        if ref_node is None:
            break

    @lru_cache(maxsize=(2 * len(dp_trees[0].nodes())))
    def compute_cg(node):
        if node.parent_node is None:
            # base case: node is a root node
            parent_cg_mut = ref_cg
        else:
            parent_cg_mut = compute_cg(node.parent_node)
        return parent_cg_mut.apply_muts(comment_parser(node.comments))

    for node in tree.preorder_node_iter():
        node.cg = compute_cg(node)

for tree in dp_trees:
    compute_cgs(tree)

dag = hdag.history_dag_from_trees(
    [tree.seed_node for tree in dp_trees],
    [],
    label_functions={
        "compact_genome": lambda n: n.cg,
    },
    attr_func=lambda n: {
        "name": (n.taxon.label if n.is_leaf() else "internal"),
    },
    child_node_func=dendropy.Node.child_nodes,
    leaf_node_func=dendropy.Node.leaf_iter,
)

dag = hdag.mutation_annotated_dag.CGHistoryDag.from_history_dag(dag)
dag.summary()

@willdumm
Collaborator

^ My last comment has now been edited to load the alignment from the BEAST XML file, and to use the appropriate sequence as the starting point for reconstructing the reference. More work needs to be done, though, since all of the leaves are ambiguous and are resolved differently in each tree. I haven't completely decided how to resolve this, but I'll come back to it tomorrow.

@willdumm
Collaborator

I've included a generalized version of this, with fixes thanks to Jiansi, in the historydag.beast_loader submodule on the wd-load-beast-trees branch. This may still be a work in progress for a while, but it seems to work as expected at the moment, and correctly reconstructs trees' reference sequences.

In addition to using the most recent state of that branch, you must have dendropy installed to use the feature.

Example use, equivalent to the scripts above (with some additional fixes):

import historydag
dag = historydag.beast_loader.dag_from_beast_trees('clade_13.GTR.xml', 'clade_13.GTR.history.trees')
dag.summary()

Now the output looks like this:

<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Nodes:  281
Edges:  299
Histories:      3
Unique leaves in DAG:   133
Average number of parents of internal nodes:    1.027027027027027

In histories in the DAG:
Leaf node count range: 53 to 53
Total node count range: 105 to 105
Average pairwise RF distance:   177.33333333333334
Parsimony score range 328 to 474

So, we still have some different realizations of the same leaf, but not as bad as before.

The key insight for correctly loading these trees was that BEAST ignores sites that are all N or ?, so the site indices of mutations refer to the sequence with those sites removed.

In addition, I remove sites that are all - (or some combination of only N, ?, and -), because strange mutations seem to happen at such sites, and we're treating - as N anyway.
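A minimal sketch of that coordinate fix (the helper name is hypothetical, and indices are 0-based here for simplicity):

```python
# Hypothetical sketch of the coordinate fix: BEAST drops columns that are
# entirely N/?/-, so a mutation's site index refers to the alignment with
# those columns removed. Build a map from masked index back to the
# original column (0-based here for simplicity).
def masked_to_original(columns_kept):
    """columns_kept: one boolean per original alignment column."""
    return [i for i, kept in enumerate(columns_kept) if kept]

alignment = ["ANA", "GNC", "ANT"]  # column 1 is all-N and gets dropped
kept = [any(seq[i] not in "N?-" for seq in alignment) for i in range(3)]
index_map = masked_to_original(kept)
# a mutation reported at masked site 1 is really at original column 2
print(index_map[1])
```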
