parsenewick can parse Newick files with support values instead of node names #74
Comments
In case it is helpful to someone, I'm using this quick and dirty trick to get the files parsed:

    function _make_internal_names_unique(newick_str)
        # Byte indexing below is only safe for ASCII input.
        @assert isascii(newick_str)
        # Match "support:branch_length" pairs such as "0.95:0.012".
        internal_nodes = findall(r"\d+\.\d+\:\d+\.\d+", newick_str)
        new_str = String[]
        previous = 1
        for node in internal_nodes
            a = first(node)
            b = last(node)
            # Copy the text between the previous match and this one.
            push!(new_str, newick_str[previous:a-1])
            _, dist = split(newick_str[a:b], ':')
            # Replace the support value with a random 20-letter name.
            push!(new_str, join(rand('a':'z', 20)))
            push!(new_str, ":")
            push!(new_str, dist)
            # Resume just after the match.
            previous = b + 1
        end
        push!(new_str, newick_str[previous:end])
        replace(join(new_str), "\n" => "")
    end

    tree = parsenewick(_make_internal_names_unique(newick_str))
|
Thanks @diegozea! Could this be added in some form to the package? |
Hi @diegozea - I'm just coming back to issues related to parsing files. The problem is that this isn't legal Newick format as far as I can tell, and there's no real way of detecting that it has been done either (other than normal Newick parsing failing). There is an extension that allows support values, Rich Newick, but it looks like |
Hi @richardreeve, I'm not certain why they use that format. Perhaps we could seek the opinion of @stephaneguindon from PhyML to shed some light on the matter. Best regards, |
Hi, |
Oh, I see - so you're saying that the support value is really intended to be the node label. That's interesting - it hadn't occurred to me that might be what was going on. You're right that duplicates aren't explicitly forbidden by the standard, but I don't think we're going to be able to support them for now at least, because we store node (meta)data as key-value pairs in a Dict, which doesn't support duplicate keys. I'll have a think about it though. |
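To illustrate the Dict limitation mentioned above, here is a minimal sketch. The `node_data` layout and the `"length"` key are assumptions made up for this example and are not Phylo.jl's actual internal storage; the point is only that a second node with the same label overwrites the first.

```julia
# Hypothetical storage keyed by node label (not Phylo.jl's real data structure).
node_data = Dict{String, Dict{String, Any}}()

# Two internal nodes that share the support value "100.0" as their label.
for (label, branch_length) in [("100.0", 0.1), ("100.0", 0.2)]
    # The second assignment replaces the first, because keys are unique.
    node_data[label] = Dict{String, Any}("length" => branch_length)
end

println(length(node_data))
```

Only one entry survives, so the metadata of the first node is lost unless the labels are made unique beforehand.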
Hi! I was thinking that a trick to make the string identifier unique without compromising the numerical value is to add a trailing 0 to the identifier if it is a number and is already present. For example, if "100.0" is already in the dictionary, we change the name to "100.00". What do you think? Best regards. |
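The trailing-zero idea above could look roughly like the sketch below. The function name `unique_support_label` and the `seen` set are hypothetical and not part of Phylo.jl, and the sketch assumes support values are written in plain decimal notation (no exponents or signs). Labels without a decimal point get ".0" appended first, since adding a bare "0" to "100" would change its value.

```julia
# Hypothetical helper (not part of Phylo.jl): make a numeric label unique by
# appending zeros after the decimal point, which leaves its value unchanged.
function unique_support_label(label::AbstractString, seen::Set{String})
    new_label = String(label)
    # Assumption: only plain decimal labels such as "95" or "100.0" are rewritten.
    occursin(r"^\d+(\.\d+)?$", new_label) || return new_label
    while new_label in seen
        # Integers need a decimal point before zeros can be appended safely.
        new_label *= occursin('.', new_label) ? "0" : ".0"
    end
    push!(seen, new_label)
    return new_label
end

seen = Set{String}()
labels = [unique_support_label(l, seen) for l in ["100.0", "100.0", "95", "95", "100.0"]]
println(labels)
```

Each rewritten label still parses to the same number as the original, so the support value can be recovered with `parse(Float64, label)`.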
I'm totally happy with this as a stopgap if anyone wants to create a PR? Hint, hint... Earlier in the year I started writing a completely new parser using Automa as the tokenizer, and then a new handwritten parser (since Automa can't handle recursive structures). The way that was (is!) going to work is to have a separate parser for this dialect of Newick, which just uses the label directly as the support value. Unfortunately other work has overwhelmed this, so I haven't got close enough to completion to do a PR. |
Hi! Reading files from PhyML and FastTree like the following one gives an error with Phylo when support values are duplicated in the tree:
Cheers,