parsenewick can't parse Newick files with support values instead of node names #74

Open
diegozea opened this issue May 23, 2022 · 8 comments

@diegozea

Hi! Reading files from PhyML and FastTree like the following one gives an error with Phylo when support values are duplicated in the tree:

(A9CIT6:0.59280764,P51981:0.55926221,(Q8A861:0.99105703,((Q81IL5:0.76431643,((((A1B198:0.94287572,(Q2U1E8:0.71410953,Q5LT20:0.55480466)0.975579:0.21734008)1.000000:0.55075633,(Q92YR6:1.11236546,(((Q13PB7:1.33807955,(Q161M1:1.17720944,A1AYL4:0.93440931)0.784619:0.18922325)0.878426:0.09769089,((((Q1ASJ4:1.28537785,(A8LS88:0.87558406,(C1DMY1:0.14671933,(P77215:0.02112667,Q8ZNF9:0.01593493)1.000000:0.35900384)1.000000:0.81055398)0.999947:0.27041496)0.403932:0.04809748,(A4YVM8:1.35286455,(Q9RKF7:0.83804265,((Q8ZL58:0.21550115,Q12GE3:0.23170031)1.000000:0.65091551,(Q7CU39:0.71681321,(Q8P3K2:0.27030998,(Q1GLV3:0.34268834,Q7L5Y1:0.42965239)0.843769:0.09133540)1.000000:1.04593860)0.978319:0.19792131)0.995516:0.18393058)0.684889:0.10148106)0.965570:0.12638685)0.999013:0.10597436,(((A0A0H3LM82:1.53892471,(O06741:1.56982104,(G0L7B8:0.68911617,A9CEQ8:0.63642012)0.999148:0.26000097)0.760390:0.07007390)0.534387:0.04760860,(A0A0H3LT39:0.95322505,(Q3HKK5:1.65509086,(A8H7M5:1.21743086,A8H9D1:0.47372214)0.711619:0.14571049)0.992889:0.20930969)0.824619:0.10359095)0.957749:0.09170993,(((Q8ZNH1:0.35062372,Q5NN22:0.45517287)1.000000:0.51175191,(Q7D1T6:0.17783006,Q63IJ7:0.15483880)1.000000:0.87953156)0.879253:0.11535090,(Q28RT0:1.01784576,(B9JNP7:1.11261669,(B2UCA8:0.76348582,(((A6M2W4:0.21565444,A4W7D6:0.17558479)1.000000:0.37879482,(D4GJ14:0.06604958,(C6CBG9:0.01850349,B5R541:0.03323447)0.999985:0.05738427)1.000000:0.29155266)1.000000:0.31191076,((C6D9S0:0.03964480,Q8FHC7:0.03209453)1.000000:0.13486764,((A4XF23:0.22295702,(Q1QT89:0.18122799,B3PDB1:0.14414146)0.999998:0.06015729)0.968272:0.04285536,(B0T0B1:0.05224760,Q9AAR4:0.07234240)1.000000:0.14812779)0.999678:0.08813897)1.000000:0.36147407)1.000000:0.47112287)0.928316:0.13341811)0.550875:0.06620266)0.961209:0.13640648)1.000000:0.27744895)0.988757:0.09058974)0.866782:0.06631759,(A9CL63:0.35266112,Q92ZS5:0.19783599)1.000000:1.18369094)0.402267:0.02752454)0.999888:0.19319607,(Q9F3A5:0.58548261,((Q3KB33:0.33553686,(A0R5B5:0.21484600,D6Y7Y
6:0.27848012)1.000000:0.24287080)0.942008:0.15046146,(Q1QUN0:0.25266568,(C0WBB5:0.25833170,(P0AES2:0.09059530,A6VQF1:0.05673552)1.000000:0.13075826)0.999738:0.14696003)1.000000:0.81155149)1.000000:0.47139015)1.000000:0.50975221)0.966695:0.13031427)0.863808:0.08043994)0.973288:0.08656638,(Q5P025:0.97873157,(C5CFI0:1.08126948,((A0A0H3KH80:0.00000001,(A0A0H2WWB5:0.00000001,Q53635:0.00000001)-1.000000:0.00000001)-1.000000:1.77379709,((Q838J7:0.42816989,Q927X3:0.43775742)1.000000:0.24306201,(Q5SJX8:0.32151423,Q9RYA6:0.31040549)1.000000:0.38120655)0.761411:0.11121216)0.449066:0.07920285)1.000000:0.65145569)0.934431:0.11633312)0.809578:0.05916034,(A0QTN8:1.01771865,((Q8DJP8:2.14313928,(Q8NN12:0.75309341,(P05404:0.60462842,(Q4K9X1:0.32854393,A6T9N5:0.34558716)1.000000:0.22924230)0.999793:0.18933082)0.610131:0.11821100)0.811426:0.14174769,(A8HTB8:0.54448819,(Q5LM96:0.23356964,Q28SI7:0.15669417)1.000000:0.47534595)1.000000:0.49079583)0.999955:0.20013288)0.717358:0.06863833)0.729022:0.08518511)0.998543:0.11833475,(Q607C7:1.08575204,(((O34508:0.48939847,B0TZW0:1.11933860)0.879395:0.12702564,(Q9WXM1:0.80769788,(A9B055:0.36657533,(A5UXJ3:0.34547016,A9GEI3:0.26128229)0.974762:0.12479711)1.000000:0.52751375)1.000000:0.33586771)0.610795:0.07157613,(Q11T61:0.92177095,Q834W6:0.44934225)0.999732:0.14429535)0.996719:0.09875344)0.787853:0.04813874)0.983918:0.24536033)1.000000:0.66916649);

Cheers,

@diegozea
Author

In case it is helpful to someone, I'm using this quick-and-dirty trick to get the files parsed:

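function _make_internal_names_unique(newick_str)
	# The byte indexing below assumes one byte per character.
	@assert isascii(newick_str)
	# Match "support:length" pairs such as "0.975579:0.21734008".
	internal_nodes = findall(r"\d+\.\d+:\d+\.\d+", newick_str)
	new_str = String[]
	previous = 1
	for node in internal_nodes
		a = first(node)
		b = last(node)
		# Copy the text between the previous match and this one.
		push!(new_str, newick_str[previous:a-1])
		_, dist = split(newick_str[a:b], ':')
		# Replace the support value with a random 20-letter label.
		push!(new_str, join(rand('a':'z', 20)))
		push!(new_str, ":")
		push!(new_str, dist)
		# Resume just after the match so no character is copied twice.
		previous = b + 1
	end
	push!(new_str, newick_str[previous:end])
	replace(join(new_str), "\n" => "")
end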

tree = parsenewick(_make_internal_names_unique(newick_str))

@cossio
Contributor

cossio commented Nov 18, 2022

Thanks @diegozea ! Could this be added in some form to the package?

@richardreeve
Member

richardreeve commented Dec 20, 2023

Hi @diegozea - I'm just coming back to issues related to parsing files. The problem is that, as far as I can tell, this isn't legal Newick format, and there's no real way of detecting that it has been done either (other than the normal Newick parsing failing). There is an extension that allows support values, Rich Newick, but it looks like (A,B):length:prob, with the number after the second : being the support value, not like this. Do you know why PhyML and FastTree use this different format, and/or where it's defined? To be honest, there are so many of these incompatible formats that I'm not sure how I can help beyond supporting Nexus comments / BEAST metacomments, which allow any kind of additional information to be stored in the form [&support=0.974].
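
To illustrate the metacomment form mentioned above, here is a minimal sketch that rewrites numeric internal labels as [&support=...] comments. The helper name `labels_to_metacomments` is hypothetical, and whether a given parser accepts a comment in this position is an assumption that would need checking:

```julia
# Hypothetical helper, not part of Phylo.jl: move numeric internal node labels
# (PhyML/FastTree style) into BEAST-style metacomments.
function labels_to_metacomments(newick_str::AbstractString)
    # A numeric label is one that directly follows ")" and is directly
    # followed by a structural character (":", ",", ";" or ")").
    pattern = r"\)(-?\d+(?:\.\d+)?)(?=[:,;)])"
    return replace(newick_str, pattern => s")[&support=\1]")
end
```

For example, `labels_to_metacomments("((A:0.1,B:0.2)0.97:0.3,C:0.4);")` gives `"((A:0.1,B:0.2)[&support=0.97]:0.3,C:0.4);"`. Unlike random relabelling, this keeps the support values in the string.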

@diegozea
Author

Hi @richardreeve,

I'm not certain why they use that format. Perhaps we could ask @stephaneguindon from PhyML to shed some light on the matter.

Best regards,

@stephaneguindon

Hi,
PhyML does indeed use internal node labels in Newick trees to output internal branch supports. We most likely made that choice because PHYLIP already did so. I agree that it is far from perfect and that better options are available, as noted by @richardreeve. Yet any Newick tree parser should handle internal node labels, and I do not see any good reason to prohibit duplicates among these labels.
Best regards,
-Stéphane-

@richardreeve
Member

richardreeve commented Feb 1, 2024

Oh, I see - so you're saying that the support value is really intended to be the node label. That's interesting - it hadn't occurred to me that might be what was going on. You're right that duplicates aren't explicitly forbidden by the standard, but I don't think we're going to be able to support them, for now at least, because we store node (meta)data as key-value pairs in a Dict, which doesn't support duplicate keys. I'll have a think about it though.

@diegozea
Author

Hi!

I was thinking that a trick to make the string identifier unique without compromising the numerical value is to add a trailing 0 to the identifier if it is a number and is already present. For example, if "100.0" is already in the dictionary, we change the name to "100.00". What do you think?

Best regards.
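
The trailing-zero idea above could be sketched roughly as follows. The function name `unique_numeric_label!` and the use of a `Set` of already-seen labels are assumptions for illustration, not Phylo.jl internals; labels in exponent notation are deliberately left alone, because appending a zero would change their value:

```julia
# Hypothetical helper, not part of Phylo.jl: return a label that is unique
# among `seen` and still parses to the same number.
function unique_numeric_label!(seen::Set{String}, label::AbstractString)
    name = String(label)
    # Only plain decimal labels (e.g. "100", "100.0", "-1.000000") are changed.
    if occursin(r"^-?\d+(\.\d*)?$", name)
        while name in seen
            # "100.0" becomes "100.00"; an integer such as "100" becomes "100.0".
            name = occursin('.', name) ? name * "0" : name * ".0"
        end
    end
    push!(seen, name)
    return name
end
```

Calling it three times with "100.0" on the same set yields "100.0", "100.00" and "100.000", all of which parse to 100.0. Non-numeric duplicate labels are returned unchanged, so they would still need separate handling.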

@richardreeve
Member

I'm totally happy with this as a stopgap if anyone wants to create a PR? Hint, hint...

Earlier in the year I started writing a completely new parser using Automa as the tokenizer, and then a new handwritten parser (since Automa can't handle recursive structures). The way that was (is!) going to work is to have a separate parser for this dialect of Newick, which just uses the label directly as the support value. Unfortunately other work has overwhelmed this, so I haven't got close enough to completion to do a PR.
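
As a rough illustration of a parser for this dialect that reads the internal label directly as the support value, here is a small handwritten recursive-descent sketch. It is not the Automa-based parser described above; the `SketchNode` type and all function names are hypothetical, and it assumes ASCII input with no quoted labels or comments:

```julia
# Hypothetical node type for illustration; not Phylo.jl's data structure.
struct SketchNode
    name::Union{String,Nothing}       # leaf name, nothing for internal nodes
    support::Union{Float64,Nothing}   # internal label read as a support value
    length::Union{Float64,Nothing}    # branch length to the parent, if given
    children::Vector{SketchNode}
end

# Read characters from position i up to the next structural delimiter.
function read_token(s::AbstractString, i::Int)
    j = i
    while j <= ncodeunits(s) && !(s[j] in "(),:;")
        j += 1
    end
    return String(s[i:j-1]), j
end

# Parse one node (and, recursively, its children) starting at position i.
function parse_node(s::AbstractString, i::Int)
    children = SketchNode[]
    if s[i] == '('
        i += 1                                  # skip '('
        while true
            child, i = parse_node(s, i)
            push!(children, child)
            s[i] == ',' || break
            i += 1                              # skip ','
        end
        s[i] == ')' || error("expected ')' at position $i")
        i += 1                                  # skip ')'
    end
    label, i = read_token(s, i)
    len = nothing
    if i <= ncodeunits(s) && s[i] == ':'
        tok, i = read_token(s, i + 1)
        len = parse(Float64, tok)
    end
    if isempty(children)
        return SketchNode(label, nothing, len, children), i
    end
    # Internal node: the label is a support value, so duplicates are harmless.
    support = isempty(label) ? nothing : parse(Float64, label)
    return SketchNode(nothing, support, len, children), i
end

function parse_support_newick(text::AbstractString)
    s = replace(text, r"\s+" => "")             # drop line breaks and spaces
    isascii(s) || error("this sketch assumes ASCII input")
    root, _ = parse_node(s, 1)
    return root
end
```

Because the support value is stored as a field on each node rather than used as a dictionary key, two internal nodes with the same support (e.g. 1.000000) do not clash.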
