input sequences order dependence #139
Comments
Small update report, writing it here for the record:
An easy solution would be to reorder the fasta records here, based on the fasta record id (and maybe some hashing of the sequences to break ties in case of equal ids?). We would also have to re-assign the index property of the fasta records, so that path ids are inherited accordingly. I tested it, and this does guarantee that we always obtain the same graph. The drawback is that we lose information on the order of sequences in the input fasta file, so when reconstructing the input fasta file from the graph we can no longer guarantee that sequences are in the same order.
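For the record, here is a minimal sketch of the reordering idea in Python. This is not pangraph's actual code: the `FastaRecord` class, its field names, and the `canonical_order` function are hypothetical, and SHA-256 is just one possible choice of tie-breaking hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FastaRecord:
    # Hypothetical record type; field names are assumptions for illustration.
    index: int   # position of the record, inherited by path ids
    seq_id: str  # fasta record id
    seq: str     # sequence content


def canonical_order(records):
    """Return records sorted by (id, sequence hash), with indices re-assigned."""

    def sort_key(record):
        # The sequence digest breaks ties between records sharing the same id.
        digest = hashlib.sha256(record.seq.encode("utf-8")).hexdigest()
        return (record.seq_id, digest)

    ordered = sorted(records, key=sort_key)
    # Re-assign the index so that downstream ids follow the canonical order.
    return [
        FastaRecord(index=i, seq_id=r.seq_id, seq=r.seq)
        for i, r in enumerate(ordered)
    ]
```

With this kind of canonicalization, two input files containing the same records in different order produce the same record list, at the cost of discarding the original input order.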
Second progress update:
It might also be worth discussing in the documentation the expected differences between graphs created from reshuffled sequences. These should mostly differ in blocks whose size is close to the threshold length, or whose divergence is close to the threshold divergence.
Currently the output of pangraph is deterministic given the same input file.
However, for two different input files with the same sequences but in different order, the output can still vary slightly.
This is probably because the order of mergers is determined by the guide tree, which is a balanced version of the neighbour-joining tree. Differences in branch order can change which pairs are merged.
If we want to make the output deterministic, irrespective of the order of sequences, we could:
(Thanks @mjohnpayne for pointing this out!)