Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

input sequences order dependence #139

Open
mmolari opened this issue Feb 13, 2025 · 2 comments
Open

input sequences order dependence #139

mmolari opened this issue Feb 13, 2025 · 2 comments

Comments

@mmolari
Copy link
Collaborator

mmolari commented Feb 13, 2025

Currently the output of pangraph is deterministic given the same input file.
However, for two different input files with the same sequences but in different order, the output can still vary slightly.

This should be due to the fact that the order of mergers is determined by the guide-tree, which is a balanced version of the neighbour-joining tree. Differences in branch order can cause differences in which pairs are merged.

If we want to make the output deterministic, irrespective of the order of sequences, we could:

(Thanks @mjohnpayne for pointing this out!)

@mmolari
Copy link
Collaborator Author

mmolari commented Feb 14, 2025

Small update report, writing it here for the record:

  • Yesterday I tried resolving the order here in the clade creation function, where we pass a left and right node. I added some form of hashing of the nodes that within the function can re-order left and right based on the hash. This should result in always the same merge tree.
  • However this was still not enough to get the same output file from two reshuffled input file. After looking more into it I think it’s due to the fact that the path_id is a progressive number that is always set equal to the order of the fasta record and changes depending on the order of input sequence.
  • this path_id is then used in the hashing to calculate the node id, thus changing order of inputs in the fasta also shuffles all the node ids.

An easy solution would be to reorder the fasta records here, based on the fasta record id (and maybe some hashing of the sequences to break ties in case of equal ids?). We would also have to re-assign the index property of the fasta records, so that path ids are inherited accordingly.

I tested it and this does guarantee always obtaining the same graph. We loose information on the order of sequences in the input fasta file, and when reconstructing the input fasta file from the graph we can no longer guarantee that sequences are in the same order.

@mmolari
Copy link
Collaborator Author

mmolari commented Feb 14, 2025

Second progress update:

  • reordering fasta files would require parsing all the input sequence before building the graph, which would make any attempt at streaming the sequences much harder.
  • however we could give the user the option. By introducing an optional --sort-input-sequences flag sequences might be reordered in lexicographical order of id, with ties resolved by lexicographical order of the sequences.

It might also be worth discussing in the documentation the expected difference between graphs created with reshuffled sequences. These should mostly differ in blocks whose size is close to the threshold length, or whose divergence is close to the threshold divergence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant