input sequences order dependence #139

mmolari · 2025-02-13T12:34:32Z

Currently the output of pangraph is deterministic given the same input file.
However, for two different input files with the same sequences but in different order, the output can still vary slightly.

This should be due to the fact that the order of mergers is determined by the guide-tree, which is a balanced version of the neighbour-joining tree. Differences in branch order can cause differences in which pairs are merged.

If we want to make the output deterministic, irrespective of the order of sequences, we could:

remove tree balancing. This was done to improve parallelism in tree traversal but it's not really used now.
For each internal node, deterministically decide which is the left and right child node depending on some properties of the sequence (i.e. hash)

(Thanks @mjohnpayne for pointing this out!)

mmolari · 2025-02-14T08:31:31Z

Small update report, writing it here for the record:

Yesterday I tried resolving the order here in the clade creation function, where we pass a left and right node. I added some form of hashing of the nodes that within the function can re-order left and right based on the hash. This should result in always the same merge tree.
However this was still not enough to get the same output file from two reshuffled input file. After looking more into it I think it’s due to the fact that the path_id is a progressive number that is always set equal to the order of the fasta record and changes depending on the order of input sequence.
this path_id is then used in the hashing to calculate the node id, thus changing order of inputs in the fasta also shuffles all the node ids.

An easy solution would be to reorder the fasta records here, based on the fasta record id (and maybe some hashing of the sequences to break ties in case of equal ids?). We would also have to re-assign the index property of the fasta records, so that path ids are inherited accordingly.

I tested it and this does guarantee always obtaining the same graph. We loose information on the order of sequences in the input fasta file, and when reconstructing the input fasta file from the graph we can no longer guarantee that sequences are in the same order.

mmolari · 2025-02-14T17:16:37Z

Second progress update:

reordering fasta files would require parsing all the input sequence before building the graph, which would make any attempt at streaming the sequences much harder.
however we could give the user the option. By introducing an optional --sort-input-sequences flag sequences might be reordered in lexicographical order of id, with ties resolved by lexicographical order of the sequences.

It might also be worth discussing in the documentation the expected difference between graphs created with reshuffled sequences. These should mostly differ in blocks whose size is close to the threshold length, or whose divergence is close to the threshold divergence.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

input sequences order dependence #139

input sequences order dependence #139

mmolari commented Feb 13, 2025

mmolari commented Feb 14, 2025

mmolari commented Feb 14, 2025

input sequences order dependence #139

input sequences order dependence #139

Comments

mmolari commented Feb 13, 2025

mmolari commented Feb 14, 2025

mmolari commented Feb 14, 2025